Why The ZFS Copy On Write File System Is Better Than A Journaling One

Article Referenced: Understanding ZFS storage and performance

CULT OF ZFS Shirts Available

Connecting With Us
---------------------------------------------------

Lawrence Systems Shirts and Swag
---------------------------------------------------

AFFILIATES & REFERRAL LINKS
---------------------------------------------------
Amazon Affiliate Store

UniFi Affiliate Link

All Of Our Affiliates that help us out and can get you discounts!

Gear we use on Kit

Use OfferCode LTSERVICES to get 5% off your order at

Digital Ocean Offer Code

HostiFi UniFi Cloud Hosting Service

Protect your privacy with a VPN from Private Internet Access

Patreon

⏱️ Timestamps ⏱️
00:00 File Systems Fundamentals
01:00 Journaling Systems
02:00 COW
03:30 Copy On Write Visuals
07:49 ZFS with Single Drives
Comments

I suppose I fit into the ZFS fanboy club. I have extensively tested failure modes of various file systems, especially in RAID / RAID-Z arrays and I can give you a rundown of what I found:
1. UFS - A simple power loss can hopelessly corrupt this. Good thing pfSense is switching away from it. As an extra kicker, when I took over the IT department at a facility where I saw pfSense fail to the point where it could not boot, I found that the UPSes were over 15 years old. I got those swapped out and pfSense configured to monitor the UPSes, and I haven't seen any more corruption issues. I also rebuilt the failed pfSense firewall to use ZFS before it was the default. There were other random corruption issues across multiple different pieces of hardware until those ancient UPSes were replaced, so I think when they clicked over to battery mode they were throwing out dirty power, and that may have also been why things corrupted.

2. ZFS - This is the closest thing to an indestructible file system ever made, and its vertical integration is unmatched. If you just have one drive and non-ECC RAM, while that isn't what is recommended for ZFS, ZFS is still going to do better than anything else out there in that scenario. Something important to note, above and beyond CoW and snapshotting with CoW, is the CRC blocks written for every commit. If the CRC doesn't match, the commit is considered invalid and rolled back. The place non-ECC RAM can hurt you is that an otherwise good commit may get rolled back because the CRC itself got corrupted, whereas a lesser file system will just write out the data and call it a day, with no CRC to say whether the data is good or not. Most file systems don't care. ZFS does care. When it comes to ZFS RAID-Z, it is better to use RAID-Z than, say, hardware RAID. To avoid getting into all of the nerdy technical bits on why this is the case, let's just say it is the only RAID I tested where I couldn't find a weak spot that would cause it to fail when there could have been a way to recover. Every other RAID I tested, and I tested many, had a weakness somewhere where the whole RAID goes bye-bye with one or fewer failed disks, or at least corrupts on power loss. Of course RAID-Z level 2 is going to be a lot harder to break than level 1. If you really care about your data, you will probably use RAID-Z level 2, maybe stick up to 8 drives in a RAID-Z level 2 vdev, and then just have a bunch of vdevs in your zpool. The one annoying thing about ZFS is trying to add more drives to a vdev. You just add more vdevs, which is kind of lame if, say, you have 5 drives in a RAID-Z level 2 zpool and you want to add 3 more drives to make it a single 8 drive RAID-Z level 2 array. Instead, the best you can do, at least historically, is keep the 5 disk RAID-Z level 2 vdev and add a separate 3 disk RAID-Z level 2 vdev to the zpool. If you just want to add one drive, say going from a 3 drive RAID-Z level 1 to a 4 drive RAID-Z level 1, forget it; it won't work.
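
For illustration, a rough sketch of that historical expansion path, with hypothetical pool and disk names:
    # create a 5-disk RAID-Z2 pool (names here are made up)
    zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
    # the vdev cannot (historically) be grown to 8 disks in place;
    # the supported route is adding a second vdev, which the pool stripes across
    zpool add tank raidz2 /dev/sdf /dev/sdg /dev/sdh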

The really cool thing with ZFS is the vertical integration. All you need is ZFS. No LVM. No LUKS. No partitions. Just ZFS talking straight to the disks. It actually works better this way, especially if you are using SSDs and you want TRIM support. Since I like to encrypt data at rest, I just create an encrypted dataset in ZFS and my data volume is encrypted. Easy peasy. None of this container-inside-of-container mess to deal with, which becomes an unwieldy problem, especially if you need to make something bigger or smaller, or you are hoping some TRIM command will make it down to your SSDs so they don't get slaughtered by write amplification. Actually, I just bit the bullet and use high endurance SSDs with RAID because lesser SSDs just get killed, but I don't do ZFS directly on SSD arrays yet. That is a future project for when I am ready to do more with booting off of ZFS arrays directly, as opposed to using ZFS arrays mostly just for data.
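
A minimal sketch of that vertical integration (pool, dataset, and device names are hypothetical):
    # whole disks, no partitions, LVM, or LUKS layers in between
    zpool create -o autotrim=on tank mirror /dev/sda /dev/sdb
    # native encryption is just a dataset property
    zfs create -o encryption=on -o keyformat=passphrase tank/secure
    # or run a one-off TRIM pass if autotrim is left off
    zpool trim tank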

ZFS arrays are easy to add and remove from a system.

3. BTRFS - This is more of a wannabe ZFS for Linux that does away with the CDDL license that causes Linus Torvalds to hate on ZFS. I use it and it does some good stuff, but it just is not as good or as feature-filled as ZFS. You can technically RAID with it, and at this it is the best GPL software RAID out there, but it has weaknesses. BTRFS needs tweaking and balancing and such to keep its performance from going completely in the toilet under heavy use. BTRFS can corrupt more easily than ZFS, though it is much harder to corrupt than any journaling file system. BTRFS can handle a reasonable amount of drive error / failure issues with RAID, as in: you catch the drive errors early on, swap out the bad drive, rebuild, and it is OK. You can even abuse BTRFS RAID some and, if you know your stuff, recover the array. However, start pushing it and BTRFS will just crash. You can end up with your array deeply hosed, either with really bad hard drives or with abuse in just a certain way. These exact same abuse tests ZFS passed with flying colors, so it can be done. In other words, BTRFS is pretty decent and better than anything else that is GPL, but the CDDL-licensed ZFS takes the crown in robustness and performance for this category of file system.

A place where BTRFS does have an advantage over ZFS is that you can more easily do migrations with it. You can go from no array to an array. You can go from a few drives in your array to more drives in the array. You can also append multiple drives and arrays together to make it bigger. ZFS historically has only done the last one, adding multiple arrays together, with no way to expand existing ones directly. Granted, this gap is closing in more recent versions of ZFS.
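
For example, a sketch of that BTRFS flexibility (mount point and device names are hypothetical):
    # grow an existing btrfs filesystem by adding another device
    btrfs device add /dev/sdd /mnt/data
    # spread existing data across all devices
    btrfs balance start /mnt/data
    # or convert the data/metadata profiles while rebalancing
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data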

BTRFS arrays are easy to add and remove from a system.

4. XFS - This is probably the most robust of the journaling file systems. You can still screw it up and you need a separate RAID mechanism if you are using RAID.

5. EXT4 - This is probably the most performant of the journaling file systems. Journaling file systems are a lot faster than CoW-based file systems. However, you will still get your data corrupted in power losses, and you don't get snapshots directly with it. Once you have snapshots, especially if you care about your data and want to back it up and such, you just can't go back to a journaling file system, so EXT4 just won't do. Not that it is a bad file system; it is just no good for a system where you care about your data and want to have good backups.

6. MD RAID - Software RAID that I consider a worse solution for storing your precious data than just a single drive. A simple power loss can cause write hole corruption and even master boot record corruption to the point where the array is unmountable.

7. Supercap-backed MegaRAID RAID - This is usually pretty good. If a disk breaks, you get an audible alarm, provided someone goes into the room where the card is and hears it. You can also set up monitoring to tell you if a disk broke. It is fast. It recovers from power loss well. It can rebuild from a drive failure quickly. Obviously RAID-6 is going to be more reliable than RAID-5.

When it comes to expansion, you can migrate your RAID easily, at least if you are not using RAID 50 or RAID 60. Really though, this higher-level aggregation works well with BTRFS and ZFS and can be done with LVM, so you would do that at a higher level rather than use the RAID controller for it. Anyway, if you end up with more than one RAID controller, appending / RAIDing the arrays together at the higher level may be the only way.

Where the MegaRAID controller will get you is removable arrays. It is just not designed for this. Also, say the controller breaks and you need to move the drives to another identical controller. If anything at all goes wrong, or you just don't use the right commands, the RAID controller will wipe out the drives and start over. You also can't mix and match different storage modes. Either the controller is in RAID mode or it is an HBA (host bus adapter) for all drives connected to it. It doesn't understand anything else. And RAID mode doesn't get you direct access to the drives, which will really mess you up if you try to make single-drive 'RAIDs' so ZFS can have access to individual drives: ZFS then doesn't know the health of the drives, and things can go really sideways when a drive malfunctions. With SSDs, TRIM is gone. 3ware would let you mix and match, which was really handy, but MegaRAID doesn't have a concept for this. It is just either running the RAID firmware or the HBA firmware and that is it.

ChaJ

"In the beginning Sun created the zpools, datasets and zvols. And the data was without form, and void; no checksum was to be found. And Sun said, Let there be Copy on Write: and there was Copy on Write. And Sun saw the CoW, that it was good..."

andarvidavohits

I started using ZFS to secure important data more than 10 years ago, and that data is still with me to this day. Never looked back, never looked at another file system, never looked at another off-the-shelf NAS system. I am a ZFS fanboy/cultist, if you like to call it that.

TanKianW

I do agree that Ars Technica's ZFS 101 article is really great. I read it a while ago, and I still think it's a must-read for anyone who wants to better understand the basics of ZFS.

nedam

Far too many tech people & business leaders fail to comprehend that DATA is the life-blood of their business. Protection of that data is goal#1 (or should be). To that end, the business should always be using the BEST tool(s) to store & protect that data!!

PeteMiller

The COW feature of ZFS is how Datto's backup tech works. Rather than synthetic backup rollups (reverse incrementals) like Veeam and other backup vendors use, they don't have to reconstruct the data first, thanks to the way they use ZFS/COW. It also makes their backup solution way less ubiquitous since you need their appliance, but it's pretty cool the way their tech works.

alienJIZ

I am a ZFS fanboy!

Thank you Tom. Your knowledge is INVALUABLE!!

dezznuzzinyomouth

Very nice video! It's amazing how many products out there tout their magic sauce, and then you watch the boot screen and see ZFS volumes loading. It's like pulling back the curtain on the wizard.

rolling_marbles

ZFS Fanboy here, I have been using ZFS since 2006 on Solaris when it first came out, and using it on Linux (Ubuntu) since 2016.
Having lost all data on several 4-drive NAS systems due to disk failures under RAID 5, I now only use RAID 1 ZFS mirrors for data on these small systems. What I lose in capacity I more than make up for in resilience, (read) speed, and self-healing capabilities. Daily, weekly, monthly, and yearly snapshots of volumes or zpools make data loss a thing of the past (just go looking in your snapshots for the lost file and copy it back out) at very low disk overhead. I then supplement this with additional off-host backups.
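
As a rough sketch of that snapshot-based recovery (dataset and snapshot names are hypothetical):
    # list the snapshots that exist for the dataset
    zfs list -t snapshot tank/home
    # pull the lost file straight out of the hidden .zfs directory, no rollback needed
    cp /tank/home/.zfs/snapshot/daily-2024-05-01/report.odt /tank/home/
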
Migration to larger disks and new zpools with zfs send/receive, while using some additional cabling, is again very simple. Once done, export your old zpool, swap the disks over, import the new one under the old name, and off you go again on the bigger disks.
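
A sketch of that migration, assuming hypothetical pool names (tank = old pool, bigtank = pool on the new disks):
    # replicate everything, including child datasets and their snapshots
    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -F bigtank
    # retire the old pool and bring the new one back under the old name
    zpool export tank
    zpool export bigtank
    zpool import bigtank tank
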
I have also auto-cloned Oracle databases to separate servers on SAN storage to refresh them on a daily basis: initial creation using zpool split, then a new snapshot and zfs send/receive to update every night.

zebethyal

I'm fully in support of the ZFS cult. We have it on freenas and probably a couple other boxes that my partner set up that I don't pay too much attention to. I'm a btrfs fan myself, because it's the default for my distro of choice and it sure seems like it works well.

IAmPattycakes

Thanks for this! I especially appreciated the explanation of how snapshots work in ZFS. I love to understand how stuff like this works before I use it, and this was a great "light bulb moment" for me.

GodmanchesterGoblin

You are doing so well at explaining technical concepts. I've been watching lots of your videos, even completely unrelated ones, and it's always been loads of fun.

hippodackl

Saying that snapshots take no space seems patently wrong, since the blocks that have been replaced remain locked/marked as used for the purpose of the snapshot. So while no space is allocated at the time of the snapshot, any additional data changes to the snapshotted data will occupy additional blocks, rather than freeing up blocks as new data is written.
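
To see this in practice, the space that old blocks pin down shows up in the snapshot accounting (dataset name is hypothetical):
    # USED is space held only by the snapshot; REFER is the data it references
    zfs list -t snapshot -o name,used,refer tank/home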

KentBunn

Great visualizations! ZFS is by far my filesystem of choice, except for cases where there are a lot of small writes (like a relational database). I have my VM boot drives on separate ZFS datasets, can do nightly incremental backups to a TrueNAS box in seconds, and keep the ability to roll the VMs back to any earlier date for as long as I want.
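
A minimal sketch of that kind of nightly incremental, assuming hypothetical dataset, snapshot, and host names:
    # send only the blocks that changed since last night's snapshot
    zfs snapshot tank/vms@nightly-2024-05-02
    zfs send -i tank/vms@nightly-2024-05-01 tank/vms@nightly-2024-05-02 | \
        ssh truenas zfs receive backup/vms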

haakoflo

I know it's a little late, but thank you for this vid.

semuhphor

I primarily like ZFS for bulk storage (BTRFS can't really compete here, in my opinion). As I wrote in my previous post, I like BTRFS for my workstation root filesystem.
Nothing prevents me from using both, so I do :)

Kulfaangaren

My Synology had a kernel panic and lost an entire BTRFS volume as a result (and yes, I am religious about my backup plan). That was enough to inspire me to do the homework and make the move to TrueNAS Scale to see how ZFS works out for me.

larzblast

There actually is some data correction even with a single-device CoW setup. HDDs and SSDs use ECC in their firmware when they write and read. If a single bit flips while your data is at rest and you then go to read it, the ECC can correct that, repair it, and re-write and/or relocate it. Depending on the level of ECC, maybe even more than one bit, I'm not sure. But as more time passes, you may end up with enough flipped bits in a sector that ECC can't correct it. That's when you get an actual read error. Filesystem checksums come into play at that point, or when there's a transport-layer error between the drive controller and the filesystem driver. Then the FS either errors out (with one copy of the data), or issues a write of the data copy that does match its checksum, hopefully fixing the bad data, provided the device is able to write good data.
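
For what it's worth, a scrub is how you make ZFS walk every block and exercise those checksums (pool name is hypothetical):
    zpool scrub tank
    # shows per-device read/write/checksum error counts and any repairs made
    zpool status -v tank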

glockguy

Are "a cult with data integrity" shirts coming to the store? I'll take 2!

rcdenis