Dear canonical: we don’t want or need ZFS

It is the late 1990s and the computer server world is dominated by enterprise UNIX operating systems, all competing with each other. Windows 2000 is not out yet, and Windows NT 4 is essentially a toy that lesser mortals run on the Intel PCs they laughingly call ‘servers’. Your company has a commercial UNIX and it’s called Solaris. Your UNIX is very popular and a leading platform. It does, however, have some major deficiencies when it comes to storage.

IRIX – a competing proprietary UNIX – has the fantastic XFS file system, which vastly outperforms your own. Yours is still UFS (the “Unix File System”, originally developed in the early 1980s) and doesn’t even have journalling – that wouldn’t arrive until Solaris 7, in November 1998. IRIX had XFS baked in from 1994. IRIX also had a great volume manager, whereas Solaris’s SVM was generally regarded as terrible and was an add-on product that didn’t ship as part of Solaris itself until Solaris 8 in 2000.

It wasn’t just IRIX that was extraordinarily better in this area. AIX was ahead of both – JFS was released in 1990 and had file system features that Microsoft has only recently introduced with ReFS. JFS was a journalled file system – the first journalled file system included in an OS – and, as I mentioned above, it took until November 1998 for Sun to catch up. AIX had a “Logical Volume Manager” (LVM) implementation as well, which again was much better than Sun’s SVM.

This disparity between Solaris and the other commercial UNIX platforms did not, however, hold Solaris’s market share back as perhaps it should have. That is because customers running Solaris on big high-end servers simply would not use UFS, especially not between 1998 and 2005. They used VxFS instead – a third-party file system, but one originally developed at AT&T’s UNIX labs, one that was the core file system of another proprietary UNIX (HP-UX), and one that had journalling, was modern, and could actually compete with XFS and JFS. Of course customers had to buy it from Veritas, but that was a small price to pay for a decent file system and volume manager (yes, it came with an alternative volume manager too – Veritas Volume Manager).

So eventually Sun realised that storage was exploding in capacity and UFS just wasn’t up to the task. They also realised that VxFS wasn’t likely to be up to the task either, and with the growing threat of Linux and Windows Server a different solution was needed – a file system to fix all the problems with UFS and leapfrog the competition. As a young man I was fortunate to work at Sun Microsystems while this was happening; I got to meet the core ZFS developers and even work in the ZFS ‘development lab’, as I worked in the same building.

Sun had a problem though – they didn’t just need a new file system. Their RAID implementation (entirely in software – Sun servers never had hardware RAID) and their volume manager also needed to be replaced. So ZFS sought to replace all three of these components at once. Sadly it would take until 2006 for ZFS to reach production use on Solaris, and by then the battle for the enterprise operating system was already over. Linux and Windows had won, and the commercial UNIXes had lost; Intel had won the hardware war, and the commercial UNIX vendors had lost that too. Perhaps file systems weren’t as important as Sun had thought.

ZFS is a child of the 1990s commercial UNIX systems. It is an incredibly complicated file system that manages the entire storage stack, from the disk spindles all the way up to the file system exposed to applications. It can manage vast quantities of spindles and scale to capacities measured in exabytes and beyond. It is, however, still very much a product of 1990s Sun thinking – a file system for large, beefy, all-in-one servers running hundreds of applications. The world, however, had moved on whilst Sun wasn’t watching.

By 2006 the dominant server platform was the 1U or 2U Intel server running Linux or Windows Server 2003 – servers that almost universally shipped with hardware RAID controllers. High-end SAN storage arrays were king in the enterprise, and ZFS wasn’t built for them at all – ZFS was designed to manage the disks directly, making it a great platform for building a SAN storage array itself! Except it wasn’t, because ZFS was still designed with a 1990s mindset. It has no clustering support; it’s a local file system designed for a single Solaris host to use.

The goal of ZFS was to let Solaris compete and to address vast swathes of storage that UFS and the competing file systems could not. However, by 2006, when ZFS was finally released, the other file systems had caught up. They had evolved to scale to the available storage. For a short while everybody talked about how Linux needed ZFS, how Mac OS X needed ZFS, and how ZFS might even turn up in Windows. Ten years after ZFS was launched, none of those things has turned out to be true.

Even more frustrating for ZFS fans is that today the dominant computing model is virtual machines and containers: lots of relatively small operating system instances working together on relatively small data sets. ZFS makes very little sense in this environment.

Proponents of ZFS on Linux and elsewhere said that ZFS was required because it was revolutionary and much better than what Linux had. In some cases this was true, but in the important cases it was not. Linux was then, and still is, mostly run on hardware RAID; it had simple and reliable software RAID, a performant and straightforward volume manager (LVM), and a range of file system choices that scaled to the storage available then and now. Linux was gifted both XFS and JFS by Solaris’s rivals, and both continued to develop – XFS particularly so.

Linux did lack some important features of ZFS – namely efficient snapshots and data checksumming. Ten years later we can clearly see that these gaps did not prevent the adoption of Linux, and ZFS did not in any way save Solaris – Solaris is dying slowly in private, away from the eyes of the press. Linux won, despite not having ZFS (or DTrace).

So what about today – does Linux need ZFS? Canonical thinks it does, and thinks ZFS is exciting technology – more exciting than we’ve seen in “a long time”[1]. Except it really isn’t. These are the same arguments we heard 10 years ago, and yet ZFS is even less relevant today than it was a decade ago. Canonical tried to justify ZFS with a series of ‘killer’ features:

  • ‘snapshots’
    Linux already has copy-on-write snapshots via LVM thin-provisioned pools; they are in production and supported in RHEL. What’s more, they work with most Linux file systems – you can choose whichever you like. If you prefer, you can drop LVM and use btrfs, which supports snapshots in the same way. So no, sorry Canonical, this is not a killer feature of ZFS.
  • ‘copy-on-write cloning’
    ZFS clones are just writable snapshots: ZFS takes a snapshot and then, via copy-on-write, presents it as a writable clone. Well, shucks, Linux’s LVM supports this as well and has done for years – it is also a CoW-based system (see the sketch after this list). Oh, and btrfs does this too. This isn’t a killer feature of ZFS either.
  • ‘continuous integrity checking against data corruption’
    XFS has metadata integrity checking (metadata only, not data) too, and btrfs has full data integrity checking against corruption. So, no, ZFS can’t claim this as a killer feature that others don’t have. It doesn’t matter much anyway – this continuous integrity checking means nothing if you’re using ZFS on a hardware RAID controller or against an enterprise (or non-enterprise) storage array. It only provides its guarantees if you let ZFS manage the spindles directly. That is a product of 1990s thinking about how storage would be attached to, or baked into, Sun’s servers. Besides, when was the last time you got data corruption? What problem is this trying to solve? I’ve never felt that Linux needs this feature – have you? This isn’t a killer feature.
  • ‘automatic repair’
    Whilst it is true that ZFS never has to run a horrible NTFS-style chkdsk process, or a horrible ext3-style fsck either, other file systems have progressed in this regard too. XFS has a similar automatic repair function, never runs fsck at boot (there is no XFS fsck!), and has an xfs_repair tool that almost nobody ever has to use. It’s also worth pointing out that ZFS does need non-automatic repairs sometimes – in fact, I’ve had to do them a lot when running ZFS in production. ZFS scrubs are… not fun, ridiculously slow, and ZFS can lose files just like any other file system; I found this in production multiple times. Oh, and btrfs supports ‘automatic repair’ too. This isn’t a killer feature.
  • ‘efficient data compression’
    I think this is the only feature on Canonical’s list with any merit, but I cannot call it a killer feature. Work is ongoing to add compression to ext4, though nobody seems to care much about finishing it, and if you really want compression it’s baked into btrfs on Linux. So no, Canonical, this is not a ‘killer’ feature.
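
Since I keep pointing at LVM and btrfs as existing answers on the snapshot and clone points, here is a tiny, purely illustrative Python sketch of why copy-on-write snapshots and clones are cheap. It is a toy model of the general CoW idea – it is not LVM’s, btrfs’s or ZFS’s actual implementation, and all of the names in it are mine.

```python
# Toy model of copy-on-write snapshots and writable clones.
# Illustrative only: real systems (LVM thin pools, btrfs, ZFS) track blocks
# in far more sophisticated ways, but the core trick is the same.

class BlockStore:
    """A shared pool of immutable data blocks, addressed by integer id."""
    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.next_id = 0

    def put(self, data: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        return bid

class Volume:
    """A volume is just a mapping of logical block number -> block id."""
    def __init__(self, store: BlockStore, table=None):
        self.store = store
        self.table = dict(table or {})

    def write(self, lbn: int, data: bytes):
        # CoW: never overwrite a shared block; allocate a new one and
        # repoint only this volume's mapping at it.
        self.table[lbn] = self.store.put(data)

    def read(self, lbn: int) -> bytes:
        return self.store.blocks[self.table[lbn]]

    def snapshot(self) -> "Volume":
        # A snapshot (or writable clone) copies only the block map;
        # no data blocks are copied until somebody writes.
        return Volume(self.store, self.table)

store = BlockStore()
vol = Volume(store)
vol.write(0, b"original data")

snap = vol.snapshot()          # cheap: shares block 0 with vol
clone = vol.snapshot()         # a 'clone' is just a snapshot you keep writing to
clone.write(0, b"diverged")    # new block allocated; vol and snap are untouched

assert vol.read(0) == b"original data"
assert snap.read(0) == b"original data"
assert clone.read(0) == b"diverged"
```

Whether the mapping lives in LVM thin-pool metadata, a btrfs B-tree or ZFS block pointers, the principle is the same – which is exactly why none of them has to copy any data to take a snapshot or a clone.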

ZFS – and sadly btrfs – are both rooted in a 1990s monolithic model of servers and storage. btrfs hasn’t caught on in Linux for a variety of reasons, but most of all because it simply isn’t needed. XFS runs rings around both in terms of performance and scales to massive volume sizes, and LVM complements XFS by adding CoW snapshots and clones – and even clustering, if you want it. I believe the interesting direction in file systems is actually things like Gluster and Ceph – file systems designed with the future in mind, rather than for a server model we’re not running any more.

Canonical are targeting ZFS support at containers, saying that it’s the perfect fit. The irony is that containers don’t need ZFS. Red Hat uses an LVM/device-mapper CoW-based approach. CoreOS has switched away from btrfs (ZFS-style snapshots!) to overlayfs on ext4 – and apparently performance was much better. Docker can use OverlayFS as well and recommends against using ZFS.

Of course, Ubuntu suffers from NIH syndrome, so it isn’t using Docker, Atomic, rkt, etc. – it has created its own container technology, LXD. Perhaps the choice of OpenZFS doesn’t matter then: if you’re planning on using LXD (and thus ZFS) you’re ignoring the mature, established container technologies and picking a platform that is almost certainly going to be poorly supported going forward.

17 thoughts on “Dear canonical: we don’t want or need ZFS”

  1. > ZFS scrubs are… not fun, ridiculously slow, and ZFS can lose files just like any other file system
    I think they’re awesome. I have not lost a single file in almost a decade thanks to ZFS. Scrubs run in the background, you walk away, the server continues to work, and after a period of time they finish. Personally, I would describe 200 MB/s scrub throughput as blazing fast, especially as it’s running in the background. Waiting for an fsck to finish before anything works is what’s not fun. If you run RAID-Z2 directly on the hard drives you will not lose data.

    > this continuous integrity checking means nothing if you’re using ZFS on a hardware RAID controller
    That’s kind of like saying a car’s airbag means nothing if you drive it into the Grand Canyon. The whole point of ZFS is that it obsoletes hardware RAID. There have been hundreds of posts on the ZoL and OpenSolaris mailing lists about staying away from hardware RAID, or about how to replace the firmware on drive controller cards to make them non-RAID. This is why you think ZFS is so slow: you’re blaming ZFS for your hardware RAID controller, which is just getting in the way and corrupting your files. The CPU in your server that ZFS uses is much faster.

    IRIX was the most god-awful OS ever. IRIX was absolutely inferior to Windows NT 3.51 – at least NT on DEC Alpha. IRIX typically crashed a dozen times a day. I guess you could blame the $90K SGI Indigo Extreme II, but other than getting a model with less RAM, or waiting until the Indy came out a few years later, this was the cheapest machine to run IRIX on. The biggest thing SGI/IRIX machines had going for them was that they were beautiful, not XFS.

    Maybe 8 years ago I was using Solaris solely to get ZFS. While Solaris was an OK OS, I was more than thrilled to switch over to Linux about 5 years ago (KQ Infotech did the first Linux port) for ZFS. Two years ago I switched my ZFS server needs over to FreeBSD. My point is that ZFS is so amazing that I pick the OS based on how well it runs ZFS. I can use any UNIX OS, but I can’t live without ZFS.

    There’s a reason why the Sequoia supercomputer (originally the fastest computer in the world, now I believe #4) uses ZFS.

  2. Wow. Just wow. The ignorance in this post is astonishing.

    I *have* seen data corruption, on SSDs no less, that was caught and saved by ZFS checksumming. And it’s not just that ZFS checksums, but *how* it checksums that is so important. (Disks checksum too, but their method means that they can return “correct” data from the wrong block, giving you the right answer to the wrong question!)
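
    Here is a toy sketch of what I mean by that last point – not real ZFS code, just my own illustration of why keeping the checksum in the *parent* block pointer matters (all the names are made up):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two on-disk blocks, each carrying its own self-describing checksum,
# the way a drive or a per-block scheme would store it.
disk = {
    100: {"data": b"block A", "self_sum": checksum(b"block A")},
    200: {"data": b"block B", "self_sum": checksum(b"block B")},
}

# A ZFS-style block pointer keeps the *expected* checksum in the parent,
# next to the address where the child block is supposed to live.
pointer_to_A = {"addr": 100, "expected_sum": checksum(b"block A")}

def misdirected_read(addr: int):
    """Simulate a firmware bug: returns an intact block... from the wrong place."""
    return disk[200]   # ignores addr on purpose, to model the misdirection

block = misdirected_read(pointer_to_A["addr"])

# The block's own checksum verifies fine: the data is internally consistent.
assert checksum(block["data"]) == block["self_sum"]

# The parent-pointer checksum catches it: this is not the block we asked for.
assert checksum(block["data"]) != pointer_to_A["expected_sum"]
```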

    Traditional parity-based RAID systems are subject to something known as the RAID-5 write hole. Look it up. ZFS’s approach of merging the filesystem and volume management makes it more intelligent, and it is far more resilient.

    Compression *is* a killer feature, not just because it saves disk, but because it *improves performance*. The CPU cost of LZJB or LZ4 is typically less than the I/O cost it saves. (The only time not to use it is when working with data that is already compressed.)
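
    A rough back-of-the-envelope illustration of that trade-off (the figures below are my assumptions, not benchmarks):

```python
# Assumed, illustrative figures: a disk doing ~200 MB/s sequential reads,
# LZ4 decompressing at ~2000 MB/s on one core, and a 2x compression ratio.
disk_mb_s = 200.0
lz4_decompress_mb_s = 2000.0
ratio = 2.0

logical_mb = 1024.0   # read 1 GiB of logical (uncompressed) data

uncompressed_s = logical_mb / disk_mb_s
compressed_s = (logical_mb / ratio) / disk_mb_s + logical_mb / lz4_decompress_mb_s

print(f"uncompressed: {uncompressed_s:.1f}s, compressed: {compressed_s:.1f}s")
# With these assumptions: ~5.1s vs ~3.1s – the CPU time spent decompressing
# is dwarfed by the disk I/O it avoids.
```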

    ZFS snapshots and ZFS streams allow for very elegant backup/restore, replication, and DR. Can your ext4 do that? Didn’t think so.

    If you’re losing data in ZFS scrubs, it’s indicative of one of a few things, all of which underscore why ZFS is so important. A) Bit rot. This is an attribute of physical media. Running a scrub lets you identify degradation in the media and correct it – but *only* if you have sufficient redundancy. B) If you haven’t run it recently, data sitting in storage can rot badly enough to be irreparable. Other filesystems don’t inform you of this; with ZFS at least you know that your data is lost. C) Lack of sufficient redundancy. If you have RAIDZ (any type), then your data will be *repaired* during a scrub. If you don’t have a second copy of the bits anywhere, nor sufficient redundancy in the error codes (ZFS can correct single-bit errors on its own, even absent redundancy), then ZFS cannot help you. You can’t fabricate correct data out of incorrect data. All of this points to the fact that if a ZFS scrub is “causing” lost data, you have operationally failed to use ZFS to properly protect your data – by not running scrubs often enough and by not using any redundancy in your pool. You cannot blame this on ZFS – in fact any other filesystem would actually perform worse here, because the data would still be lost, you just wouldn’t know about it! (Silent data corruption.)

    XFS is a journaled filesystem, and as such, yes, it means you don’t need to run fsck (or fsck is really cheap). It actually compares quite naturally against journaled UFS – which is something Sun had *before* ZFS. It also makes some bad mistakes – in the name of performance it puts the journal for data in the same disk block as the data, which means a failure that takes out the block takes out both the journal *and* the data. It uses a physical stripe width for use on RAID, which UFS can do too. XFS blows away the older ext2 (the other filesystem prevalent at that time) and non-journaled UFS, but it has nothing to remotely compare with ZFS.

    btrfs is a much closer comparison to ZFS – it was inspired by ZFS, and both are true CoW filesystems. Sadly, btrfs is simply still not mature enough. It took a lot of inspiration from ZFS, but frankly it lacks the investment to mature, and its developers’ unwillingness to break the abstraction boundary means there are things btrfs will *never* be able to do. Arguably the ongoing investment in ZFS means that btrfs will never catch up – ZFS has more mind share and a larger group of developers. The only thing btrfs has going for it that ZFS doesn’t is the absence of a license conflict preventing direct inclusion in the Linux kernel. In every other meaningful metric btrfs falls short. (I recall not that long ago when btrfs developers talked about “big” filesystems of about 30–40GB. I know of ZFS pools of ~1PB in production deployments. You can even buy storage appliances from Oracle that scale to 1.5PB for a single pool.)

    Now, ZFS has other features that folks often forget to mention. Hybrid pools mean you can use SSDs sparingly to greatly improve the performance of your system without having to go all-flash (see L2ARC and SLOG). The ZFS ARC algorithms mean you get much better cache utilization than a traditional buffer cache, particularly because the cache is resistant to the effects of the random odd thread that, e.g., accesses every file on the disk. (MFU *and* MRU based caching.)
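
    To show what I mean about scan resistance, here is a grossly simplified, purely illustrative two-list cache – it is not the real ARC algorithm (which adds ghost lists and adapts the split between the lists), just the core idea of tracking recency and frequency separately:

```python
from collections import OrderedDict

class ToyTwoListCache:
    """Items seen once live in 'recent'; items seen again live in 'frequent'.
    A one-off scan of cold data only churns 'recent', so the hot set survives."""
    def __init__(self, size: int):
        self.size = size
        self.recent = OrderedDict()    # seen exactly once
        self.frequent = OrderedDict()  # seen two or more times

    def access(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)   # still hot
        elif key in self.recent:
            del self.recent[key]
            self.frequent[key] = True        # promoted on second access
        else:
            self.recent[key] = True          # first sighting
        # Give each list at most half the cache (the real ARC adapts this split).
        while len(self.recent) > self.size // 2:
            self.recent.popitem(last=False)
        while len(self.frequent) > self.size // 2:
            self.frequent.popitem(last=False)

cache = ToyTwoListCache(size=8)
for _ in range(3):                       # a hot working set, touched repeatedly
    for k in ("a", "b", "c", "d"):
        cache.access(k)
for i in range(1000):                    # a thread reading every file once
    cache.access(f"scan-{i}")

assert set(cache.frequent) == {"a", "b", "c", "d"}   # hot set survives the scan
```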

    ZFS does have sharp edges (dedup!), but used correctly some of those can be killer features.
    (I’ve seen cases where dedup saved storage by factors of tens, and in one case hundreds – e.g. a dedup factor of 238x. Admittedly these were somewhat contrived virtualization cases.) ZFS is probably the single best backing store for virtualization, between its support for CoW snapshots and capabilities like dedup.

    Comparing cluster / distributed filesystems to ZFS (or btrfs or ext4) is like comparing airplanes and cars. They have totally different jobs, and you can do things in one that would be impossible in the other. In fact most cluster based filesystems use a local filesystem — often ZFS!! — as their local on-disk backing format.

    ZFS is not right for every use case, for sure. But for a vast number of them it is without peer.

    I cannot imagine trying to operate a modern cloud system without ZFS, that is for sure.

  3. “I cannot imagine trying to operate a modern cloud system without ZFS, that is for sure.” – and yet 99% of the cloud does so without issue. Thanks for your comments, but I think you’re too biased towards ZFS to have a fair opinion of it.

    • You could replace “I cannot imagine trying to operate a modern cloud system without ZFS, that is for sure.” with “I cannot imagine trying to operate a modern cloud system without ZFS or BTRFS, that is for sure.” and have an “unbiased” opinion.

      Frankly, I feel like you’re the one who’s biased – mixing hardware RAID with ZFS and then complaining about losing data? For real? And saying OS X doesn’t need ZFS, when Apple tried to replace the horrendous, aging HFS+ with it (and then gave up due to licensing, not technical reasons) and is now developing its own ZFS copycat, APFS, is pure blasphemy to say the least.

      Also, there’s nothing preventing you from using different filesystems on top of ZFS or BTRFS pools. ZFS and BTRFS make managing volumes not only more reliable (checksums are there to protect us from evil bit rot) but also way easier. LVM and hardware RAID are obsolete, IMHO.

      It seems like Qt vs GTK is happening again with ZFS and BTRFS. Neither BTRFS nor GTK would exist if the ZFS and Qt licenses had matched the GPL’s liking. This is far more political than technical.
      Sure, Canonical could have gone with BTRFS to please GPL purists, but why would they? They are a corporation after all; it only makes sense to use something that’s “done” and working rather than waste resources that could go elsewhere helping reinvent the wheel.

      I don’t think the point you made in your article is very valid. I could agree to disagree if you suggested BTRFS over ZFS, but what you said is basically walking backwards.

      Sorry if I sounded harsh, I really liked the background and how well written this article was.

      • I wish I could say something nice about btrfs here – but at this point in the game I think there are still too many limitations with it, and admittedly I’ve been unwilling to trust it in production, so my experience with it is limited. That said, I hear it’s gotten vastly better now. I think a few areas where ZFS still has it beat are architectural in nature (the raid5 write hole, hybrid storage, dedup), and thus will always be areas where btrfs falls short. The btrfs parity RAID support is reportedly still horribly broken, with “serious data-loss bugs in it” (https://btrfs.wiki.kernel.org/index.php/RAID56). So no, I don’t think I’ll be running btrfs in production at scale with any data I care about any time soon.

        That said, for boot disks or a home server (with media backed up somewhere else), btrfs looks better. Its ability to re-layout data (probably where some of those data-loss bugs lie!) looks very helpful for the home user. (End users with small arrays are the ones most likely to wish for the ability to increase redundancy or extend an existing array. ZFS demands that you configure your pool properly from the beginning, since you can’t change the RAID level of a pool afterwards. This is usually a non-issue for enterprise users.)

        The average home user probably doesn’t care about any of this, and can get by with ext4. Until they suffer data loss. (But that’s what Dropbox and Time Machine et al. are for.) Let’s face it, if people can live with Apple’s horribly broken HFS+, they can probably also get by with ext4. (In fact I do use HFS+ on my Macs, but I make sure that all of my key data is replicated somewhere in the cloud, because I *have* lost data due to drive failure on Macs, as recently as just a week ago.)

    • Btw, your estimate that “99%” of the cloud does so without issue is uninformed. The cloud providers I’ve worked at or with all use ZFS in various places in their org. The ones that don’t generally run some kind of distributed object store instead, and don’t rely on advanced “local filesystems” at all (settling for basic ext4, and using none of the more advanced layers like LVM). Even then, a fair number of those still use storage appliances that are based on ZFS or other very sophisticated and proprietary filesystems under the hood (e.g. WAFL on data served off NetApp, or EMC VNX, or whatever).

      • So you’re telling me that Microsoft, Google, Amazon, Linode, Digital Ocean all use ZFS for primary data storage?

        But you’ve argued my point: “The ones that don’t generally run some kind of distributed object store instead, and don’t rely on advanced “local filesystems” at all.”. That was my whole argument. That ZFS was solving a problem that by the time it was solved was no longer relevant.

  4. ‘efficient data compression’

    The only reason to use ZFS. Also, your “mature” container solutions only exist because LXC wasn’t mature enough to be usable by anything other than robots. Docker is deprecated now that the utilities exist for containers to be managed by the OS natively. I wish you could have seen this demo at the OpenStack summit in May. It would change your mind if you understood it.

    Plus, tacos. 🙂

  5. Show me a filesystem with checksums of all data (not just metadata) that’s as vetted as ZFS. This is an essential feature for any server. So you can complain, but please, give an alternative.

    There is the non-production-ready BTRFS. And… nothing else.

    • It is not an essential feature; if it were, every other file system would have added it, and we’d be hearing countless stories of data corruption. It is a solution looking for a problem.

  6. “Besides, when was the last time you got data corruption?” It’s probably already happened to you and you didn’t notice. I am incredibly disorganized, so when I try to consolidate duplicate copies of large WAV files and find that files which should be identical aren’t, I realize it’s because of data corruption.

    You say the only alternative that can provide corruption detection is Btrfs, which is “under heavy development”.

    Doesn’t sound like there’s any other solution yet.

    • This is a silly argument – “A bad thing is happening to you, you should care, but it isn’t affecting you at all, but you should care!” Do you not see the flaw in your logic? I’ve yet to find corruption in files stored on Linux file systems that occurred ‘silently’ (or loudly!).

  7. LMAO !! .. ZFS is used by more than 50% of the largest enterprise customers, including Google, Amazon, etc !
    ZFS Appliances are hands down MUCH faster than NetApp appliances, and offer direct block storage FC, Infiniband, 10gEther/iSCSI beyond NFS.
    You don’t realize that Sun Microsystems who invented NFS, invented ZFS within Solaris.. which also invented Containers (from BSD Jails), Java, and soooooo.. many of the best standards in use today.
    As a filesystem, ZFS is MUCH more.. also negating the need for volume management.
    But there’s much much more ! .. it also gave us the FIRST Hybrid Storage Pool technology, where you can automatically allocate memory, NVMe/SSD, and slow HDs to various portions of these storage pools, and ZFS auto-tunes it (via the ARC and L2ARC caches you can cache READ or WRITE data…).
    On top of that, there’s no limit to the filesystem size, AND a guarantee of NO corruption with RAID-Z.
    While rumors of Solaris’s demise are rampant, the truth is that Solaris is still being developed, and rolled out with S-next (S11.4) coming in 2018 (a continuous development model like Mac, MS, ..), which includes hundreds of FOSS packages (nearly ALL Linux FOSS runs on and comes with the Solaris distro!!). Oracle also stands by the OS, noting it will support Solaris until at LEAST 2034 !!
    With Solaris (the world’s best Enterprise grade, and most secure OS OOTB), you don’t need to reboot your boxes, or worry about panics (I have seen many a system running for > 5 YEARS without reboot !!).
    Linux is cute as a free tinker toy that requires a TON of integration/cobbling together dozens upon dozens of components to complete an OS, but WHY spend that much TIME and $$, then also for VMware virtualization $$$.. etc.. when it’s ALL FREE when you buy Solaris !???
    Answer: The lemmings stick to what they know and learned in college.. and just like people flocking to Macs and iPhones (still much inferior to Samsung/Android), trendy panaceas normally AREN’T the best solution, especially when it comes to Enterprise Class reliability, security and TCO to run an organization on !

    • I worked at Sun Microsystems, but thanks for the comment 😛
      Google uses their own storage technologies, but I’d be happy for you to cite evidence to suggest Google use ZFS widely (or at all).
      Amazon are famously reticent to release information on what powers their infrastructure, and as far as I am aware, they have never stated they use ZFS. Again I am happy for you to cite evidence.
      Please also cite evidence for saying ZFS appliances are ‘faster’ than NetApp. Either way I’m not sure why this is relevant since there are plenty more solutions in the marketplace which are scale-out solutions and active/active cluster solutions, which ZFS does not support.
      I’m also not sure what relevance containers or Java have to ZFS… but sure, I applaud Sun for creating Java. Containers-wise, claiming they invented containers is quite amusing to me.
      I’m happy Oracle will continue to support Solaris. It isn’t anywhere near as good as Linux any more, sadly, and it’s far too expensive for businesses compared to the alternatives, but it’s still a solid platform.

      However, to claim Linux is a ‘cute free tinker toy’ is quite amusing to me, I must say. Enterprises are built on Linux. Amazon and Google (and Oracle!) rely on Linux for all their mission-critical systems. Linux powers Android. It powers most stock exchanges. It powers most military warships. Anyway, there is little point in debating this given that in 2017 you unironically called Linux a toy OS.
      In ‘college’ I learnt Solaris. All the systems the university’s department of computing used ran Solaris. I used Sun Ray thin clients for most of my work. I then worked at Sun Microsystems. They didn’t teach Linux at all. What do you say to that? 😉
      P.S. I love the irony of you saying Android is better when Android is Linux 😉
