It’s official… Data deduplication has been added to ZFS (read the link if you’re new to data deduplication). Hats off to Jeff Bonwick and Bill Moore, who did a ton of the work, in addition to Mark Maybee, Matt Ahrens, Adam Leventhal, George Wilson and the entire ZFS team. The implementation is synchronous and block-level: data is deduplicated immediately as it is written. This is analogous to how Data Domain does it in its dedupe appliances.
What’s interesting about this is that dedupe will now be available for *free*, unless Oracle does something stupid. Sun’s implementation is complementary to the already-existing filesystem compression. I’m not sure how much of an issue this is yet, but the current iteration cannot take advantage of the SHA256 acceleration in the SPARC Niagara2 CPUs. Eventually we should see hardware acceleration implemented.
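To make "synchronous block-level" a bit more concrete, here is a minimal Python sketch of the general idea: each block is checksummed with SHA-256 at write time, and a block whose checksum is already known is not stored a second time. This is a toy model and not ZFS code. The `DedupStore` class, the 4-byte `BLOCK_SIZE` and the in-memory dictionaries are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny on purpose so the example is easy to follow


class DedupStore:
    """Toy in-memory store that deduplicates fixed-size blocks at write time."""

    def __init__(self):
        self.blocks = {}    # checksum -> block bytes (each unique block stored once)
        self.refcount = {}  # checksum -> number of references to that block

    def write(self, data):
        """Split data into blocks and return the checksums that point at them."""
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            key = hashlib.sha256(block).hexdigest()
            if key not in self.blocks:
                self.blocks[key] = block  # first copy: actually store it
            # Duplicate or not, record one more reference to the block.
            self.refcount[key] = self.refcount.get(key, 0) + 1
            pointers.append(key)
        return pointers

    def read(self, pointers):
        """Reassemble data from a list of block checksums."""
        return b"".join(self.blocks[k] for k in pointers)


store = DedupStore()
a = store.write(b"AAAABBBBAAAA")  # three blocks, two of them identical
b = store.write(b"BBBBCCCC")      # shares the BBBB block with the first write
print(len(store.blocks), "unique blocks for", len(a) + len(b), "block references")
```

Because the lookup happens inside `write`, the duplicate never hits storage at all, which is what separates this approach from a post-process dedupe that scans for duplicates later.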
When will it be available? It should be available in the OpenSolaris dev branches in the next couple of weeks, as the code was just committed to be part of snv_128. General availability in Solaris 10 will take a bit longer, until the next update happens.
For OpenSolaris, you change your repository and switch to the development branches – should be available to public in about 3-3.5 weeks time. Plenty of instructions on how to do this on the net and in this list. — James Lever on the zfs-discuss mailing list
How do I use it? If you haven’t built an OpenSolaris box before, you should start by looking at this great blog post. I wouldn’t get things rolling until dedupe is in the public release tree.
Ah, finally, the part you’ve really been waiting for.
If you have a storage pool named ‘tank’ and you want to use dedup, just type this:
zfs set dedup=on tank
Like all zfs properties, the ‘dedup’ property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis.
— Jeff Bonwick http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup#comments
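The inheritance rule Jeff describes can be sketched in a few lines of Python: a dataset uses its own local setting if it has one, otherwise the nearest ancestor's, otherwise the default. This is only a toy model of the lookup and not how ZFS implements it. The dataset names (`tank/scratch`, `tank/home/alice`) and the `effective_dedup` helper are hypothetical.

```python
def effective_dedup(dataset, local_settings, default="off"):
    """Return the dedup value a dataset ends up with under simple inheritance."""
    name = dataset
    while True:
        if name in local_settings:
            return local_settings[name]  # nearest explicit setting wins
        if "/" not in name:
            return default               # reached the pool with nothing set
        name = name.rsplit("/", 1)[0]    # step up to the parent dataset


# Roughly what `zfs set dedup=on tank` followed by
# `zfs set dedup=off tank/scratch` would leave behind as local settings.
local = {"tank": "on", "tank/scratch": "off"}

for ds in ("tank", "tank/home/alice", "tank/scratch", "tank/scratch/tmp"):
    print(ds, "->", effective_dedup(ds, local))
```

In this model everything under `tank` gets dedup except `tank/scratch` and its children, which is the per-dataset opt-out from the quote.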
What does this mean to me? Depends. For people who like to tinker, you can build your own NAS or iSCSI server with dedupe *and* compression turned on. Modern CPUs keep increasing in speed and can handle this. This is huge. Now, should you abandon commercial dedupe appliances that are shipping today? Not if you want a solution for production, as this won’t be officially supported until it’s rolled into the next Solaris update. For commercial dedupe technology vendors, this is another mark on the scorecard for the commoditization of dedupe.
What things do I need to be aware of? The bugs need to be worked out of this early on, so apply standard caution. READ JEFF’S BLOG POST FIRST!!! There is a verification feature; use it if you’re either worried about your data or using fletcher4 as the hashing algorithm to speed up dedupe performance (zfs set dedup=verify tank or zfs set dedup=fletcher4,verify tank).
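Why verification matters with a weaker checksum can be shown with a small Python sketch. It is a toy model, not ZFS code: `weak_checksum` is a deliberately poor stand-in (it is not fletcher4) chosen so that a collision is easy to produce, and `VerifyingStore` is a hypothetical class. With verification off, two different blocks with the same checksum get treated as one block. With verification on, the blocks are compared byte for byte before being shared.

```python
def weak_checksum(block):
    # Deliberately weak stand-in (NOT fletcher4): byte order is ignored,
    # so b"ab" and b"ba" collide.
    return sum(block) % 256


class VerifyingStore:
    """Toy block store that optionally byte-compares blocks on a checksum match."""

    def __init__(self, verify):
        self.verify = verify
        self.table = {}  # checksum -> list of distinct stored blocks

    def write(self, block):
        """Store a block and return a (checksum, index) pointer to the stored copy."""
        key = weak_checksum(block)
        candidates = self.table.setdefault(key, [])
        for idx, existing in enumerate(candidates):
            # Without verify, a checksum match alone is trusted.
            if not self.verify or existing == block:
                return (key, idx)
        candidates.append(block)  # genuinely new contents: store separately
        return (key, len(candidates) - 1)

    def read(self, pointer):
        key, idx = pointer
        return self.table[key][idx]


unsafe = VerifyingStore(verify=False)
unsafe.write(b"ab")
bad = unsafe.write(b"ba")   # collides with b"ab" and is wrongly shared

safe = VerifyingStore(verify=True)
safe.write(b"ab")
good = safe.write(b"ba")    # collision detected, block stored on its own

print(unsafe.read(bad), safe.read(good))
```

The cost of the safe path is an extra read and compare on every checksum match, which is the trade-off to weigh when pairing a fast checksum with verify.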