
I was wondering - with the current amount of abstraction and similar (sometimes redundant) metadata on almost everything - what percentage of the blocks on a standard desktop system would turn out to be duplicates?

I don't think it would be useful, I'm just interested in the level of "standard" data duplication.



Actually the btrfs email thread contained the answer (http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg0...):

"I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes."

It's a relatively small amount - roughly 0.3% of the data. Then again, that's 300 MB of exactly identical blocks... Unless they're manual backup files, that still looks like a big waste to me.
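For anyone curious about their own disk, a rough estimate like that takes only a few lines of Python (a sketch, not the tool from the thread - it hashes fixed-size blocks, which oversimplifies what btrfs or ZFS actually do, and the 128 KB block size is an arbitrary choice):

    # Rough dedup-savings estimate: hash every fixed-size block, count repeats.
    import hashlib, os, sys

    BLOCK = 128 * 1024  # assumed fixed block size

    def walk_blocks(root):
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, 'rb') as f:
                        while True:
                            chunk = f.read(BLOCK)
                            if not chunk:
                                break
                            yield hashlib.sha256(chunk).digest(), len(chunk)
                except OSError:
                    pass  # skip unreadable files

    seen, total, dupes = set(), 0, 0
    for digest, size in walk_blocks(sys.argv[1]):
        total += size
        if digest in seen:
            dupes += size      # a second copy of an identical block
        else:
            seen.add(digest)

    print("scanned %.1f GB, duplicate blocks: %.1f MB" % (total / 1e9, dupes / 1e6))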


Yup, that's about the same proportion I found when I recently tried copying my data across to a ZFS system with the dedup switched on.

I then decided to disable dedup, because it comes at a cost: the checksum data (which would mostly live on the SSD read cache I had attached) was taking up SSD space worth roughly three times as much money as the conventional disk space the duplicate data was occupying.

I noticed that the opendedup site (linked from the article) claims a much lower volume of checksum data relative to the number of files - perhaps an order of magnitude less than I observed with ZFS - but they seem to achieve that by using a fixed 128KB block size, which brings its own waste. (ZFS uses variable block sizes.) I haven't actually done the numbers, but I wouldn't be at all surprised to find that for my data the 128KB block size would cost as much disk space as dedup was saving me. (YMMV, of course.)
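Back-of-the-envelope version of that trade-off, using the 107 GB / 300 MB figures from the parent thread (the ~320 bytes per dedup-table entry and the 64 KB average record size are my assumptions, not measurements):

    # Dedup-table footprint vs. space reclaimed - rough numbers only.
    data_bytes  = 107e9     # data in the pool (figure from the thread above)
    saved_bytes = 300e6     # duplicate data found (figure from the thread above)
    avg_record  = 64e3      # assumed average ZFS record size; it varies in practice
    ddt_entry   = 320       # commonly quoted ballpark, bytes per dedup-table entry

    entries   = data_bytes / avg_record
    ddt_bytes = entries * ddt_entry
    print("~%.0f MB of dedup table to reclaim ~%.0f MB of disk"
          % (ddt_bytes / 1e6, saved_bytes / 1e6))
    # roughly 535 MB of (pricey) SSD-resident metadata against 300 MB of (cheap) disk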


Just curious - were you using the verify option? (Not related to your point, I realize.)

I'm puzzled why people in general aren't more worried about data corruption due to hash collisions...


As it happens, somewhere between first reading about ZFS dedup and finally trying it out, I seem to have forgotten that the verify option existed. I just did "set dedup=on" - beyond that, everything was whatever defaults you get on OpenIndiana build 148.

Were I to ponder that matter to any great depth, I suspect I'd find it rather difficult to get a handle on how concerned I ought to be about hash collision. Perhaps that's part of the answer to your puzzlement.
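For reference, the difference verify makes is roughly this (a Python sketch of the idea, not ZFS internals; write_block and store are made-up names): plain dedup trusts the hash match alone, while verify also compares the blocks byte for byte before sharing them.

    # Sketch of dedup with and without verify - illustrative only.
    import hashlib

    store = {}   # digest -> list of distinct block contents already written

    def write_block(block, verify=False):
        digest = hashlib.sha256(block).digest()
        candidates = store.setdefault(digest, [])
        for existing in candidates:
            if not verify or existing == block:
                return "deduplicated"   # trust the hash (or the byte-for-byte check)
        candidates.append(block)        # new block - or a collision that verify caught
        return "stored new block"

Without verify, a colliding block would silently be treated as "deduplicated" and its real contents lost.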


Because they believe the probability of that happening is statistically low.

What they don't accept is that it might hit them at the worst possible time, that a backup of that data will carry the same problem, that after an upgrade an important part of the new kernel might have the same hash as the beginning of Morissette's "Ironic" mp3 you already store, and so on.

But come on - it's only a small probability...
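For scale, the "statistically low" in question looks roughly like this (a birthday-bound sketch assuming a well-behaved 256-bit hash; it says nothing about buggy implementations or about the timing argument above):

    # Birthday-bound estimate of an accidental collision among N unique blocks.
    bits   = 256
    blocks = 1e9        # say, a billion unique blocks in the pool
    p = blocks**2 / (2.0 * 2**bits)
    print("collision probability ~ %.1e" % p)   # prints something like 4.3e-60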


I think this will be much more useful on large multi-user storage systems, e.g. the classic example of the 5 MB email attachment sent to 100 people.
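The arithmetic on that classic example (the 5 MB and 100 recipients are from the comment above, the rest assumes ideal dedup):

    # One attachment, many recipients: the easy win for dedup.
    attachment_mb = 5
    recipients    = 100
    without_dedup = attachment_mb * recipients   # 500 MB stored as separate copies
    with_dedup    = attachment_mb                # one copy plus 100 references
    saved = without_dedup - with_dedup
    print("saved %d MB (%.0f%%)" % (saved, 100.0 * saved / without_dedup))
    # saved 495 MB (99%)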




