dropbox presumably pays for s3 slightly less or slightly more (depending on redu...

encoderer · on April 24, 2012

AWS rates are all negotiable, and large users always negotiate. Doesn't mean you're far off, but that's a point worth considering.

boyd · on April 24, 2012

Well, they're doing pretty heavy de-duping, but point well taken.

rdl · on April 25, 2012

The savings from de-duping are certainly larger than the discount off list price, too -- I suspect they de-dupe 10-100x. I really doubt their discount off the lowest public S3 volume price is more than 50%.
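
To make the comparison concrete, here is a rough sketch of the arithmetic. The list price is normalized to 1.0 and the function name is made up for illustration; the 50% discount and the 10x ratio are just the guesses from this comment, not known figures.

```python
def cost_per_logical_gb(list_price: float, discount: float, dedupe_ratio: float) -> float:
    """Cost of one GB as seen by users, given a negotiated discount
    (0.0-1.0) and a dedupe ratio (logical bytes / physically stored bytes)."""
    if dedupe_ratio <= 0:
        raise ValueError("dedupe_ratio must be positive")
    return list_price * (1.0 - discount) / dedupe_ratio


# Hypothetical inputs: normalized list price of 1.0.
discount_only = cost_per_logical_gb(1.0, discount=0.5, dedupe_ratio=1.0)
dedupe_only = cost_per_logical_gb(1.0, discount=0.0, dedupe_ratio=10.0)

print("50% discount, no dedupe:", discount_only)
print("no discount, 10x dedupe:", dedupe_only)
```

Under these assumptions, even the low end of the dedupe guess cuts cost per user-visible GB far more than the largest discount I'd consider plausible.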

tlb · on April 25, 2012

Personal photos take up a large fraction of people's cloud storage, and they are completely non-de-dupable. I doubt it's possible to get more than 3x on average.
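
A back-of-the-envelope sketch of why the average is capped: if some fraction of all bytes is photos that don't dedupe at all, that fraction is a floor on what must be stored, however well everything else dedupes. The fractions and ratios below are hypothetical, not measured numbers.

```python
def overall_dedupe_ratio(photo_fraction: float, other_ratio: float) -> float:
    """Blended dedupe ratio when `photo_fraction` of logical bytes is
    non-dedupable (ratio 1) and the remainder dedupes at `other_ratio`."""
    if not 0.0 <= photo_fraction <= 1.0:
        raise ValueError("photo_fraction must be between 0 and 1")
    if other_ratio <= 0:
        raise ValueError("other_ratio must be positive")
    # Physical bytes stored per logical byte.
    stored = photo_fraction + (1.0 - photo_fraction) / other_ratio
    return 1.0 / stored


# Hypothetical mix: a third of all bytes are photos.
for other in (10.0, 100.0, 1e6):
    print(other, round(overall_dedupe_ratio(1 / 3, other), 3))
```

With a third of the bytes non-dedupable, the blended ratio approaches but never reaches 3x, even if the rest dedupes almost perfectly.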

rdl · on April 25, 2012

Ah. I use stuff like this for binaries, music/video content, and other highly dedupeable content, at least relative to file size. I specifically don't use it for photos or video since I have specific technology for that, and random small files don't move the needle.

I didn't realize people used it as a photo or video sharing solution -- that kind of content would be big and non-dedupeable.

josscrowcroft · on April 25, 2012

[citation needed]

apu · on April 25, 2012

For which part? That photos take most of the space? Or that they're virtually un-de-dupable?

I'm fairly sure about the former (having seen some private numbers that I'm not allowed to share) -- although you can think through it yourself. What other kind of data is as easily and commonly produced as photos/videos and takes up so much space?

The latter is definitely true since it's a hard research problem that I've spent some time thinking about myself. There are approaches to lossy-de-duplication of photos that can achieve some significant savings, but the quality loss is too great to be useful at the moment.

andyking · on April 25, 2012

What a fascinating problem, especially when it comes to personal photos that aren't really that 'personal.'

A thousand people will all visit Paris today, and all take a picture of the Eiffel Tower, and all upload said picture to their favourite cloud storage platform.

Is there any reason why we need a thousand, ever so slightly different, pictures of the Eiffel Tower from one day in April, stored in the cloud for eternity? Would people even notice if their images were quietly 'de-duplicated'?

cscheid · on April 25, 2012

Reminds me of an art installation I saw somewhere with the "21st century camera" (can't remember if that was the actual name). It was a black box with a single red button, which when pressed captured the lat-long of the box and the current time, so you could search on picasa, flickr etc for geo-tagged pictures around that time. Fascinating to think about.

GiraffeNecktie · on April 25, 2012

"Would people even notice if their images were quietly 'de-duplicated'?"

If my wife's face does not appear next to the Eiffel Tower in the de-duplicated version I might be somewhat concerned.

Also photos are, at least in their highest form, an emotional response to light. One thousand good photographers shooting the Eiffel Tower on the same day and at the same hour will probably generate 5,000 or more unique and interesting shots.

acdha · on April 25, 2012

Music: many people seem to use Dropbox to sync iTunes libraries (granted, iTunes Match is probably eating into that) but I'd easily believe that there are many large, duplicate media files from the same stores or torrent sites.

siculars · on April 25, 2012

That's actually not true at all. There are de-dup'ing techniques that do not simply rely on a sha1 of your entire file.
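
As one example of what I mean, here is a minimal sketch of block-level de-duping: hash fixed-size blocks instead of the whole file, and store each distinct block once. The block size, function names, and in-memory dict are all made up for illustration; this is not a claim about how Dropbox or anyone else actually does it, and real systems often use content-defined chunk boundaries rather than fixed offsets.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the demo is readable; real blocks would be far larger


def store_file(data: bytes, block_store: dict) -> list:
    """Split data into fixed-size blocks, store each unique block once,
    and return the list of block digests needed to rebuild the file."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # no-op if the block is already stored
        recipe.append(digest)
    return recipe


def load_file(recipe: list, block_store: dict) -> bytes:
    """Rebuild a file from its list of block digests."""
    return b"".join(block_store[digest] for digest in recipe)


store = {}
file_a = b"AAAABBBBCCCC"
file_b = b"AAAABBBBDDDD"  # differs from file_a only in its last block

recipe_a = store_file(file_a, store)
recipe_b = store_file(file_b, store)

# The whole-file hashes differ, so file-level de-duping saves nothing here,
# but the two files share their first two blocks in the block store.
print(len(recipe_a) + len(recipe_b), "blocks referenced,", len(store), "blocks stored")
```

Note that fixed offsets break down as soon as a byte is inserted near the start of a file, which is why variable, content-defined boundaries are the usual refinement.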

tlb · on April 25, 2012

Can you point me to a de-duping algorithm that works well for personal photos?

j_s · on April 25, 2012

I'll admit to not paying much attention, but they might count files shared between personal accounts against both accounts' quotas... there's certainly a huge incentive for them to do so on the free accounts. Meta-de-dupe!