Looks really, REALLY similar to this article: http://knol.google.com/k/simple-s...

yakto · on Feb 14, 2011

"Suppose you have a huge number of items that you would like to group together by a fuzzy notion of similarity. Suppose the only tool available to you is a key-value store. Suppose you only have the resources to consider each object once. Never fear, simhashing is here!"

yakto · on Feb 14, 2011

"It's been raining all day and you find yourself with a sufficiently large pile of items (tweets, blog posts, cat pictures) and a key-value database. New items are arriving every minute and you'd really like a way of finding similar items that already exist in your dataset (either for duplication detection or finding related items). Clearly we don't want to scan our entire existing database of items every time we receive a new item but how do we avoid doing so? Minhashing to the rescue!"

sadiq · on Feb 14, 2011

Heh, you're right, it does look pretty similar. I wasn't aware of that article until someone posted it on Disqus comments just now.

The article came out of my reading the data mining slides from Ullman's Stanford course, so maybe there's common terminology? I know I tried to keep it consistent with the Wikipedia article on the Jaccard index.

hhjj · on Feb 14, 2011

How Jaccard-similar?
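
For anyone who wants to put a number on that, here is a rough sketch of how the two quoted openings could be compared. It computes the exact Jaccard index over word sets and a MinHash estimate of the same value. All names here (`shingles`, `jaccard`, `minhash_signature`, `estimate_jaccard`) are made up for illustration, and the tokenization, the seeded BLAKE2b hashing, and the 128-hash signature length are assumptions, not anything taken from either article.

```python
import hashlib
import re


def shingles(text, k=1):
    """Return the set of k-word shingles in text, lowercased (assumed tokenization)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, item):
    """64-bit hash of item under a given seed (seed is mixed in as a prefix)."""
    data = str(seed).encode() + b":" + item.encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value per seeded hash function; items must be non-empty."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, which estimates the Jaccard index."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    first = "Suppose you have a huge number of items that you would like to group together"
    second = "you find yourself with a sufficiently large pile of items and a key-value database"
    a, b = shingles(first), shingles(second)
    print("exact:", jaccard(a, b))
    print("minhash estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The signatures are what you would store in the key-value database: comparing two fixed-length signatures is much cheaper than comparing the full word sets, at the cost of some estimation error that shrinks as `num_hashes` grows.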