Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

In practice there is a certain kind of minhashing that is used in key-value stores, which is called a Bloom-filter.

Basically it is a probabilistic bitmap index for quickly telling if a certain elem is in a set. It is used in BigTable, Hbase etc.

http://en.wikipedia.org/wiki/Bloom_filter

The Jaccard similarity (or Jaccard index) mentioned in the article is used for finding similar items in a big set. E.g. Google News uses it for aggregating articles on the same subject.



Indeed, bloom filters work but, as with all things, do have limitations in practice. You have to make specific choices as to which type of bloom filter you will implement. You have to optimize for variables like n, m or k or you quickly begin to lose confidence in your responses. Also need to optimize for space vs speed.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: