Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Nobody does this in an N^2 way. See simhash, minhash, etc

Trivial faster-than-N^2 algorithm:

1. Compute simhash of each issue - O(1) in N, the number of issues.

2. Sort the issues by hamming distance of simhash (N log N in the number of issues)

3. Pick the K issues before or after the current issue in the results (O(1) in the number of issues)

You can precompute/incrementally update any of these steps as issues are added. This would make find duplicates itself O(1) because it would just be step 3.

If you put this in term of size of text in the issues instead of number (which is what you used) it changes the time bounds, but, for example, it's not any worse than sorting/comparing the issues as text strings.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: