Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added.

other way round? bloom filter provides a low probability of erroneously believing a URL had already been added when it had not, zero probability of believing a URL had not already been added to the filter when in fact it had.

using a bloom filter in this way guarantees you won't ever hit a page twice, but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.



> but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.

My (not-very fresh) memory of what bloom filter actually is tells me that this "non-zero rate" you're talking about must be HUUUUGE. In order of millions of pages. Am I right?


You're right if OP was using a small bloom filter, and wrong if it was a big one. Hence, the phrase, "depending how you tune it."


You can construct a bloom filter with an arbitrarily low error rate.


It's not technically wrong: 0 is indeed a "low probability". It's misleading, though.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: