> I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added.
other way round? bloom filter provides a low probability of erroneously believing a URL had already been added when it had not, zero probability of believing a URL had not already been added to the filter when in fact it had.
using a bloom filter in this way guarantees you won't ever hit a page twice, but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.
> but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.
My (not-very fresh) memory of what bloom filter actually is tells me that this "non-zero rate" you're talking about must be HUUUUGE. In order of millions of pages. Am I right?
other way round? bloom filter provides a low probability of erroneously believing a URL had already been added when it had not, zero probability of believing a URL had not already been added to the filter when in fact it had.
using a bloom filter in this way guarantees you won't ever hit a page twice, but you'll have a non-zero rate of pages you think you've downloaded but you actually haven't, depending how you tune it.