My initial plan, before I started coding, was to use shingling (essentially a type of fingerprinting) to reduce the near-duplicate problem. My early test crawls were straight-up breadth-first crawls, and they suggested that de-duplication and spam elimination were going to be significant problems. However, after I switched to a domain whitelist and intra-domain crawling, my informal tests suggested that both problems were much less severe, so I decided to leave shingling to a later iteration.
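To give a sense of what shingling involves, here's a minimal sketch (not code from my crawler) of the basic idea: hash overlapping runs of words from each page, and treat two pages as near-duplicates when their shingle sets have high Jaccard overlap. The shingle size and the use of MD5 here are just illustrative choices.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of hashed k-word shingles for a document."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

def resemblance(text_a, text_b, k=5):
    """Jaccard similarity of the two documents' shingle sets."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not (a or b):
        return 1.0
    return len(a & b) / len(a | b)
```

Two pages whose resemblance exceeds some threshold (0.9, say) would then be collapsed into one; in practice you'd compare sketches of the shingle sets rather than the full sets, but the sets are enough to convey the idea.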
Something I didn't deal with at all was spider traps. My thinking there was that this was a relatively shallow crawl, so the dangers of getting badly stuck weren't too great. For deeper crawls, dealing with spider traps would be essential.
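For a deeper crawl, even crude heuristics help. The sketch below (again, not something my crawler does) shows the kind of safeguard I have in mind: cap the crawl depth and URL length, and skip URLs whose paths repeat the same segment over and over. All the thresholds are arbitrary placeholders.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 5         # hypothetical cap for a shallow crawl
MAX_URL_LENGTH = 256  # very long URLs are a common symptom of traps

def looks_like_trap(url, depth):
    """Crude heuristics for skipping URLs that are likely spider traps."""
    if depth > MAX_DEPTH or len(url) > MAX_URL_LENGTH:
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    # A path segment repeated many times (e.g. /a/b/a/b/a/b/...) is a red flag.
    if segments and Counter(segments).most_common(1)[0][1] > 3:
        return True
    return False
```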