What would be more useful to you, the raw data (meaning, for each page, a list of the keywords on it) or the reverse-word-index?
The raw data may be better for batch processing or for running multiple queries at the same time.
My crawler currently outputs about 40-45 GB of raw data per day (about 30 million pages). The full crawl will be 2 billion pages, updated every 2-3 months.
The reverse-word-index would be about 18 GB per day for the same number of pages.
The reverse-word-index is already compressed; the raw data isn't.
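To make the distinction concrete, here is a minimal sketch (in Python, with made-up URLs and keywords) of the two shapes of data: the raw form maps each page to the keywords on it, while the reverse-word-index maps each keyword back to the pages containing it. The names and structures below are illustrative assumptions, not the crawler's actual on-disk format.

```python
from collections import defaultdict

# Hypothetical raw-data form: one record per crawled page,
# listing the keywords found on that page.
raw_data = {
    "http://example.com/a": ["crawler", "index", "pages"],
    "http://example.com/b": ["index", "compression"],
}

def build_reverse_word_index(pages):
    """Invert page -> keywords into keyword -> pages."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for word in keywords:
            index[word].add(url)
    return index

reverse_word_index = build_reverse_word_index(raw_data)
# e.g. reverse_word_index["index"] == {"http://example.com/a", "http://example.com/b"}
```

The raw form is what you'd scan for batch jobs over whole pages; the inverted form is what you'd query when you want all pages for a given word.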
There is a small problem with the crawl though, as it does not always handle non-ASCII characters on pages correctly. I'm working on that.
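For what it's worth, one common approach to that (just a sketch in Python, assuming the crawler has access to the raw response bytes and the Content-Type header; this is not the actual fix in progress) is to honor the declared charset and fall back to UTF-8 with replacement characters when it's missing or wrong:

```python
def decode_page(body, content_type=None):
    """Decode raw page bytes to text, tolerating missing or bogus charsets.

    body: bytes fetched from the server
    content_type: value of the Content-Type header, if any
    """
    charset = "utf-8"
    if content_type and "charset=" in content_type:
        # e.g. "text/html; charset=ISO-8859-1"
        charset = content_type.split("charset=")[-1].split(";")[0].strip().strip('"')
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown or misspelled charset name: fall back to UTF-8.
        return body.decode("utf-8", errors="replace")
```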
BTW: I also currently have a list of about 8.5 billion URLs from the crawl, about 600 GB uncompressed. These are the links found on the crawled pages. Obviously not all of those will end up being crawled.