[lightly modified version of a comment I put on the article as I love HN for discussion!]
Great article -- we're excited there's so much interest in the web as a dataset! I'm part of the team at Common Crawl and thought I'd clarify some points in the article.
The most important is that you can download all the data that Common Crawl provides completely for free, without the need to pay S3 transfer fees or process it only in an EC2 cluster. You don't even need to have an Amazon account! Our crawl archive blog posts give full details for downloading[1]. The main challenge then is storing it, as the full dataset is really quite large, but a number of universities have pulled down a significant portion onto their local clusters.
Also, we're performing the crawl once a month now. The monthly crawl archives are between 35 and 70 terabytes compressed. As such, we've actually crawled and stored over a quarter of a petabyte compressed, or 1.3 petabytes uncompressed, so far in 2014. (The archives go back to 2008.)
Comparing directly against the Internet Archive datasets is a bit like comparing apples to oranges. They store images and other types of binary content as well, whilst Common Crawl aims primarily for HTML, which compresses better. Also, the numbers used for Internet Archive were for all of the crawls they've done, and in our case the numbers were for a single month's crawl.
We're excited to see Martin use one of our crawl archives in his work -- seeing these experiments come to life is the best part of working at Common Crawl! I can confirm that optimizations will help you lower that EC2 figure. We can process a fairly intensive MR job over a standard crawl archive in an afternoon for about $30. Big data on a small budget is a top priority for us!
Before Google's PageRank algorithm there was a lot of research into document search. A favorite of mine was 'scatter/gather'[1].
I often wanted the ability to filter and group when staring at pages of results from Red Hat JBoss support forums while trying to fix a dead JBoss cluster. But I quit that job, so I haven't had the need recently.
Edit: point being, it would be nice if someone came up with a service that implemented the ideas in this paper :)
Is there any chance we might see an updated index for the Common Crawl? I've tried using the dataset before, but I found it difficult given that you have to process the entire thing in order to find the particular pages you are looking for.
As an example, I was trying a project to look at the top news sites, like BBC, CNN, Al Jazeera, etc., then matching articles about the same news topic on each site, before finally fact-checking for differences between the stories (e.g. 20,000 homes were without power vs. 50,000).
That kind of project requires a load of crawling, but I can't look for specific pages without processing the entire CC set first.
I love the project though, so thank you for doing it!
So awesome to see that folks are discovering Common Crawl, it's a really great project and the amount of data they've crawled has been tremendous. If you're looking to get into machine learning, it's perhaps the most comprehensive dataset you can get your hands on.
There are two levels of optimization that come into play: the AWS setup and the choice of primary language. I'm always happy to speak about both, as we love seeing experiments run over the data!
AWS optimizations: For that level of cost efficiency, you really need to use spot instances. A cluster of 100 m1.xlarge machines (1.5TB of RAM, 400 cores, and 168TB of magnetic disk storage) will only cost you $3 per hour using spot instances, rather than $30 on-demand. You should pay on-demand prices for the Hadoop master however.
You'll also want to roll your own Hadoop cluster as opposed to using Elastic MapReduce (EMR). EMR is amazing but the cost overhead when using it on spot instances is ~100%.
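To make the savings concrete, here's a back-of-envelope sketch using the rough figures above ($3/h spot vs. $30/h on-demand for 100 m1.xlarge, and ~100% EMR overhead on spot). The prices are illustrative 2014-era assumptions, not current AWS pricing.

```python
# Back-of-envelope cost comparison for a 100-node Hadoop cluster,
# using the rough figures from this thread. All prices here are
# illustrative assumptions, not real or current AWS pricing.

NODES = 100
SPOT_PRICE = 0.03       # assumed $/hr per m1.xlarge on the spot market
ON_DEMAND_PRICE = 0.30  # assumed $/hr per m1.xlarge on demand
EMR_OVERHEAD = 1.0      # assumed EMR surcharge, ~100% of the spot price

def cluster_cost_per_hour(nodes, price_per_node, emr=False):
    """Hourly cluster cost; EMR roughly doubles the effective spot cost."""
    hourly = nodes * price_per_node
    return hourly * (1 + EMR_OVERHEAD) if emr else hourly

spot = cluster_cost_per_hour(NODES, SPOT_PRICE)                 # ~$3/h
on_demand = cluster_cost_per_hour(NODES, ON_DEMAND_PRICE)       # ~$30/h
spot_emr = cluster_cost_per_hour(NODES, SPOT_PRICE, emr=True)   # ~$6/h

print(f"spot: ${spot:.2f}/h, on-demand: ${on_demand:.2f}/h, "
      f"spot+EMR: ${spot_emr:.2f}/h")
```

At these assumed rates, rolling your own spot cluster is roughly 10x cheaper than on-demand, and EMR on spot gives half of that saving back.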
For the code itself, this is a situation where you'll want to stick with the programming language that has both the best performance and best ecosystem. I'm personally not a big fan of Java, but it really does win out here -- it's close to C or C++ for performance and has the advantage of the Hadoop ecosystem behind it. Other languages are certainly usable, but even if LanguageX only ran 4x slower than Java, the resulting job would be 4x more expensive due to paying by the hour.
Other than that, it's really just a standard MapReduce job using Hadoop. You can see three examples, one for each of the three data formats we use, at:
https://github.com/commoncrawl/cc-warc-examples/
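For the curious, here's a rough sketch (Python, standard library only) of what reading a WARC record boils down to. It parses a synthetic single-record archive built in memory rather than a real Common Crawl file; the Java and Clojure examples linked above are the robust way to do this in practice.

```python
import gzip
import io

def read_warc_records(stream):
    """Yield (headers, payload) for each WARC record in a byte stream.
    Minimal illustration only -- no error handling or chunking."""
    while True:
        line = stream.readline()
        if not line:
            return
        if not line.startswith(b"WARC/"):
            continue  # skip blank separator lines between records
        headers = {}
        for header_line in iter(stream.readline, b"\r\n"):
            key, _, value = header_line.decode().partition(":")
            headers[key.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# Synthetic single-record gzip'd archive, for illustration.
body = b"<html>hello</html>"
record = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: " + str(len(body)).encode() + b"\r\n"
          b"\r\n" + body + b"\r\n\r\n")
archive = io.BytesIO(gzip.compress(record))

with gzip.open(archive) as stream:
    for headers, payload in read_warc_records(stream):
        print(headers["WARC-Target-URI"], len(payload))
```

A real crawl archive is just many such records concatenated into large gzip'd files, which is why a single streaming pass maps so naturally onto MapReduce.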
Thanks for your comments. I see you get 400 cores for $3/h. Were you optimizing for processing power per dollar? Do you know if the m1.xlarge instances are the best for this use case?
It really depends on the use case. I mentioned m1.xlarge as they provide a good general cluster setup. The mix of CPU, RAM, and disk space should work well for most experiments one might want to perform. m1.xlarge instances also have 1Gbps network interfaces when others in the same price range have 500Mbps -- a vestige of the older generation of machines. Finally, they excel at disk space. If you're utilizing HDFS heavily, newer instances are usually SSD (good) but have 10 to 20 times less disk storage (bad).
For the same dollar amount, you can trade for other specs though.
If you're more interested in CPU / RAM / SSD, for example, paying $3/h for a cluster of r3.xlarge instances gets you 3TB of RAM, about 1.5 times more computing power (the same number of cores, but more compute units), but far less disk space -- 8TB of SSD.
In the end, it really depends on the task at hand, but your dollar does go quite far regardless!
What would be the benefit of having the web as a dataset when it is rife with copyright restrictions (see Craigslist), monopoly businesses who viciously protect their human-uploaded content and profiles, and the majority of the data from the web being useless without the context and purpose of the searcher?
I'm just curious as to how commoncrawl compares with kimonolabs and import.io as they seem to have the same goal of creating an internet as a dataset, or an API. I can't help but feel like it's just solving another 'semantic web' problem that nobody asked for.
It is funny that the most demanding customers of the semantic web are also the ones willing to spend the least time and money on it.
Regarding copyright issues, I believe you can still use copyrighted data as long as it's transformed. E.g., building language models, or doing a search engine like Google. In fact, I can think of more computational uses for copyrighted data, while on the "banned" side, I can only think of... SEO.
Regarding point two: "monopoly businesses who viciously protect their human uploaded content". I spend a lot of time scraping these monopoly businesses, and it seems to me they do a decent job of letting their users decide what data is exposed. Facebook, LinkedIn, and Google are all decent about letting me scrape their public info. That's all I have a right to -- private info should stay private, at the behest of the owner (the User in UGC).
You are correct regarding the third point, but I don't see that as a problem. This isn't a solution in search of a problem -- it's a problem without a solution at the moment.
Here's a toy example of something I'd like to do: calculate the positive / negative sentiment of commenters at particular baseball fan sites, so I can hide the content I don't like, and show that which I do. Having a common crawl of the site would be immensely useful (and is indeed a prereq) for this. I wouldn't need to republish it, just compute on it.
At first Google was a search algorithm, but at some point they decided to have humans review and rank the important queries. Important as in query volume.
Why use humans? People can decide if your navigation is intuitive. They can decide if your page looks like crap. If 230,000 people are searching for "coconut oil" per month (actual numbers) then it's worth having an intern spend 15 minutes to make sure page 1 of "coconut oil" looks right.
Google can afford that. They need a human to decide if the "user experience" is actually good vs. disallowing the back button and forcing the browser to crash, which is how I suppose you could fake a "time on site" metric if this was just an algorithmic problem.
Google is now more like playing Zork. You type "Go North" like 10 million other people before you typed "Go North", and Google has already crafted the experience you'll find in the next room. (Which makes me wonder: do they score how boring you are based on predictability?) This is becoming more and more obvious over time, as a search for "calculator" shows you an actual calculator that a human at Google created. That's not an algorithmic response.
Similarly, I see that human touch coming more into play with voice recognition, Google Glass, Siri, etc. Call that "AI" or whatever. You ask Google a question and Google has already sculpted a slick answer based on tons of testing. That's how I see Google as a search engine now. Part of the crawling is interesting (recognizing objects in photos?) but I think human reviews of all the important websites and SERPs, that's harder for a competitor to reproduce.
> I think human reviews of all the important websites and SERPs, that's harder for a competitor to reproduce.
Google was forced into that by improved "search engine optimization". SEO used to be about things like keyword stuffing, but as Google made their search engine smarter, SEO companies made their search spamming smarter. There are now SEO operations using machine learning to reverse engineer Google's algorithms and then automatically spam just enough to stay under the threshold.
In 2010, Google tried using "local" data to improve search. That turned out to be extremely easy to spam. A classic example of this can be found by searching for "laptop repair bradford pa". This brings up "Illusory Laptop Repair", located in the middle of a railroad crossing. A SEO expert created that phony business listing to demonstrate how bad Google was at detecting such spam. Google still thinks it's real.
In 2012, Google tried using "social" data to improve search. That worked even worse. Fake Google accounts created to generate fake "+1"s may have exceeded the number of real ones. Google "+1"s are still for sale; the going rate is about $0.10 each.
Meanwhile, links aren't as useful as they used to be. Who creates a link to a retail outlet other than on social media any more? Google is trying all sorts of "signals", but in heavily spammed areas, they're not doing all that well.
Yandex has been trying search that doesn't weight links at all for some heavily spammed categories in the Moscow area. It seems to be working for fake real estate ads.
(We have a partial solution - find the real-world business behind the web site and check it out in hard data sources, such as Dun and Bradstreet or Experian, which have business credit data. See "sitetruth.com/doc".)
I agree. And what are the implications? Wasn't Google sued by people unhappy with their rankings (Howard Stern?), and wasn't Google's defense that their algorithm was unbiased? (Not a rhetorical question; I never followed how that played out.) Once you introduce human reviewers, you're going to have more unhappy businesses. I'd rather an SEO spammer push me down (nothing personal) than know an actual Googler secretly decided my website wasn't good enough -- a personal bias against me in particular. Who are the reviewers, what do we know about them, what are they looking for or not looking for, have they reviewed you personally, etc.? I think Google may call it a "manual action" in Webmaster Tools, an ambiguous way of saying somebody at Google manipulated the algorithm against you. Do they have different levels of manual actions? Are manual actions always disclosed?
I volunteered a bit for Common Crawl early this year (not much, just some Java and Clojure examples for fetching and using the new archive format).
Common Crawl already has many volunteers (and a professional management and technical staff) so it would seem like a good idea to merge some of the author's goals with the existing Common Crawl organization. Perhaps more frequent Common Crawl web fetches and also making the data available on Azure, Google Cloud, etc. would satisfy the author's desire to have more immediacy and have the data available from multiple sources.
I've always wanted to experiment with my own search algorithm. Unfortunately, I think this is still out of the budget of average programmers: just the hard drives to store 1.3 petabytes would cost six figures.[1][2]
1) I like the idea of human curation, but in combination with some sort of automated crawler (or other tool) that helps in the browser.
2) Why can't we also distribute the act of crawling, the maintenance of the index and the map-reduce (or other algorithm) that produces the data.
I've been thinking about architectures that would allow (in essence) a P2P search system. Would anyone be interested in talking about architectures to make this work? There are millions of computers on the web at any given time ... if it's built into the browser (or plugs in), you could have human input at the same time.
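As a hypothetical sketch of one piece of such an architecture: peers could partition the crawl by hashing each URL's host, so a single peer always owns a given site (which also keeps per-host politeness limits in one place). The peer names below are made up for illustration.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical peer ids; in a real P2P system these would be
# discovered dynamically rather than hard-coded.
PEERS = ["peer-a", "peer-b", "peer-c"]

def owner(url, peers=PEERS):
    """Map a URL to the peer responsible for crawling its host."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode()).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]

# Every URL on a given host maps to the same peer, so per-host
# politeness (robots.txt, rate limiting) stays in one place.
print(owner("http://example.com/a"))
```

This simple modulo scheme reshuffles most hosts when peers join or leave; a production design would want consistent hashing, but the core idea of hash-partitioned crawl ownership is the same.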
Yeah, this sounds all well and good in theory, but after visiting thousands of sites over the years, it might be a better idea to help engineers build a search engine for their own site/data first. I can't recall many websites that have amazing search. It's a problem when I have to use Google to find what I want on xyz.com, because if I search for what I'm looking for on xyz.com itself, I can't find it even if I know it's on that site.
It would be so nice to go to xyz.com and actually find what I am looking for in under 1 second.
This is customized for the site to 'power their own search'. The query sent to Google is based on the keyword but also on the article type (in this example) and then the xml that is returned is post-processed to have a custom render per article type.
Since we had all the data the results from Google were compared to a local search I setup in Sphinx (http://sphinxsearch.com/) and the Google results were more relevant and it was a lot cheaper to deploy.
I think "full-text indexing" is a very different problem than "search."
Full-text indexing (what ES provides) has been around for almost forever, ES just does a way better job of productizing/delivering it.
However, Google is far more than a text index. Ranking, currently is still very difficult and requires messing around with facet and weighting parameters.
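A toy sketch of that distinction: the inverted index answers "which documents contain this term?", while the ranking function (here plain TF-IDF, a far cry from Google's mix of signals) decides the order. The documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict

docs = {
    "d1": "full text indexing has been around forever",
    "d2": "ranking is the hard part and ranking decides order",
    "d3": "search needs both indexing and ranking",
}

# The "indexing" half: term -> set of documents containing it.
index = defaultdict(set)
tf = {d: Counter(text.split()) for d, text in docs.items()}
for d, counts in tf.items():
    for term in counts:
        index[term].add(d)

# The "ranking" half: order retrieved documents by TF-IDF.
def score(term, doc):
    idf = math.log(len(docs) / len(index[term]))
    return tf[doc][term] * idf

def search(term):
    return sorted(index[term], key=lambda d: score(term, d), reverse=True)

print(search("ranking"))   # ['d2', 'd3'] -- d2 mentions the term twice
```

Retrieval here is a set lookup; everything hard lives in `score`, which is exactly where facet and weighting parameters come in.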
I don't think so. I just went to their site, and the first case study I opened was from theguardian.com. I went to theguardian.com and tried a search. Guess who they're using to power their search function? Google.
In my opinion, which means nothing, sites need to figure out how to power their own search. Using a third party isn't going to work for most. Maybe people need to focus on building custom architecture that indexes the data in a more structured way, rather than cobbling systems together that ultimately hinder search efforts when it's time to get the user what they want. I don't know the answer, but somebody eventually will. Maybe WordPress will create a powerful search for all those WordPress sites.
Maybe I'm beating a dead horse but I feel like you could create a pretty compelling search engine using ES. I used it this summer at Goldman and it seems like it will really change the landscape... It's just so fast. I mean full-text-indexing seems to be a pretty integral part of search. Maybe it's just the first step, with the second step being a well written ranking function, but that's just my thought.
Article is clearly from an earlier era, but it's really cool to see how far we've come and how much more computing power we have available now. There are entire categories of problems that simply don't exist anymore.
She later designed the search engine of Cuil. While Cuil failed, it only cost them about $30 million to do most of what Google does.
It's surprising to me that there aren't search engines from Comcast, AT&T, and Apple. If you have customers, why give up all that ad revenue to Google? Google is paying some big players a lot of money not to do that. They were paying Apple $1 billion a year to be the default on Apple products. Apple switched from Google to Bing anyway.
> While Cuil failed, it only cost them about $30 million to do most of what Google does.
They raised ~$30 million in two rounds, but their valuation was at $200 million by round two. I agree with your point though; the cost to develop a good search engine is dirt cheap compared to the value it brings.
"Valuation" by whom? They had no revenue, no revenue model, no VC would give them additional funding, and Google didn't buy them out. On September 17, 2010 at 1 PM, all the employees were told the company was shutting down.
Google did hire the CEO and Anna Patterson to keep them from doing another search engine.
I really enjoyed that article. I read it over a year ago while I was doing Udacity’s [Intro to Computer Science](https://www.udacity.com/course/cs101) course where you learn to build a web crawler and implement a basic page ranking algorithm.
This would be really cool to participate in, especially if it could be packaged à la Folding@Home/SETI@Home and widely distributed. I wonder if there's some clever method using crypto that can provably discourage bad actors if the network has certain properties (e.g. Bitcoin is nearly impossible to cheat unless one group owns >50% of the network).
Google's power comes not from the crawling, but from the retrieval and ranking. They use many more signals than the hyperlinks and anchor text (which is all you'd have if you crawled yourself). Indexing crawled content would have been OK in the year 2000; but today, the users demand more. Relevance is the top priority, and no one does it better than El Goog.
Sorry, but for me Google's ranking algorithm is far from brilliant.
To give you an example, search for "webhcat primary key" (without quotes) and note how the top three search results do not actually contain the term webhcat. Google constantly does this. It randomly ignores search terms unless you explicitly quote them.
I believe that there is still a market for a technical/advanced search engine.
Isn't google doing that because it detected the semantic information was on the page, even if the exact term wasn't? Is your issue with the fact that they're doing more than just a keyword retrieval, or is your issue with the fact that they're doing it poorly?
Isn't my issue obvious? I wanted search results that contained the search terms. Otherwise I wouldn't have entered them in the first place.
I understand Google is trying to be clever here and appealing to novices who don't really understand what they want.
But my point is that for those of us that do it is an incredibly annoying "feature". Feature is in quotes because in the specific case above they didn't find semantic equivalents. They just dropped the "webhcat" term entirely.
Not touching that with a 10 foot pole... (just teasing, just teasing...)
But seriously, it sounds like what you want is a keyword matching engine. Google, for better or worse, has decided they know enough about their users' searches that they don't mind modifying the query in an attempt to retrieve what people want rather than what they literally say they want.
I understand that you don't feel they're serving your needs any longer, and that can be frustrating. I think, however, that you can precede mandatory terms with a + sign to require them to be present on the page.
After Google Plus was released, Google changed its search syntax slightly so that you now need to "quote" mandatory terms instead of prefixing them with a + symbol.
Maybe more people should start crawling and seeing what they can extract? I remember seeing DuckDuckGo Instant Answers and thinking what a valuable resource that would be (having a database like DDG must have, I mean).
Then one would be able to do some "stuff Google can do" - say, analysing trends - albeit with worse sampling, and not depend that much on them.
The problem with algorithmic/scraper search methods is that they only work with existing data. For example, most Google searches give a list of websites on one side and some data scraped from Wikipedia on the other. There is not much meaning there. That's because Google's algorithm cannot combine the results into something original; that would require human creativity. As such, I see the rise of different kinds of search based on what humans create, rather than what computers can scrape. I wrote a (longish) blog post on this problem: http://newslines.org/blog/googles-black-hole/
Surprised not to see a mention of a Bloom filter in URL dedupe. Another tough problem now is the portion of the web in walled gardens, or that is expensive to crawl (needs a JS context).
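For reference, a Bloom filter for frontier dedupe is only a few lines: constant memory, no false negatives, and a tunable false-positive rate. This is a minimal illustrative sketch (toy sizes, SHA-1 as a stand-in hash family), not production code.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for URL dedupe; sizes are toy values."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k independent bit positions by salting one hash.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen)   # True
print("http://example.org/" in seen)   # False (with high probability)
```

The trade-off that makes this attractive for a crawl frontier: a "no" answer is always correct, so you never re-crawl by mistake; an occasional false "yes" just means skipping a URL you hadn't actually seen.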
If my memory serves me correctly, only the client part of Grub is open source. Without the server part, one cannot use it to set up one's own crawl.
[1]: http://blog.commoncrawl.org/2014/11/october-2014-crawl-archi...