One thing I hope this project does that Google fails to do is give developers a good API to access search. Google closed down their first web search API and now only gives developers access to a limited Custom Search API that's rate-limited to 100 free queries a day with a hard limit of 10k searches, which makes it either very hard to develop anything against or relatively expensive. There are other options (Bing, Faroo, raw access to CommonCrawl), but they're either low quality or hard to work with. A good-quality, straightforward, open web search API would be awesome.
I would pay for an API that gives me access to even a partial set of good-quality crawl content.
It would be even better if the web, at least in parts, were treated as a digital library, and non-profit organizations recognized the value of access to such a resource and provided it (just as they maintain roads or public schools).
I would love to have a list of URLs from all .edu domains that contain the word "publications". It's a bit silly, but I'd love to build an open version of something like Google Scholar.
I can think of many other use cases as well, where a product needs to be built from a larger but carefully selected set of raw input or pages.
BTW: Thanks for Hackday Paris 2011! Loved that event and venue :)
Hey, you're welcome! Hackday Paris 2011 brings back some nice memories ;)
One problem with what you'd like to do is the pagination. Because we have to send queries to all the shards and then re-rank them, it becomes increasingly hard (and useless for most users) to build pages with p > ~20 (which is why all search engines heavily limit their pagination). So our main infrastructure definitely won't be optimized for that :/
However sending the top ~500 pages with a keyword + a domain filter should be doable pretty easily!
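To make the pagination cost concrete, here's a toy sketch of the scatter-gather merge (the shard count, page size and code are made up for illustration, not our actual stack):

    import heapq

    # Toy model of scatter-gather pagination over a sharded index.
    # Each "shard" is a list of (score, doc_id) pairs sorted by descending score;
    # the shard count and page size are made-up numbers.
    def search_page(shards, page, page_size=10):
        needed = page * page_size                      # every shard must return this many hits
        partials = [shard[:needed] for shard in shards]
        merged = heapq.merge(*partials, key=lambda hit: -hit[0])
        ranked = list(merged)
        return ranked[(page - 1) * page_size : page * page_size]

    # For page 200 with 10 shards, up to 10 * 200 * 10 = 20,000 hits get pulled
    # and merged just to show 10 results - which is why engines cap pagination depth.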
This is basically my use-case too, in that it's more important to get access to _every_ document which contains that keyword and less important to rank them in a search-engine order. I think that it may be possible to do fairly inexpensively[1] but I'm still benchmarking to pick the right mix of technologies and data structures.
(I'm not with CommonSearch. I have my own project that crawls extensively though.)
You do realize that you are talking about potentially a LOT of data?
To give you an example: The word "work" occurs on about 4% of all web-pages. So even if there were only about 2bn pages in an index, that would mean 80 million matching pages. Even if you only need their URLs that would be about 2.4gb of data assuming an average URL length of 30 bytes. Ok, compression can make that smaller, but still...
It would also mean that the server would need to make 80 million random reads to get the URLs. Even with SSDs that would take some time. Hmm, actually in this case it may be faster to just read all the URL data sequentially than to do random reads. But in both cases we would be talking about minutes needed to get all that data from disk.
I currently have a search-index with about 1.2bn pages - I expect to reach 2bn pages by mid-May - that could be used to get the kind of data you need. But not in a realtime API. Not that amount of result-data.
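For anyone who wants to play with the assumptions, the back-of-envelope math is roughly this (the 4% hit rate and 30-byte URLs are the figures above; the SSD numbers are optimistic guesses, not benchmarks):

    # Back-of-envelope for the "work" example above; the 4% hit rate and 30-byte
    # average URL are the figures from the comment, the disk numbers are guesses.
    pages_in_index = 2_000_000_000
    matching       = int(pages_in_index * 0.04)        # ~80 million matching pages
    url_bytes      = matching * 30                     # ~2.4 GB of URLs, uncompressed

    ssd_iops     = 100_000                             # optimistic random-read IOPS
    seq_mb_per_s = 500                                 # sequential read throughput

    random_read_s = matching / ssd_iops                # ~800 s just to fetch the URLs
    all_urls_gb   = pages_in_index * 30 / 1e9          # ~60 GB if you scan every URL instead
    sequential_s  = all_urls_gb * 1e3 / seq_mb_per_s   # ~120 s for the full sequential scan

    print(f"{matching:,} matches, ~{url_bytes / 1e9:.1f} GB of URLs")
    print(f"random reads: ~{random_read_s / 60:.0f} min, full scan: ~{sequential_s / 60:.0f} min")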
1) I'd be very interested in such a service.
2) Yep, it's a lot, but that query is quite a lot bigger than most. Assuming some constraints on the layout of the index, I estimate you'd spend roughly $70 plus taxes and compute time retrieving the indexed documents from S3 for that query. You'd always be able to reduce or expand the keywords and only retrieve as much as you could afford. I think there's value both in letting people query the index themselves and in providing a paid-for managed service that automates much of that.
What would be more useful to you, the raw data - meaning for each page a list of the keywords on it - or the reverse-word-index?
Raw-data may be better for batch-processing or running multiple queries at the same time.
My crawler currently outputs about 40-45gb of raw-data per day (about 30 million pages). Full crawl will be 2bn pages, updated every 2-3 months.
The reverse-word-index would be about 18gb per day for the same number of pages.
Reverse-word-index is already compressed, raw-data isn't.
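To make the distinction concrete: the raw data is basically page -> keywords, the reverse-word-index is keyword -> pages. In miniature (toy data, nothing from the actual crawl):

    from collections import defaultdict
    from urllib.parse import urlparse

    # Toy illustration of "raw data" (page -> keywords) vs. the reverse-word-index
    # (keyword -> pages); the URLs and keywords are made up.
    raw_data = {
        "https://example.edu/publications": ["publications", "research", "papers"],
        "https://example.org/about":        ["about", "research", "team"],
    }

    reverse_index = defaultdict(list)
    for url, keywords in raw_data.items():
        for word in set(keywords):
            reverse_index[word].append(url)

    # The .edu + "publications" query from upthread is then just a filter:
    edu_pubs = [u for u in reverse_index["publications"]
                if urlparse(u).hostname.endswith(".edu")]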
There is a small problem with the crawl though, as it does not always handle non-ascii characters on pages correctly. I'm working on that.
BTW: I also currently have a list of about 8.5bn URLs from the crawl. About 600gb uncompressed. These are the links on the crawled pages. Obviously not all of those will end up being crawled.
I'm not the original commenter but there could be a huge use case in research. Lots of researchers work on UIs for search, interactive search systems, and query refinement algorithms that are really just abstract layers over an existing search engine. It used to be that we could just overlay stuff over Google, but most search engines nowadays are a pain to work with.
I see scientific agents on the horizon and with them a new wave of research. But access to raw data, e.g. publications, is very limited and restricted. It's a pity.
I'd be especially interested in a crawl results API if it could distinguish "news" sites from other content. Some of our work involves analyzing content, extracting keywords and then looking for relevant news (and other context) around those keywords. Of course the crawl would need to be relatively fresh to be useful for that.
Worst case we may build our own specialized crawler just for this purpose, but it would be nice if there was a useful search engine API we could leverage. And, of course, we'd be happy to pay for access to such an API.
Some of my primary interests are building websites that can analyse other websites and for that, I need a keyword index and access to the underlying crawl data (as in, a response that can point me to the exact file offsets that contain the pages). Think [1] but with keywords instead of URLs.
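For context, the public Common Crawl URL index already works roughly that way: it returns the archive filename, byte offset and record length, and you pull just that record with an HTTP Range request. A rough sketch of that flow (the collection name is only an example; a keyword index would need a different lookup on top):

    import gzip
    import json
    import requests

    # Rough sketch: look up one URL in the public Common Crawl index, then pull
    # just that record out of the .warc.gz archive with an HTTP Range request.
    # The collection name below is only an example - pick one from index.commoncrawl.org.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-07-index"

    resp = requests.get(INDEX, params={"url": "example.com/", "output": "json"})
    record = json.loads(resp.text.splitlines()[0])     # has filename, offset, length, ...

    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    raw = requests.get(
        "https://commoncrawl.s3.amazonaws.com/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    ).content

    page = gzip.decompress(raw)   # one WARC record: WARC headers + HTTP response + HTML
    print(page[:300])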
From their FAQ: "We will particularly stand out with features that are not in the best interest of commercial search engines: showing less (if no) ads, having an open API, and generally not trying to maximize the amount of time users spend on our service."
Seems like right now they are focusing on getting contributors though.
Can I ask what's wrong with Bing's API (outside of it being Bing)? 5000 monthly transactions before you begin paying, and it doesn't appear to be hard to work with?
I've tried using search engines other than Google numerous times, but each time I've returned to Google simply because the searches are better. They're more accurate, more relevant, and I very rarely find myself searching more than once to find something.
If commonsearch can beat Google in that regard, then count me in. But I doubt it will.
When I switched to DuckDuckGo last year I read an interesting comment from someone. The basic idea was that we have all become so accustomed to Google's results and the manner in which we use it (i.e. the way we define our search terms) that it is actually we who must be willing to reprogram our search practices if any competitor is to have a chance to catch up.
I'm not sure if I buy that, but I do believe that if we don't commit to alternatives, it will be next to impossible for alternative search engines to get as good as Google on result quality and relevance. Google simply knows too much about me and has performed so many more searches that it's hard for a rival to outperform them. I still use DDG's '!g' often, but I feel like I'm doing my part to help DDG get better for me and other users.
Same here. I'd say I use !g on DuckDuckGo about 1/3 of the time. If I'm getting really frustrated with a problem and am hacking away, I sometimes just default to !g.
DuckDuckGo is now doing localized results (you can choose your region, so it's transparent, unlike Google's). The thing about Google is that, even if you're not signed in, it still tries to present you with personalized results (based on previous searches in that session, your IP, your region ... if you're searching from work, it probably factors that in as well).
When people talk about getting to the top of Google results, my response has always been, "Well, you need to be more popular and relevant. Also, you may be at the top... for some people, but not everyone."
I don't think search result quality is on a linear scale so it's hard to define "better".
The results will definitely be less personalized, which will be a big plus for some people, and a blocker for others. There will be a few other dimensions where we can stand out, and some where we will have a hard time catching up (index size for instance).
In the end, given enough contributors, I'm pretty sure the results can get "good enough" for most people, and hopefully "better" for some ;)
I think it's to do with the search engine's actual algorithm more than personalisation. Even on a completely new computer or while using tor, Google's results are pretty much always spot on.
Regardless, I'll be keeping an eye on CommonSearch.
OpenWebIndex is only in the idea stage as far as I know. Deusu has been running for over a year and already has about 1.2 billion pages in their index.
The current average size of documents in the index is 2kB, so we'll need ~3.5TB of storage. For 1 replica this could mean 5 i2.xlarge instances if we go for SSDs on AWS.
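For reference, the arithmetic behind that (the document count is an assumption - roughly one Common Crawl's worth of pages - and an i2.xlarge comes with a single 800GB local SSD):

    import math

    # Sizing estimate behind the numbers above; the document count is an assumption
    # (roughly one Common Crawl's worth of pages), 2kB is the average indexed doc size.
    docs         = 1_750_000_000
    avg_doc_kb   = 2
    index_gb     = docs * avg_doc_kb / 1e6              # ~3,500 GB, i.e. ~3.5 TB

    i2_xlarge_gb = 800                                  # local SSD on one i2.xlarge
    per_copy     = math.ceil(index_gb / i2_xlarge_gb)   # = 5 instances per copy of the index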
I suggest not using AWS if you know that you'll need a server 24/7. Old-school hosters which offer dedicated servers are much cheaper for that use-case.
There are several offers here in Europe where you can get an i7-6700, 64gb RAM and 1tb SSD for less than €60/month. AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.
>AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.
Isn't there more to the analysis than just comparing cpu before we can conclude it will save a lot of money?
It looks like their servers[1] use ~150TB of source data that's already hosted on AWS. The source .gz archives of the Common Crawl on AWS S3 are then imported into Elasticsearch instances whose disks are also hosted on AWS.
To pull ~150TB of data using network speeds of 30 megabytes/sec[2] would take 60 days to transfer from AWS to another USA datacenter like Rackspace.
(Copying data from AWS to AWS isn't instantaneous either but it won't take ~60 days. At 60 days, the next crawl archive would have been released before you finished importing the previous one!)
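Spelling the 60-day estimate out, since it falls straight out of the bandwidth assumption:

    # The "60 days" figure above; only the 30 MB/s assumption comes from [2],
    # the rest is arithmetic.
    dataset_tb = 150
    rate_mb_s  = 30

    seconds = dataset_tb * 1e6 / rate_mb_s   # TB -> MB, divided by MB/s
    days    = seconds / 86_400               # ~58 days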
Questions would be:
1) What are current 2016 network speeds between cloud providers?
2) What's the cost of ~150TB of network bandwidth?
3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?
> 1) What are current 2016 network speeds between cloud providers?
I'm pretty sure that if you need to ingest ~150TB you can pull it from AWS/S3 much faster than you think. To absorb ~150TB you'd need ~75 nodes. Given that you can download partials of Common Crawl, you can split the job across 75 nodes downloading in parallel; with 1Gbit/s ports you should be able to pull it down relatively quickly compared to your estimate.
I'd bet you could pull ~150 megabytes/s [16 mbit/s per node].
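Spelled out, the parallel version of the same estimate (node count and per-node rate are the figures just above, nothing measured):

    # Parallel version of the transfer estimate; the 75 nodes and 16 Mbit/s per
    # node are the figures from this comment.
    dataset_tb      = 150
    nodes           = 75
    per_node_mbit_s = 16

    aggregate_mb_s = nodes * per_node_mbit_s / 8                  # = 150 MB/s across the cluster
    days           = dataset_tb * 1e6 / aggregate_mb_s / 86_400   # ~11.6 days instead of ~58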
> The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program.
From Public Data Sets, you can download the files entirely free using HTTP or S3.
> 2) What's the cost of ~150TB of network bandwidth?
Free.
> There are no charges for overage. We will permanently restrict the connection speed if more than 30 TB/month are used (the basis for calculation is for outgoing traffic only. Incoming and internal traffic is not calculated). Optionally, the limit can be permanently cancelled by committing to pay € 1.39 per additional TB used. Please see here for information on how to proceed.
> 3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?
I suspect you are greatly overestimating the difficulties since most DCs basically let you ingest/download for free because of the asymmetry on their networks.
One thing to consider is that we can build the index on AWS and then only do replication with other datacenters at the Elasticsearch level, which is ~50x smaller than the raw data.
You might be able to get around the personalization issue by allowing users to opt-in to certain tags for particular searches: developer, sports, music, etc.
I think what we really need is basically exactly what Google has in terms of search technology, except that it needs to be open and explicit, instead of it being Google's secret proprietary data on me that I can never access and that they get to sell to advertisers to my detriment.
I (like many other people on HN) use Google constantly when programming, and it's impossible to overstate the convenience and power of Google's almost creepy ability to guess exactly what the language and context of my search query is. It cuts precious seconds off of each query (when I routinely make hundreds of queries per day), and more importantly cuts out the interruption of mental flow as you try to re-word your query into a format that the search engine will understand. This can potentially add up to hours of saved time per day, depending on how you calculate the impact of these features.
In order to duplicate this I don't think we can get around the need for "search profiles", which takes into account your location, interests, past searches, personal connections, etc, but it needs to be explicit and it needs to be my data. If I want to delete it or sell it, it really needs to be up to me. If we could figure out a secure way to do this, then we would have the framework necessary to compete with Google with an open platform. Until then, it's just not going to be worth it to switch for the vast majority of people.
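Purely hypothetically - none of this exists in Common Search today, and the field names are made up - an explicit profile could be as simple as a user-held document that gets attached to each query and never stored server-side:

    import json

    # Hypothetical sketch of a user-owned "search profile": it lives with the user
    # (a local JSON file they can edit, export, sell or delete) and is attached to
    # each query explicitly instead of being inferred and kept server-side.
    # None of these field names come from Common Search.
    profile = {
        "languages": ["en", "fr"],
        "region": "FR",
        "interests": ["programming", "python", "elasticsearch"],
        "recent_queries": ["deep pagination shards", "warc record offset"],
    }

    def build_query(terms, profile):
        # The engine may use the context for ranking this one request,
        # but the user decides what goes into it.
        return {
            "q": terms,
            "context": {k: profile[k] for k in ("languages", "region", "interests")},
        }

    print(json.dumps(build_query("custom search api limits", profile), indent=2))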
I believe trying to create a general search that's better than google is pretty much a lost battle.
I think the next big thing in search will start as a niche thing. If you reduce your user domain you can have a better shot at producing better results even without tracking.
For example, you could develop a search engine for developers/IT people, make its results better than Google's, and then expand to other domains.
I agree. Whenever I search from the address bar on some computer and wonder why the results are bafflingly awful, I realise the search is set to Bing or another search engine. Re-typing the exact same phrase into Google, and suddenly the top results are exactly what I wanted in the first place.
I think people might underestimate the power of an open source search engine. In my eyes it is like Wikipedia versus the old paper encyclopedias. Improvements to search results in Google are made by a relatively small number of people at Google. Google decides where you buy, what you think and how you live. Behind their algorithms they have probably made dozens of subjective choices. Public debate, more attention to detail, and open politics are, as I see it, great tools to improve search engine quality.
And even if organizations smaller and less sophisticated than Google can't match the kind of search quality that Google has achieved, the autonomy and transparency issues are still important.
An interesting metaphor for search engines' power is in
>Improvements to search results in Google are done by a relatively small amount of people from Google
How many open source projects log more engineering hours than Google's search team? It's the flagship product of one of the largest corporations in the world.
Wikipedia has 133,621 active registered users and 27,755,916 registered users overall. Furthermore, Wikipedia has 819,043,068 page edits. Google will probably have better and more engineers, but they rely on usage statistics rather than experts in the specific search domains.
I like the project's goal but as techies, we inevitably want to understand the technical details and how it helps (or handicaps) the search results in comparison with Google.
For example, the project's data sources[1] says that the bulk of data comes from The Common Crawl. It looks like the CC is ~150 TB of data[2]. I'm not familiar with google.com internals but various sources estimate that their proprietary crawl dataset is more than a petabyte. (A googler could chime in here with more accurate data.)
So it's not as simple as the algorithm for Common Search being "more fair" than the algorithm for Google Inc. The underlying dataset in terms of quantity, recency, rules for the robot, etc all affect the algorithm.
This is not a criticism of the project. It is my attempt to understand what is not obvious on the surface level.
Hi! Data is indeed as important as the algorithm. Common Crawl is a very good bootstrap but we will certainly need to go beyond once it proves to be the limiting factor. We also hope we can help them improve their data set in the short term by giving them a larger URL seed list.
Seeing how the founder is the same person who founded Jamendo, which was later turned into a sad, user-unfriendly attempt to make money with freely licensed music (destroying its community in the process), how can I trust commonsearch not to be a waste of time and attention?
Well I actually share some of that anger so I'm sorry if I read too much into "destroy" and "sad". Common Search is forkable by design so it should hopefully stay on course one way or the other!
I'm trying to find out from their website, but it's unclear. Are the servers hosted in the USA? And will the organisation be incorporated in the USA?
If you're talking about privacy and transparency, it's better to operate in a place bound by the European Charter of Fundamental Rights rather than the US Constitution, because the former gives people many more rights over their data and how it's used.
Thanks for your concern. However, I was under the impression that Wikimedia was a US organisation with some local chapters. Perhaps incorporate in an EU country from the start?
Totally on point: if privacy is a concern, the company should not be incorporated on American soil. I'm also having a hard time finding the roadmap for infrastructure, i.e. how many machines are running or expected to run. Scaling is not trivial and I don't see how that would be addressed.
Right, there is no specific roadmap for infrastructure yet or view of what servers are currently running. I will add those things on our Operations page, thanks! https://about.commonsearch.org/developer/operations
Neat, I was working on a project to give a full programmatic keyword index to the contents of the common crawl, but I guess there's no need! It's very exciting to consider what kind of applications you can build with this.
I think Google once mentioned that each day a surprising number of searches are unique - they had never been done before. If memory serves me correctly, that number was 30-40%.
I'd like to see such a search as well. But I think results for long-tail queries may not be so bad compared to the current ones because, especially for obscure queries that end up being searched only a very small number of times per week, the risk of malicious manipulation is (I think) lower, and the documents people end their search with are a good indicator of relevant results.
This sounds awesome. Speaking of building AIs/bots and such in your FAQ, the lack of a good open API for search is probably what gates that market to Google and Microsoft and the like - nobody else can just tap into a search engine. I'd love to be able to connect to this for queries at some point.
"nonprofit" for me is a bad smell. I.e. the problem of sustainability, which for nonprofits is all about the money and not about carbon or solar energy, rainbows, plutonium or any of that.