I've tried switching from Google to different search engines numerous times, but each time I've returned to Google simply because the searches are better. They're more accurate, more relevant, and I very rarely find myself searching more than once to find something.

If commonsearch can beat Google in that regard, then count me in. But I doubt it will.



When I switched to DuckDuckGo last year I read an interesting comment from someone. The basic idea was that we have all become so accustomed to Google's results and the manner in which we use it (i.e. the way we define our search terms) that it is actually we who must be willing to reprogram our search practices if any competitor is to have a chance to catch up.

I'm not sure if I buy that, but I do believe that if we don't commit to alternatives it will be next to impossible for alternative search engines to get as good as Google on result quality and relevance. Google simply knows too much about me and has performed so many more searches that it's very hard for a rival to outperform them. I still use DDG's '!g' often, but I feel like I'm doing my part to help DDG get better for me and other users.


Same here. I'd say I use !g on DuckDuckGo about 1/3 of the time. If I'm getting really frustrated with a problem and am hacking away, I sometimes just default to !g.

DuckDuckGo is now doing localized results (you can choose your region, so it's transparent, unlike Google's). The thing about Google is that, even if you're not signed in, it still tries to present you with personalized results (based on previous searches in that session, your IP, your region... and if you're searching from work, it probably factors that in as well).

When people talk about getting to the top of Google results, my response has always been, "Well, you need to be more popular and relevant. Also, you may be at the top... for some people, but not everyone."


I definitely share your usage of "!g", which is why Common Search already supports it ;)


Hi! I'm the founder of Common Search.

I don't think search result quality is on a linear scale so it's hard to define "better".

The results will definitely be less personalized, which will be a big plus for some people, and a blocker for others. There will be a few other dimensions where we can stand out, and some where we will have a hard time catching up (index size for instance).

In the end, given enough contributors, I'm pretty sure the results can get "good enough" for most people, and hopefully "better" for some ;)


I think it's to do with the search engine's actual algorithm more than personalisation. Even on a completely new computer or while using tor, Google's results are pretty much always spot on.

Regardless, I'll be keeping an eye on CommonSearch.


Confirmation bias, perhaps?


You may want to have a look at these as they have similar goals to yours:

http://openwebindex.eu/en/

https://deusu.org

Source code for deusu.org is at: https://github.com/MichaelSchoebel/DeuSu

OpenWebIndex is only in the idea stage as far as I know. DeuSu has been running for over a year and already has about 1.2 billion pages in its index.


Do you have a rough estimate of how many servers you will need for Elasticsearch for the 1.7bn URLs of the latest CommonCrawl?


The current average size of documents in the index is 2kB, so we'll need ~3.5TB of storage. For 1 replica this could mean 5 i2.xlarge instances if we go for SSDs on AWS.
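For anyone checking the arithmetic, here's a quick back-of-envelope sketch in Python (the ~1.7bn documents and ~2kB average are the figures above; the factor of 2 for one replica is just the naive doubling, not an official sizing):

    # Rough Elasticsearch index sizing from the figures in this thread.
    docs = 1.7e9                  # URLs in the latest Common Crawl
    avg_doc_bytes = 2 * 1024      # ~2kB average indexed document
    primary_tb = docs * avg_doc_bytes / 1e12
    total_tb = primary_tb * 2     # one replica roughly doubles the raw footprint
    print(round(primary_tb, 1), round(total_tb, 1))   # ~3.5 TB primaries, ~7.0 TB total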


Thanks for the answer.

I suggest not using AWS if you know that you'll need a server 24/7. Old-school hosts which offer dedicated servers are much cheaper for that use case.

There are several offers here in Europe where you can get an i7-6700, 64 GB RAM and 1 TB SSD for less than €60/month. AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.


> AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.

Isn't there more to the analysis than just comparing CPU before we can conclude it will save a lot of money?

It looks like their servers[1] use ~150TB of source data that's already hosted on AWS disks. The source .gz archives of the Common Crawl on AWS S3 are then imported into Elasticsearch disks that are also hosted on AWS.

Pulling ~150TB of data at network speeds of 30 megabytes/sec[2] would take ~60 days to transfer from AWS to another USA datacenter like Rackspace.

(Copying data from AWS to AWS isn't instantaneous either but it won't take ~60 days. At 60 days, the next crawl archive would have been released before you finished importing the previous one!)
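(For completeness, the ~60-day figure is just a straight division; a quick sketch using the 30 megabytes/sec number from [2]:)

    # Time to move ~150 TB over a single ~30 MB/s inter-datacenter link.
    total_bytes = 150e12
    rate_bytes_s = 30e6                    # 30 megabytes/sec, per [2]
    days = total_bytes / rate_bytes_s / 86400
    print(round(days))                     # ~58 days, i.e. roughly two months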

Questions would be:

1) What are current 2016 network speeds between cloud providers?

2) What's the cost of ~150TB of network bandwidth?

3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?

[1] https://about.commonsearch.org/developer/operations

[2] http://www.networkworld.com/article/2187021/cloud-computing/...


> 1) What are current 2016 network speeds between cloud providers?

I'm pretty sure that if you need to ingest ~150 TB you can pull it from AWS/S3 much faster than you think. To absorb ~150TB you'd need ~75 nodes. Given that you can download partial segments of the Common Crawl, you can break it up across 75 nodes downloading in parallel on 1 Gbit/s ports, so you should be able to pull it down relatively quickly compared to your estimate.

I'd bet you could pull ~150 megabytes/s in aggregate [~16 Mbit/s per node].
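(A quick sanity check on that bet; the 75-node count and ~2 MB/s per node are the assumptions from this thread, not measured numbers:)

    # Parallel pull of ~150 TB across ~75 nodes at ~16 Mbit/s (~2 MB/s) each.
    nodes = 75
    per_node_bytes_s = 2e6
    aggregate_bytes_s = nodes * per_node_bytes_s   # ~150 MB/s in aggregate
    days = 150e12 / aggregate_bytes_s / 86400
    print(round(days, 1))                          # ~11.6 days vs ~58 over a single 30 MB/s pipe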

http://commoncrawl.org/the-data/get-started/

> The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3.

https://www.hetzner.de/en/hosting/produkte_rootserver/ex41s

> 2) What's the cost of ~150TB of network bandwidth?

Free.

> There are no charges for overage. We will permanently restrict the connection speed if more than 30 TB/month are used (the basis for calculation is for outgoing traffic only. Incoming and internal traffic is not calculated). Optionally, the limit can be permanently cancelled by committing to pay € 1.39 per additional TB used. Please see here for information on how to proceed.

> 3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?

I suspect you are greatly overestimating the difficulties since most DCs basically let you ingest/download for free because of the asymmetry on their networks.


One thing to consider is that we can build the index on AWS and then only replicate to other datacenters at the Elasticsearch level, where the data is ~50x smaller than the raw crawl.
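(For scale: ~50x smaller than the ~150 TB of raw data discussed above works out to roughly 3 TB, which lines up with the ~3.5 TB Elasticsearch estimate earlier in the thread.)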


You might be able to get around the personalization issue by allowing users to opt-in to certain tags for particular searches: developer, sports, music, etc.


Duckduckgo added localization, but it's transparent and configurable.


I think what we really need is basically exactly what Google has in terms of search technology, except that it needs to be open and explicit, instead of it being Google's secret proprietary data on me that I can never access and that they get to sell to advertisers to my detriment.

I (like many other people on HN) use Google constantly when programming, and it's impossible to overstate the convenience and power of Google's almost creepy ability to guess exactly what the language and context of my search query is. It cuts precious seconds off of each query (when I routinely make hundreds of queries per day), and more importantly cuts out the interruption of mental flow as you try to re-word your query into a format that the search engine will understand. This can potentially add up to hours of saved time per day, depending on how you calculate the impact of these features.

In order to duplicate this I don't think we can get around the need for "search profiles" that take into account your location, interests, past searches, personal connections, etc., but it needs to be explicit and it needs to be my data. If I want to delete it or sell it, it really needs to be up to me. If we could figure out a secure way to do this, then we would have the framework necessary to compete with Google with an open platform. Until then, it's just not going to be worth it to switch for the vast majority of people.


I believe trying to create a general search engine that's better than Google is pretty much a lost battle.

I think the next big thing in search will start as a niche thing. If you reduce your user domain you can have a better shot at producing better results even without tracking.

For example, you could develop a search engine for developers/IT people, make its results better than Google's, and then expand to other domains.


I agree. Sometimes I search in the address bar on a computer and wonder why the results are bafflingly awful, then realise the default search is set to Bing or some other search engine. Re-typing the exact same phrase into Google, suddenly the top results are exactly what I wanted in the first place.



