Full Text Search in the Cloud (blurpr.com)
19 points by MarcusL on June 14, 2010 | 10 comments


Hey! I'm one of the websolr guys. Obviously, we're working on this problem, and it's a hard (but fun and rewarding) one for a number of reasons!

Search is fundamentally hard to put into the cloud, because it requires so many IO operations. In addition, delivering really high quality results requires machine learning and linguistics.

We have a few tricks up our sleeves to handle these issues, and I'm excited not only to shed the beta-ish feel, but to roll out some truly exciting features :) There's a "Review my startup" post coming pretty soon.


Hey! I'm one of the Loggly guys. Wanna grab a beer and chat about how we're doing it?


I am a co-founder of a startup (Fablo, http://fablo.pl/) that does exactly that. So as not to get wiped out by Google the day they decide to give that functionality away for free, we decided to specialize in e-commerce search, particularly in inflected languages, which is a much harder task than searching English.


I have Solr mostly working on AppEngine.

Obviously (for those of you who know the Solr codebase), there are some pretty extreme hacks to get around the lack of file system access, but nothing that couldn't be cleaned up.

I was a little surprised about the lack of interest in it when I emailed the solr-dev list.
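
To give a rough idea of what working around the lack of file system access can look like (this is not the actual Solr-on-AppEngine code, just a minimal sketch), index files can be split into small chunks and kept in a key-value store. Everything here is hypothetical: `ChunkedFileStore`, the `(name, index)` key layout, and the plain dict standing in for the datastore are assumptions for illustration.

```python
class ChunkedFileStore:
    """Keeps named byte strings as fixed-size chunks in a dict-like backend.

    The backend is a plain dict here; a real deployment would swap in a
    key-value datastore client (an assumption, not shown).
    """

    def __init__(self, backend=None, chunk_size=4):
        self.backend = {} if backend is None else backend
        self.chunk_size = chunk_size

    def write(self, name, data):
        # Drop any previous version first so stale chunks never linger.
        self.delete(name)
        chunks = [data[i:i + self.chunk_size]
                  for i in range(0, len(data), self.chunk_size)]
        for index, chunk in enumerate(chunks):
            self.backend[(name, index)] = chunk
        # The chunk count is stored under a separate key so reads know
        # how many pieces to fetch.
        self.backend[(name, "count")] = len(chunks)

    def read(self, name):
        count = self.backend.get((name, "count"))
        if count is None:
            raise KeyError(name)
        return b"".join(self.backend[(name, i)] for i in range(count))

    def delete(self, name):
        count = self.backend.pop((name, "count"), 0)
        for i in range(count):
            self.backend.pop((name, i), None)
```

Usage would be something like `store.write("segments", data)` followed by `store.read("segments")`. The tiny default chunk size is only there to make the chunking visible; a real store would size chunks to fit whatever per-entity limit the backend imposes.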


The Compass GAE walkthrough (http://www.kimchy.org/searchable-google-appengine-with-compa...) was also able to get a rudimentary Lucene indexer up and running. However, many people eventually ran into problems with App Engine's 30-second request processing limit. For your Solr instance, did you use the Task Queue or have to do anything special to work around the 30-second limitation?
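
For anyone unfamiliar with the workaround being asked about, the general pattern is to index in small batches, stop before the request deadline, and hand a resume position to a follow-up task. Below is a minimal sketch of that pattern only. It does not use the real Task Queue API: `index_batch`, `run_to_completion`, the 25-second budget, and the toy dict-based inverted index are all assumptions, and the loop in `run_to_completion` merely stands in for re-enqueueing a task.

```python
import time


def index_batch(documents, index, start=0, budget_seconds=25.0,
                clock=time.monotonic):
    """Index documents[start:] until the time budget is used up.

    documents is a list of (doc_id, text) pairs; index maps each token to
    the set of doc ids containing it. Returns the position to resume
    from, or None once everything has been indexed.
    """
    deadline = clock() + budget_seconds
    position = start
    while position < len(documents):
        doc_id, text = documents[position]
        for token in text.lower().split():
            index.setdefault(token, set()).add(doc_id)
        position += 1
        # Check the deadline only after a document is done, so every
        # invocation makes progress even with a very small budget.
        if position < len(documents) and clock() >= deadline:
            return position
    return None


def run_to_completion(documents, index, **kwargs):
    """Stand-in for a task queue: re-invoke until no cursor comes back."""
    cursor = 0
    runs = 0
    while cursor is not None:
        cursor = index_batch(documents, index, start=cursor, **kwargs)
        runs += 1
    return runs
```

The budget is deliberately set below the 30-second limit to leave headroom for persisting the resume position before the request is cut off.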


I'm not doing anything with Task Queue at the moment. I'm using the Lucene implementation from Compass (and I've used it elsewhere too), so I am familiar with it.

From memory, I think Compass had a unique problem with the 30-second limit because it would try to re-sync the non-Lucene data with the Lucene indexes (I can't remember what the trigger was for this).

I had quite a lot of issues with Compass-GAE - my impression was that it wasn't really production ready. However, I did notice that Google is using it for their ThoughSite example app, so maybe it has improved.


I feel dumb, but I don't totally understand what is wanted here. Basically a Google search with "site:example.com query", but where you have full control of the results?


While Google/Bing/etc. are great for searching public sites (when you can tolerate the latency between when content is posted and when Google indexes it), they don't work at all for sites which are private/intranetted/password-protected or otherwise inaccessible to web crawlers.


Makes sense, thank you.




