The other day I thought how cool it would be to have a Web service that could crawl your site and auto-categorize all of your pages (or at least help you do it). As ever, it turns out someone is on the case ;-) Nice work! I think there's definitely a wider audience for this technology.
There are lots of uses for this, but my main advice is not to lose these three things:
1. The advantage of relevance within specific domains. PageRank was a huge value-add to relevance over other search engines, but internet-wide is now too ambitious. HN is a great corpus because the content is already vetted by a community. The work of integrating content from other specialized communities can add density and relevance.
2. Ease of integration. The less configuration this API requires, the better. Auto-tagging, done well, is very useful. I have a lot of ideas around this if you'd like to chat some time.
3. Ease of use in the interface. Combining browsable, faceted search with NLP is, I think, the sweet spot between returning lots of relevant results and still allowing for discovery.
I think that'd be quite easy, to be honest :)
I've been working on the same concept as that app; it's hackable in 3-5 weeks :) especially with some great tools such as Weka or Mahout.
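To give a sense of how little code the core of an auto-tagger needs with off-the-shelf tools: here's a minimal sketch in Python with scikit-learn (as an analogue to what Weka or Mahout provide in Java). The corpus and tags are invented purely for illustration.

```python
# Minimal auto-tagging sketch: TF-IDF features + a linear classifier.
# Toy documents and tags are made up for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "python pandas dataframe tutorial",
    "machine learning model training tips",
    "css flexbox layout guide",
    "responsive web design with html and css",
    "neural network training on gpus",
    "numpy arrays and python scripting",
]
tags = ["programming", "ml", "webdesign", "webdesign", "ml", "programming"]

# A pipeline keeps vectorization and classification as one reusable object.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, tags)

# Tag a previously unseen page title.
print(model.predict(["training a deep neural network"])[0])
```

On a real site you'd train on the crawled page text instead of titles, but the skeleton stays this small; the hard part is the corpus and labels, not the plumbing.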
It's hard to build a completely custom solution, although it's possible to let people use certain wrapped components, with explanations of how to achieve the best results using each method.
Again, one person may be satisfied with a given result that doesn't work quite as well for another person. Take clustering or topic extraction, for instance: you need some idea of what you want to get and what the possible outcomes are, and you have to dig into the data to get what you need. A generic approach will give a broad result set that may require additional effort for real-world usage.
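To make the clustering vs. topic extraction distinction concrete: clustering assigns each document to exactly one group, while topic extraction gives each document a mixture over topics. A quick contrast with scikit-learn (toy corpus invented for illustration):

```python
# Clustering vs. topic extraction on the same toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "goal match football player team",
    "team player score football match",
    "election vote government policy minister",
    "minister policy vote election debate",
]

X = CountVectorizer().fit_transform(docs)

# Clustering: one hard label per document.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # hard assignments, one cluster id per document

# Topic extraction: a soft distribution over topics per document.
mix = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
print(mix.round(2))  # each row sums to 1 -- topic proportions, not labels
```

Which of the two outputs is "right" depends entirely on what the user wants to do with it, which is exactly why a generic pipeline rarely satisfies everyone.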