
By the way, does anybody here know of an algorithm (and/or an already implemented open-source library/app) that copes well with auto-extracting content from forum-like websites? (e.g. phpBB, StackOverflow, HN, reddit, ...)


My suggestion would be to understand the Boilerpipe algorithm, which as far as I can see is the best available solution (and much clearer than Readability): http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlsc...

You can then easily adapt it for your requirements.
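
To make that concrete, here is a rough sketch of the kind of shallow text features the Boilerpipe paper works with: split the page into text blocks, then look at each block's word count and link density (the share of its words that sit inside links). The block tags, the thresholds, and the names `BlockSplitter` and `extract_content` are illustrative assumptions of mine, not the classifier or values from the paper or the Boilerpipe library.

```python
from html.parser import HTMLParser

# Assumption: tags that start or end a text block. Not taken from the paper.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "blockquote", "pre", "br"}


class BlockSplitter(HTMLParser):
    """Split HTML into text blocks and count the words inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # (text, word_count, link_word_count) per block
        self._words = []
        self._link_words = 0
        self._in_link = 0

    def _flush(self):
        # Close the current block, if it holds any text.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words)
            )
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text.

    Both thresholds are illustrative guesses and need tuning per use case.
    """
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    kept = []
    for text, n_words, n_link_words in parser.blocks:
        link_density = n_link_words / n_words
        if n_words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `extract_content(page_html)` returns the text of the blocks that pass both checks. On forum pages, short replies fall under the word threshold, so this is probably where you would need to adapt it, for example by also looking at the neighbouring blocks.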


Umm... anything that uses XPaths should work, I would think.

Apologies for blowing my own horn, but I've had some luck filtering HN and reddit with this project I built (I used to have an example in progress online, but I've taken it down): https://github.com/kennethrapp/embedbug


The point is I want some heuristic that would work "automagically" (like Readability, etc), not requiring me to invent a tailor-made xpath for each and every such website in the world.


Try this:

http://fivefilters.org/content-only/

It has a default extractor, and site-specific recipes use the same format as Instapaper, so you can leverage the work Marco has done on different sites.


Oh, alright.

If there is such a thing, I'd be interested to learn about it myself. TBH, "tailor-make an XPath for every site" is the best solution I'm aware of.
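
For what it's worth, the per-site approach can be as small as a lookup table from hostname to XPath. This is only a minimal sketch: the hostnames, class names, and the names `SITE_XPATHS` and `extract_posts` are made up for illustration. It also assumes well-formed markup, because the standard library's ElementTree only parses XML and supports a limited subset of XPath; real-world pages would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical per-site recipes: hostname -> XPath of the element holding each post.
SITE_XPATHS = {
    "forum.example.com": ".//div[@class='postbody']",
    "news.example.org": ".//span[@class='commtext']",
}


def extract_posts(url, xhtml):
    """Return the text of every post on the page, using the recipe for its host."""
    host = urlparse(url).hostname
    xpath = SITE_XPATHS.get(host)
    if xpath is None:
        # No recipe for this site: this is exactly the maintenance cost discussed above.
        raise KeyError(f"no XPath recipe for {host}")
    root = ET.fromstring(xhtml)
    # Join all nested text and collapse runs of whitespace.
    return [" ".join("".join(node.itertext()).split()) for node in root.findall(xpath)]
```

Every new site means a new entry in the table, which is why a generic heuristic is attractive even if it is less precise.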



