By the way, does anybody here know of an algorithm (and/or already implemented o...

syllogism · on Sept 17, 2014

My suggestion would be to understand the Boilerpipe algorithm, which as far as I can see is the best available solution (and much clearer than Readability): http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlsc...

You can then easily adapt it for your requirements.
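To give a feel for the kind of shallow text features Boilerpipe relies on (words per block and link density), here is a rough sketch. It is not the Boilerpipe implementation: the block-splitting rule, the names `BlockSplitter` and `extract_content`, and both thresholds are assumptions picked for illustration, and the real algorithm also looks at neighbouring blocks.

```python
from html.parser import HTMLParser

# Tags that start a new text block; a simplification chosen for this sketch.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre",
              "section", "article", "header", "footer", "nav"}
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Split HTML into text blocks, counting words inside and outside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # (text, word_count, linked_word_count)
        self._words = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._link_words))
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return [text for text, n, linked in parser.blocks
            if n >= min_words and linked / n <= max_link_density]
```

Short blocks (menus, footers) and link-heavy blocks (navigation, "related stories") get dropped; what is left is usually the article or comment text. The thresholds would need tuning for any real site.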

krapp · on Sept 15, 2014

Umm... anything that uses XPath should work, I would think.
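For a single site whose markup you already know, the XPath approach looks roughly like this. It is a minimal sketch using only the standard library, which supports a limited XPath subset; the page markup, the `comment` class name, and the `extract_by_xpath` helper are all made up for the example, and real-world HTML would usually need a lenient parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this example.
PAGE = """<html><body>
<div class="nav"><a href="/">Home</a></div>
<div class="comment"><p>First comment.</p></div>
<div class="comment"><p>Second comment.</p></div>
</body></html>"""


def extract_by_xpath(markup, path):
    """Return the stripped text of every element matching the XPath subset."""
    root = ET.fromstring(markup)
    return ["".join(node.itertext()).strip() for node in root.findall(path)]


# Site-specific expression: paragraphs directly inside comment containers.
comments = extract_by_xpath(PAGE, ".//div[@class='comment']/p")
```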

Apologies for blowing my own horn, but I've had some luck filtering HN and reddit with this project I built (I used to have an in-progress example online, but I've taken it down): https://github.com/kennethrapp/embedbug

akavel · on Sept 15, 2014

The point is I want some heuristic that would work "automagically" (like Readability, etc.), not requiring me to invent a tailor-made XPath for each and every such website in the world.

rahimnathwani · on Sept 16, 2014

Try this:

http://fivefilters.org/content-only/

It has a default extractor, and site-specific recipes use the same format as Instapaper, so you can leverage the work Marco has done on different sites.
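The general pattern of "site-specific recipe if one exists, otherwise a default extractor" can be sketched as below. This is not the fivefilters or Instapaper recipe format: the `RECIPES` table, the hostname, and the toy fallback heuristic (pick the element whose direct `<p>` children hold the most text) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical per-site recipes: hostname -> XPath-subset expression.
RECIPES = {
    "news.example.com": ".//div[@class='comment']",
}


def text_of(element):
    """Concatenate all text inside an element and trim whitespace."""
    return "".join(element.itertext()).strip()


def default_extractor(root):
    """Toy fallback: the element whose direct <p> children hold the most text."""
    best = max(root.iter(),
               key=lambda el: sum(len(text_of(p)) for p in el.findall("p")))
    return [text_of(p) for p in best.findall("p")]


def extract(url, markup):
    """Use the site's recipe when one is registered, else the default."""
    root = ET.fromstring(markup)
    path = RECIPES.get(urlparse(url).hostname)
    if path is not None:
        return [text_of(el) for el in root.findall(path)]
    return default_extractor(root)
```

The point of the split is that the recipe table only has to cover sites where the generic heuristic gets it wrong.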

krapp · on Sept 15, 2014

Oh, alright.

If there is such a thing, I'd be interested to learn about it myself. TBH, "tailor-make an XPath for every site" is the best solution I'm aware of.