By the way, does anybody here know of an algorithm (and/or an already implemented open-source library/app) that copes well with auto-extracting content from forum-like websites? (e.g. phpBB, StackOverflow, HN, reddit, ...)
Umm... anything that uses XPath should work, I would think.
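For what it's worth, here is a rough sketch of the per-site XPath approach, using only Python's standard library. The markup and the `comment` / `user` class names are made up for illustration, and `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so real forum HTML would likely need a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a forum page.
PAGE = """
<html><body>
  <div class="comment"><span class="user">alice</span><p>First reply</p></div>
  <div class="comment"><span class="user">bob</span><p>Second reply</p></div>
  <div class="footer"><p>About</p></div>
</body></html>
"""

def extract_comments(markup):
    """Return (user, text) pairs for every div with class 'comment'."""
    root = ET.fromstring(markup.strip())
    comments = []
    # The expression below is the site-specific part that has to be hand-written.
    for node in root.findall(".//div[@class='comment']"):
        user = node.findtext("span[@class='user']")
        text = node.findtext("p")
        comments.append((user, text))
    return comments

print(extract_comments(PAGE))
```

The footer block is skipped because it does not match the class predicate, but the expression only works for a site that happens to use this exact structure.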
Apologies for blowing my own horn, but I've had some luck filtering HN and reddit with this project I built (I used to have an example in progress online, but I've taken it down): https://github.com/kennethrapp/embedbug
The point is I want some heuristic that would work "automagically" (like Readability, etc.), not requiring me to invent a tailor-made XPath for each and every such website in the world.
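To make the idea concrete, one possible heuristic (not Readability's actual algorithm, just a guess at what might work) is to look for sibling elements that share the same tag and class and carry the most text between them, since posts tend to be repeated, text-heavy blocks. A minimal sketch with the standard library; `RepeatedBlockFinder`, `guess_post_group`, and the sample markup are all hypothetical names and data:

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

class RepeatedBlockFinder(HTMLParser):
    """Groups elements by (parent, tag, class) and records each one's text length."""
    def __init__(self):
        super().__init__()
        self.stack = []                   # open elements: [node_id, key, text_len]
        self.groups = defaultdict(list)   # key -> text lengths of matching siblings
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        parent_id = self.stack[-1][0] if self.stack else None
        key = (parent_id, tag, dict(attrs).get("class"))
        self.next_id += 1
        self.stack.append([self.next_id, key, 0])

    def handle_endtag(self, tag):
        if not any(key[1] == tag for _, key, _ in self.stack):
            return  # stray closing tag, ignore it
        while self.stack:
            _, key, length = self.stack.pop()
            self.groups[key].append(length)
            if self.stack:
                self.stack[-1][2] += length  # text counts toward the parent too
            if key[1] == tag:
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2] += len(data.strip())

def guess_post_group(html):
    """Return (tag, class, count) for the repeated sibling group with the most text."""
    finder = RepeatedBlockFinder()
    finder.feed(html)
    finder.close()
    repeated = {k: v for k, v in finder.groups.items() if len(v) >= 2}
    if not repeated:
        return None
    best = max(repeated, key=lambda k: sum(repeated[k]))
    return best[1], best[2], len(repeated[best])

SAMPLE = """
<html><body>
<ul class="nav"><li>Home</li><li>New</li><li>Login</li></ul>
<div class="thread">
  <div class="post"><b>alice</b><p>I would try a repeated-structure heuristic first.</p></div>
  <div class="post"><b>bob</b><p>Agreed, navigation links are short while posts are long.</p></div>
  <div class="post"><b>carol</b><p>Text density per block is a useful tie breaker here.</p></div>
</div>
</body></html>
"""

print(guess_post_group(SAMPLE))  # ('div', 'post', 3)
```

The navigation list is also a repeated group, but it loses on total text. Real pages with nested replies, tables, or broken markup would need more than this, so treat it as a starting point rather than a solution.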
It has a default extractor, and site-specific recipes use the same format as Instapaper, so you can leverage the work Marco has done on different sites.