Cool idea. The Readability bookmarklet (http://lab.arc90.com/experiments/readability/) does a great job of extracting the main document text without having to compare the page against other files, though it currently runs entirely client-side.
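For anyone wondering how single-page extraction can work at all, here is a rough sketch of the general idea: group paragraph text by its nearest enclosing container and keep the container holding the most text. This is a much-simplified stand-in and not Readability's actual scoring; the names `ParagraphScorer`, `extract_main_text`, and the `CONTAINERS` set are my own assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate "content containers".
CONTAINERS = {"div", "article", "section", "td", "body"}


class ParagraphScorer(HTMLParser):
    """Groups <p> text by the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # ids of the containers that are currently open
        self.groups = {}     # container id -> list of paragraph texts
        self.next_id = 0
        self.in_p = False
        self.buf = []

    def _flush(self):
        # Store the paragraph collected so far under its nearest container.
        if self.in_p:
            text = " ".join("".join(self.buf).split())
            if text:
                key = self.stack[-1] if self.stack else -1
                self.groups.setdefault(key, []).append(text)
        self.in_p = False
        self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self._flush()
            self.stack.append(self.next_id)
            self.next_id += 1
        elif tag == "p":
            self._flush()    # an unclosed <p> ends where the next one starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag in CONTAINERS:
            self._flush()
            if self.stack:
                self.stack.pop()

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)


def extract_main_text(html):
    """Return the paragraphs of the container with the most paragraph text."""
    parser = ParagraphScorer()
    parser.feed(html)
    parser.close()
    parser._flush()
    if not parser.groups:
        return ""
    best = max(parser.groups.values(), key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)
```

The real bookmarklet weighs far more signals than raw text length, so treat this only as an illustration of why no comparison against other pages is needed.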
I wrote a Python port, which I used to create a version of the HN RSS feed with the content of each article embedded in the feed itself. It became too popular and caused my server to crash.
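To give a feel for the feed side of this, here is a minimal sketch of building an RSS 2.0 feed whose items carry the full article text in `description`. It is not the code that actually ran on my server; the `build_feed` function and the shape of the `entries` dicts are assumptions for illustration, and the extracted content is assumed to have been fetched already.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, entries):
    """Build an RSS 2.0 document as a string.

    `entries` is assumed to be a list of dicts with "title", "link",
    and "content" keys, where "content" is the extracted article text.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title

    for entry in entries:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        # ElementTree escapes markup here, so embedded HTML stays valid XML.
        ET.SubElement(item, "description").text = entry["content"]

    return ET.tostring(rss, encoding="unicode")
```

Fetching and extracting every linked article on each request is the expensive part, so caching the extracted content per URL would be the obvious thing to add before exposing something like this publicly.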
However, Andrew Trusty put a wrapper around my code so that you could use it with any website. It's written in Python and runs on Google AppEngine. Check it out: http://andrewtrusty.appspot.com/readability/
Looks interesting! I was not able to build the jar due to the flyingsaucer dependency failing. I tried using the maven repo's [groupId/artifactId "8RC1"] in my project.clj but no dice. I've msg'd the author on GitHub. Likely a quick fix and nothing to be upset over, but just so you're all aware...
Resulted in a clean build for me. I guess you could also stick the JAR in the local Maven repository. flyingsaucer lives at https://xhtmlrenderer.dev.java.net/
Of course, it wouldn't be able to filter out embedded marketing / product placement, an increasingly popular form of advertising. (But who can filter that out when it's integrated well enough?)