Sunflower: (clojure) extract story text from HTML markup tree

zacharypinter · on April 18, 2010

Cool idea. The Readability bookmarklet (http://lab.arc90.com/experiments/readability/) does a great job of extracting the main document text without having to compare against other files, though it currently runs entirely client-side.

nirmal · on April 18, 2010

I wrote a Python port, which I used to create a version of the HN RSS feed with the content of each article embedded in the feed itself. It became too popular and caused my server to crash.

However, Andrew Trusty put a wrapper around my code so that you could use it with any website. It's written in Python and runs on Google AppEngine. Check it out: http://andrewtrusty.appspot.com/readability/

If my host weren't acting up again, I'd link my port.

ropiku · on April 18, 2010

I also found a Ruby port: http://github.com/iterationlabs/ruby-readability

devin · on April 18, 2010

Looks interesting! I wasn't able to build the jar because the flyingsaucer dependency failed to resolve. I tried using the Maven repo's [groupId/artifactId "8RC1"] in my project.clj, but no dice. I've messaged the author on GitHub. Likely a quick fix and nothing to be upset over, but just so you're all aware...

jefffoster · on April 18, 2010

I found that running

  lein deps
  lein uberjar

resulted in a clean build for me. I guess you could also stick the JAR in your local Maven repository. Flying Saucer lives at https://xhtmlrenderer.dev.java.net/
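
For the local-repository route, a minimal sketch using Maven's `install:install-file` goal might look like the following. The function name `install_local_jar` and the values in the usage comment are hypothetical placeholders; substitute whatever group, artifact, and version the project's project.clj actually asks for.

```shell
# Sketch: install a downloaded JAR into the local Maven repository (~/.m2)
# so that `lein deps` can resolve it. install_local_jar is a hypothetical
# helper name; its arguments are the JAR path and the Maven coordinates.
install_local_jar() {
  mvn install:install-file \
    -Dfile="$1" \
    -DgroupId="$2" \
    -DartifactId="$3" \
    -Dversion="$4" \
    -Dpackaging=jar
}

# Usage (placeholder values, not the project's real coordinates):
# install_local_jar path/to/downloaded.jar GROUP_ID ARTIFACT_ID VERSION
```

After that, rerunning `lein deps` should pick the JAR up from the local repository, assuming the coordinates match the dependency entry.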

nathell · on April 18, 2010

I've uploaded the jar to the Downloads section and also added a note about Flying Saucer to the readme. Thanks for checking it out!

evanrmurphy · on April 18, 2010

Of course it wouldn't be able to filter out embedded marketing / product placement, an increasingly popular paradigm for advertisement. (But who can filter that out when it's integrated well enough?)

Neat idea.

regularfry · on April 18, 2010

Sounds related to ariel: http://ariel.rubyforge.org/