Sunflower: (clojure) extract story text from HTML markup tree (blog.danieljanus.pl)
24 points by gtani on April 18, 2010 | 8 comments


Cool idea. The readability bookmarklet (http://lab.arc90.com/experiments/readability/) does a great job at extracting the main document text without having to compare to other files, though it's currently all client-side.


I wrote a Python port, which I used to create a version of the HN RSS feed with the content of each article embedded in the feed itself. It became too popular and caused my server to crash.

However, Andrew Trusty put a wrapper around my code so that you could use it with any website. It's written in Python and runs on Google AppEngine. Check it out: http://andrewtrusty.appspot.com/readability/

If my host weren't acting up again, I'd link to my port.



Looks interesting! I was not able to build the jar due to the flyingsaucer dependency failing. I tried using the maven repo's [groupId/artifactId "8RC1"] in my project.clj but no dice. I've msg'd the author on GitHub. Likely a quick fix and nothing to be upset over, but just so you're all aware...


I found that running

  lein deps
  lein uberjar
resulted in a clean build for me. I guess you could also stick the JAR in the local Maven repository. Flying Saucer lives at https://xhtmlrenderer.dev.java.net/


I've uploaded the jar to the Downloads section and also added a note about Flying Saucer to the readme. Thanks for checking it out!


Of course it wouldn't be able to filter out embedded marketing / product placement, an increasingly popular paradigm for advertisement. (But who can filter that out when it's integrated well enough?)

Neat idea.


Sounds related to ariel: http://ariel.rubyforge.org/



