My project: CSVGet -- Get structured data from sites as CSV

mc · on Aug 29, 2009

The real magic behind this gem is another one called Parsley, which is an astounding library for screen scraping.

http://github.com/fizx/parsley/tree/master

They've even built an entire app on top of the library: http://parselets.com/

tectonic · on Aug 28, 2009

Use http://selectorgadget.com to help make the selectors, too.

vbar · on Aug 29, 2009

But why mess with the selectors by hand? http://search.cpan.org/~vbar/HTML-ListScraper-0.05/ can discover the structure automatically (well, sometimes :-) - some pages just aren't regular enough, but it does work on HN, for example)...

hadley · on Aug 29, 2009

Wow - fantastic!

sgrove · on Aug 28, 2009

Wow, incredibly cool. I did the same thing with a collection of curl / grep / sed / awk, and it was awful. I later redid it with some Python library, then with Hpricot, and most recently with scRUBYt!. Each step was a little bit better, but I really should have been working towards a more generalized solution like this.

Well done!

skorgu · on Aug 28, 2009

A flag to output JSON would be nifty, then you could pass it through a filter [1] and get interweb-grep on steroids.

[1] http://goessner.net/articles/JsonPath/
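A rough sketch of what that filtering step could look like, assuming the tool emits JSON shaped like the sample below (the shape, the `select` helper, and the dotted-path syntax are all invented here; this is a tiny stand-in, not a JsonPath implementation):

```python
import json

# Hypothetical JSON, standing in for what a scraper might emit.
SCRAPED = """
{"stories": [
  {"title": "CSVGet", "points": 42},
  {"title": "Other story", "points": 7}
]}
"""

def select(node, path):
    """Return every value matching a dotted path; '*' steps into list items."""
    nodes = [node]
    for step in path.split("."):
        matched = []
        for current in nodes:
            if step == "*" and isinstance(current, list):
                matched.extend(current)
            elif isinstance(current, dict) and step in current:
                matched.append(current[step])
        nodes = matched
    return nodes

data = json.loads(SCRAPED)
titles = select(data, "stories.*.title")
# The "grep" part: keep only records that satisfy a predicate.
popular = [s["title"] for s in select(data, "stories.*") if s["points"] > 10]
print(titles)
print(popular)
```

A path that matches nothing simply returns an empty list, which keeps it friendly to use in a pipeline.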

fizx · on Aug 28, 2009

Um, you got it. :) I just added a "jsonget" binary to the package.

http://gist.github.com/177304

skorgu · on Aug 29, 2009

Now that's service!

Aschwin · on Aug 28, 2009

Oh my, this rocks. What would be even cooler is a library exposing these functions. You've beaten me to it, since I was thinking about a general data interface for websites too. If I could integrate your work into my code (PHP), that would be so awesome. Anyhow, nice idea.

fizx · on Aug 28, 2009

http://github.com/fizx/parsley/tree/master is a C library which represents the core of the parsing functionality. A PHP binding is quite possible, and indeed, there already are bindings for Python and Ruby. In fact, I might just go look at how PHP bindings are done in general...

hadley · on Aug 29, 2009

This is fantastic. I do a load of web data scraping, and this (plus SelectorGadget) is going to make my life sooooo much easier.