My project: CSVGet -- Get structured data from sites as CSV (github.com/fizx)
52 points by fizx on Aug 28, 2009 | 11 comments


The real magic behind this gem is another one called Parsley, which is an astounding library for screen scraping.

http://github.com/fizx/parsley/tree/master

They've even built an entire app on top of the library: http://parselets.com/


Use http://selectorgadget.com to help make the selectors, too.


But why mess with the selectors by hand? http://search.cpan.org/~vbar/HTML-ListScraper-0.05/ can discover the structure automatically (well, sometimes :-) - some pages just aren't regular enough, but it does work on HN, for example)...


Wow - fantastic!


Wow, incredibly cool. I did the same thing with a collection of curl / grep / sed / awk scripts, and it was awful. I later redid it with a Python library, then with Hpricot, and most recently with scRUBYt. Each step was a little better, but I really should have been working towards a more generalized solution like this.

Well done!


A flag to output JSON would be nifty, then you could pass it through a filter [1] and get interweb-grep on steroids.

[1] http://goessner.net/articles/JsonPath/
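To make the "JSON through a filter" idea concrete, here is a minimal sketch in Python. It is not the JsonPath implementation linked above and not anything shipped with CSVGet; `select` is a hypothetical helper that handles only dotted paths with a `*` wildcard, and the `SAMPLE` document is invented for illustration rather than taken from real scraper output.

```python
import json


def select(data, path):
    """Return every value matching a dotted path.

    '*' matches all items of a list or all values of a dict.
    This is a tiny subset of what JsonPath offers, for illustration only.
    """
    nodes = [data]
    for key in (path.split(".") if path else []):
        next_nodes = []
        for node in nodes:
            if key == "*":
                if isinstance(node, list):
                    next_nodes.extend(node)
                elif isinstance(node, dict):
                    next_nodes.extend(node.values())
            elif isinstance(node, dict) and key in node:
                next_nodes.append(node[key])
        nodes = next_nodes
    return nodes


# Hypothetical scraper output; the real structure depends on the selectors used.
SAMPLE = '{"stories": [{"title": "CSVGet", "points": 52}, {"title": "Other", "points": 7}]}'


def main():
    data = json.loads(SAMPLE)
    # Print one matching title per line, grep-style.
    for title in select(data, "stories.*.title"):
        print(title)


if __name__ == "__main__":
    main()
```

In practice you would read the JSON from standard input instead of `SAMPLE`, so the filter can sit at the end of a shell pipeline.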


Um, you got it. :) I just added a "jsonget" binary to the package.

http://gist.github.com/177304


Now that's service!


Oh my, this rocks. What would be even cooler is a library that exposes these functions. That would beat me to it, since I was thinking about a general data interface for websites too. If I could integrate your work into my code (PHP), that would be awesome. Anyhow, nice idea.


http://github.com/fizx/parsley/tree/master is a C library that contains the core parsing functionality. A PHP binding is quite possible; there are already bindings for Python and Ruby. In fact, I might just go look at how PHP bindings are done in general...


This is fantastic. I do a load of web scraping, and this (plus SelectorGadget) is going to make my life sooooo much easier.



