AggData: Datasets created from scraping the web

weirdwes · on Sept 24, 2009

Before the App Store hit and iPhone web applications were all the rage, I started working on a restaurant locator. Oddity Software was a company I came across that provides datasets like the ones from AggData, though I'm not sure if it's from scraping the web. I figured I'd give it a mention in case people came here searching for additional resources. They definitely have more in the way of free lists (http://www.odditysoftware.com/free_lists.html), though I can't personally vouch for accuracy or timely updates as I haven't used them.

Listable (http://www.listable.org) is another list-type service, though its lists are much less complex and are user-created.

I'll be adding AggData to my bookmarks, though. I could see myself using at least one of their "FreeData" lists in the future and possibly some of their paid ones.

aggdata · on Sept 24, 2009

Wow, this discussion is way deeper than we have ever gotten into at AggData. In fact, "frig", I think we may need to hire you. :) We have been very particular in the type of data we collect for some of these very reasons, and we feel that the location data was enough in the public domain to protect us from infringement allegations. We don't currently have much in place to pursue those trying to resell our data, and it hasn't really been a problem yet. I think, as mentioned, reselling doesn't make much economic sense.

A couple of other quick responses: yes, we know our search is kind of lacking now, and we're working to fix it. Also, we have major plans of offering bulk data and specific regional data; we're currently just working on expanding our library, though.

Thank you, everyone, for your insight! -Chris Hathaway, AggData LLC

(and seriously, frig, send us a message on our contact page, I have more questions for you)

toppy · on Sept 24, 2009

Hey, AggData guys, why not change your business model and sell your data in bulk? Wouldn't it be nice to use it this way?

    # Hypothetical module: what an importable bulk AggData package might look like.
    from aggdata.dealership_locations import cadillac

    print("Cadillac Dealers in NY:")
    for loc in cadillac:
        if loc.city == 'New York':
            print(loc.address, loc.phonenumber)
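Until something like that exists, roughly the same loop can be run over a downloaded CSV. This is only a minimal sketch: the column names (`city`, `address`, `phone`) and the sample rows are made up for illustration and are not AggData's actual schema.

```python
import csv
import io

# Made-up sample standing in for a downloaded dealer list; the column
# names are assumptions, not AggData's actual schema.
SAMPLE_CSV = """name,address,city,state,phone
Example Cadillac,1 Example Ave,New York,NY,555-0100
Sample Motors,2 Sample St,Albany,NY,555-0101
"""


def dealers_in_city(csv_text, city):
    """Return (address, phone) pairs for rows whose city column matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["address"], row["phone"]) for row in reader if row["city"] == city]


if __name__ == "__main__":
    print("Cadillac Dealers in New York:")
    for address, phone in dealers_in_city(SAMPLE_CSV, "New York"):
        print(address, phone)
```

With a real file you would pass the file's contents instead of `SAMPLE_CSV` and adjust the column names to whatever the header row actually says.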

timmaah · on Sept 24, 2009

Their "FreeData" sets could use some attention.

I realize it is free, but if you are going to have it as an example of what you do, have it correct and up to date.

The headers for the congress data are completely off, and the data is not current: Franken, Kennedy, etc.

vijayr · on Sept 24, 2009

If you are interested in data, here are some sites to get it from:

http://theinfo.org/get/data

http://infochimps.org

http://developer.amazonwebservices.com/connect/kbcategory.js...

http://ckan.net

EDIT: Comprehensive list here http://www.datawrangling.com/some-datasets-available-on-the-...

utnick · on Sept 24, 2009

I have looked into making a business like this before; there are quite a few of them, and I do like scraping.

But don't you have to break a lot of 'terms of use' agreements to scrape this data? Could you get in legal trouble for that?

cschneid · on Sept 24, 2009

Maybe, but there's a decent legal argument that it's not copyrightable data (facts of where stuff is for example), and that it's publicly posted ("find a store near you" links).

I suppose it gets more and more fuzzy if you move into non-fact data, and you open yourself up more to lawsuits.

evgen · on Sept 24, 2009

The facts themselves are not subject to copyright, although if some business decides it does not like how its TOS is being violated it would probably stuff a few "fake" entries in there that would be returned to IPs that trigger a spider warning and then later go back and see if they were put into the db; if so you go after aggdata for violating a copyright on the collection as a whole. The data being provided by the businesses that are being scraped is not much different than that provided by a map service provider or phone book provider -- the individual facts are not subject to copyright, but the collection as a whole is and if aggdata is basically doing the web equivalent of photocopying a phone book and selling it to you then they are open to attacks from this vector.

cschneid · on Sept 24, 2009

"The collection as a whole" isn't necessarily copyrightable as far as I know (not that I'm anywhere near a lawyer...). I think curated collections (say, of short stories) are copyrightable since it requires artistic decisions to be made, but a website like McDonalds that has a big list of locations is just a statement of fact, with no artistic basis.

Maps have artwork that is protected, and phone books have page layout, font choice, etc. I don't think that textual addresses of a set of facts would be protected, but I think that it's a gray area of law that hasn't been decided.

evgen · on Sept 24, 2009

It is not just the "artwork" of a map that is protected, or else you would see people turning online map databases into collections of lines and points. The entire collection, as a whole, can be subject to a compilation copyright. This is why mapmakers put in little bits of fake data, so that if their fake data turns up in your map they can nail you for copyright infringement.
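Checking for planted entries is mechanically simple. A rough sketch of the idea, where the trap records and the suspect rows are invented for illustration and the matching rule (lowercased, whitespace-collapsed address) is just one assumption about how you might compare them:

```python
def normalize_address(row):
    """Lowercase and collapse whitespace so trivial edits do not hide a match."""
    return " ".join(row["address"].lower().split())


def find_trap_entries(suspect_rows, trap_entries):
    """Return the planted (fake) entries that appear in a suspect dataset."""
    suspect_addresses = {normalize_address(row) for row in suspect_rows}
    return [trap for trap in trap_entries if normalize_address(trap) in suspect_addresses]


# Invented data: one fake store the publisher planted, and a dataset under suspicion.
TRAPS = [{"address": "999 Nowhere Blvd"}]
SUSPECT = [
    {"address": "1 Example Ave"},
    {"address": "999  NOWHERE blvd"},
]

if __name__ == "__main__":
    print(find_trap_entries(SUSPECT, TRAPS))
```

Any hit is evidence the suspect dataset was copied from yours rather than compiled independently, since the fake entry does not exist anywhere else.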

lonestar · on Sept 24, 2009

By the same legal argument, couldn't anyone buy one of AggData's datasets and then publish it for free on the internet?

skolor · on Sept 24, 2009

You couldn't distribute their dataset, but you should be able to modify it and resell it.

My guess would be that individual datasets like this would not be worth trying to resell, and if they were it would likely be cheaper to generate them yourself from the sites. For example, taking a look at their most recent dataset (gamestop store locations) it looks to be a simple matter of:

* Find the Gamestop store locator (http://web.sa.mapquest.com/gamestop/?tempset=search, linked to on www.gamestop.com).

* Plug in a list of US zip codes (I suppose you would need one of these for any scraping project of this type, so you might as well buy it someplace).

* Scrape the page that is returned.

* They also include longitude and latitude. They must have a separate database they run the address against to generate that, or it may be encoded into the Mapquest applet.

All in all, for someone who is skilled in data scraping, and already has a scraping tool set up, it's a matter of ~20 minutes' worth of coding, followed by (maybe) an hour of scraping. Unless you feel your time is worth >$120 per hour, you would be better off scraping it yourself than purchasing it to resell.
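The steps above come down to a loop like the following. It is only a sketch: `fetch_stores` is a hypothetical function you would write yourself to request and parse the locator page for one zip code, and the canned results below stand in for real responses so the example runs offline.

```python
def scrape_store_locator(zip_codes, fetch_stores):
    """Query a store locator once per zip code and merge the results.

    fetch_stores is a caller-supplied (hypothetical) function that takes a
    zip code and returns a list of dicts with at least an "address" key.
    """
    stores_by_address = {}
    for zip_code in zip_codes:
        for store in fetch_stores(zip_code):
            # Neighboring zip codes return overlapping stores, so key on address.
            stores_by_address.setdefault(store["address"], store)
    return list(stores_by_address.values())


# Canned responses standing in for the real locator; all values are invented.
CANNED = {
    "10001": [{"address": "1 Example Ave"}, {"address": "2 Sample St"}],
    "10002": [{"address": "2 Sample St"}, {"address": "3 Test Rd"}],
}


def fake_fetch(zip_code):
    """Return canned results instead of making a network request."""
    return CANNED.get(zip_code, [])


if __name__ == "__main__":
    for store in scrape_store_locator(["10001", "10002", "10003"], fake_fetch):
        print(store["address"])
```

The real work is in the fetch function (request, parse, rate-limit); the loop and the de-duplication are the same for pretty much any locator.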

kbrower · on Sept 24, 2009

You can get a zip code database for free: http://www.populardata.com/zipcode_database.html

frig · on Sept 24, 2009

Yes + no.

NO: The kicker is that if you did that they'd sue you for breach of contract.

YES: once you passed it off to third parties (in violation of contract) it's not clear how strong of a remedy AggData actually has.

Some of the notions + folk wisdom about copyrightability of facts are based on assumptions that increasingly don't hold.

EG: you can't really copyright the factual contents of the phonebook; if I want to compete with the existing phonebook publisher by retyping the phone book they don't have a copyright claim against me (provided I only transcribe the facts, and organize the information in an obvious or mechanical fashion, like alphabetical ordering).

If you were instead to compete with an existing phone book company by literally xeroxing their product they could probably take you to trial (on a theory that the underlying facts aren't under copyright but the specific page layouts and so on are; there's also the issue of the ads you'd be xeroxing but let's not muddle things overmuch).

If this has already happened and been litigated I've never heard of it.

What something like AggData is doing shows some of the conceptual limits of the existing framework:

- existing physical instantiations of abstract "databases" (collections of fact)

-- (1) couldn't be economically "xeroxed" (EG: if you do it cheaply it is visibly inferior-looking; if you do a very high fidelity reproduction it's about as costly as just re-doing it from scratch)

-- (2) had enough "wiggle room" in how they might be represented in a human-friendly medium such that:

--- (A) on the one hand there's the possibility of a viable copyright claim against a "xeroxer" (under the theory that the page layout is under copyright even if the facts themselves are not)

--- (B) on the other hand allowing for the possibility of "retypers" to actually take advantage of the not-copyrighted status of the underlying facts and actually produce a different product (b/c it is possible to reproduce the same facts with a format sufficiently-different from the source you drew them from)

- but as the "database" becomes increasingly digital

-- (1) "xeroxing" is very economical (far more so than "retyping")

-- (2) there's increasingly less meaningful "wiggle room" as to how the facts might be represented in a "database"; changes of format don't do much, since once the data is shorn of its human-friendly formatting all the useful ways of storing it are essentially isomorphic, meaning (2.B) above is increasingly unlikely (it may no longer be possible to "clone" the abstract data without being too close to the exact format of the source for legal protection).

If someone's actually seen these issues played out or "settled" I'd love to learn more about it.

qeorge · on Sept 24, 2009

This is an excellent explanation, and gels with what I've understood on the topic.

The most relevant case I know is Feist Publications v. Rural Telephone Service:

http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Tel...

in which it was determined legal for one phonebook to copy the listings of another in order to create an aggregate directory.

cschneid · on Sept 24, 2009

To summarize for my own learning, you can't copyright facts themselves, but you can copyright the representation of facts (page layout, etc).

The question then is if a CSV file counts as "page layout". How customized does something have to be, before it is a creative work? I doubt CSV goes far enough. I wonder if the ordering and presence of columns in the CSV is enough?

I believe I heard a story about a dictionary suing based on the page numbers being copyrightable, but I can't verify that via google. If that is in fact true, then perhaps a CSV is a creative work protected under copyright.

Boy is copyright broken when faced with computers....

frig · on Sept 24, 2009

Yeah basically accurate so far as I know (disclaimer: for awhile I worked @ a publisher of reference works; we had rules-of-thumb explanations for what we could get away with but I'm not an expert).

The real story behind the story you might've heard would've been more like a page-layout type defense with the fact that the contents of each page (by #) were identical between two competing works; throw in the usual misreporting and there you go.

If you track it down please do share.

The CSV thing is really where it gets tricky: generally "trivial" obfuscations or rearrangements to get around existing laws don't fare well in court (and sadly even "trivial" isn't really well-defined), so if a CSV "counts" then rearranging the columns (or randomly sorting the rows, etc.) won't help you. But it's not obvious that a CSV doesn't count either.
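To make the "rearranging won't help" point concrete, here is a small sketch (file contents invented for illustration) that reduces a CSV to a form that ignores column order and row order. Two files that differ only by that kind of shuffling come out identical:

```python
import csv
import io


def canonical_rows(csv_text):
    """Reduce a CSV to an order-independent form.

    Each row becomes a sorted tuple of (column, value) pairs, and the rows
    themselves are sorted, so neither column order nor row order matters.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted(tuple(sorted(row.items())) for row in reader)


# Invented example: same facts, columns swapped and rows reordered.
ORIGINAL = "name,city\nStore A,Albany\nStore B,Buffalo\n"
REARRANGED = "city,name\nBuffalo,Store B\nAlbany,Store A\n"

if __name__ == "__main__":
    print(canonical_rows(ORIGINAL) == canonical_rows(REARRANGED))
```

This says nothing about what a court would decide; it only shows how little a cosmetic rearrangement changes about the underlying data.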

Basically the model is supposed to be:

- you see printed pages containing representations of facts

- you "learn" the facts (eg: "copied" them to your brain)

- you produce printed pages containing representations of facts

...and of course there may be incidental and/or unavoidable resemblances between the two representations but insofar as what you "copied" was just the facts you had fair game.

It's sort-of tacitly assumed that if you did make an exact copy it'd be pretty obvious and that if you went through the above process what you did would look different enough to be pretty obvious also (unless you deliberately cloned something the hard way, which is stupid enough to be rare).

With straight-digital "database" dumps (like the CSV) you have a situation in which if you went through the full process you'd create something that's pretty much indistinguishable from what you'd get if you just hit ctrl-d; this pretty much breaks the rules of thumb / intuition behind the rules around "facts".

cschneid · on Sept 24, 2009

I think I found it, it was a law book.

http://lists.essential.org/1995/info-policy-notes/msg00019.h...

frig · on Sept 24, 2009

YUP, and it's not quite what I guessed but the situation is pretty similar: non-copyrightable content (legal opinions), successful use of copyright claims over the arrangement.

Incidentally a good historical anecdote; in addition to many aspects in-and-around copyright I tend to think the thinking around public records laws is woefully antiquated (to our general detriment).

Thanks for finding that.

wallflower · on Sept 24, 2009

I imagine that you have to sign something/checkbox an agreement when you purchase a dataset that makes you legally and/or financially at-fault if you are discovered to be the leak/source of the leak

MrMatt · on Sept 24, 2009

I don't think that something being a fact makes it exempt from copyright. Map data is copyrighted, and often has minor inaccuracies in order to indicate copyright infringement.

frig · on Sept 24, 2009

CF my longer post. Basically in the USA at least (EU has "database directive") the situation is that "facts" aren't themselves copyrightable but the intuitions / guidelines non-experts (like me) have to go from are very much pre-internet and pre-computers.

In the map case: if I make a map of the coast of Florida by:

- looking at maps A, B, and C

- drawing my own map based on what I learned

...then the publishers of A (B or C) can't come sue me for violating their copyright over the shape of the coast of Florida; they have copyright over their specific depiction of that shape but not the shape itself.

Once you move into all-digital datasets a lot of the grounding assumptions are no longer there (perfect reproduction is easy; the data is more abstract and may only really be representable in one way).

ashishk · on Sept 24, 2009

I think that's something to worry about if sales grow to significant levels.

There are probably several solutions for that problem.

mitko · on Sept 24, 2009

Only "Locations" kinds of data? And the search is awful: it couldn't find anything for McDonalds, for example (http://aggdata.com/search/node/McDonalds)

I was hoping to use it as a possible alternative to http://archive.ics.uci.edu/ml/ for ML data sets but now I am kind of disappointed.

jrwoodruff · on Sept 24, 2009

I just used the search to find 'mcdonald' and it came right up with the McDonalds data.

timmaah · on Sept 24, 2009

He forgot the '

Searching for mcdonald's works fine as well.
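My guess is the search matches on the stored spelling, so "McDonalds" misses an entry stored as "McDonald's". Normalizing both sides before comparing would fix that. A sketch of the idea only; this is not AggData's search code, and the names in the list are placeholders:

```python
import re


def normalize(text):
    """Lowercase and drop everything except letters, digits, and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def search(query, names):
    """Return names whose normalized form contains the normalized query."""
    normalized_query = normalize(query)
    return [name for name in names if normalized_query in normalize(name)]


# Placeholder dataset titles for illustration.
NAMES = ["McDonald's", "GameStop"]

if __name__ == "__main__":
    for query in ("McDonalds", "mcdonald", "mcdonald's"):
        print(query, search(query, NAMES))
```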