By the same legal argument, couldn't anyone buy one of AggData's datasets and th...

skolor · on Sept 24, 2009

You couldn't distribute their dataset, but you should be able to modify it and resell it.

My guess would be that individual datasets like this would not be worth trying to resell, and if they were it would likely be cheaper to generate them yourself from the sites. For example, taking a look at their most recent dataset (gamestop store locations) it looks to be a simple matter of:

* Find the Gamestop store locator (http://web.sa.mapquest.com/gamestop/?tempset=search, linked to on www.gamestop.com).

* Plug in a list of US zip codes (I suppose you would need one of these for any scraping project of this type, so you might as well buy it someplace).

* Scrape the page that is returned.

* They also include longitude and latitude. They must have a separate database they run the address against to generate that, or it may be encoded into the Mapquest applet.
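The steps above could be sketched roughly as follows. This is a minimal sketch, not a working scraper for that site: `collect_stores`, the `fetch_stores` callback, and the `"address"` field are all hypothetical names, and the actual request and page parsing are left to the caller because the locator's parameters and markup are not described here. The one thing it does show is the merge step, on the assumption that neighbouring zip codes will return overlapping stores.

```python
from typing import Callable, Dict, Iterable, List

Store = Dict[str, str]  # hypothetical record shape, e.g. {"name": ..., "address": ...}


def collect_stores(zip_codes: Iterable[str],
                   fetch_stores: Callable[[str], List[Store]]) -> List[Store]:
    """Query the locator once per zip code and merge the results.

    fetch_stores is supplied by the caller: it takes one zip code and
    returns the stores scraped from the page returned for it.
    """
    seen = set()
    stores: List[Store] = []
    for zip_code in zip_codes:
        for store in fetch_stores(zip_code):
            # Assumption: the same store can show up for several zip codes,
            # so de-duplicate on a case- and whitespace-normalized address.
            key = " ".join(store["address"].lower().split())
            if key not in seen:
                seen.add(key)
                stores.append(store)
    return stores
```

In practice `fetch_stores` would wrap whatever HTTP client and HTML parser you already have set up; passing it in keeps the loop testable without touching the network.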

All in all, for someone who is skilled in data scraping, and already has a scraping tool set up, it's a matter of ~20 minutes worth of coding, followed by (maybe) an hour of scraping. Unless you feel your time is worth >$120 per hour, you would be better off scraping it yourself than purchasing it to resell.
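To put numbers on that, here is the arithmetic using only the figures quoted above. The price of the dataset itself is not given in this thread, so this only shows what the time would cost at $120 per hour; whether the unattended scraping hour should count as "your time" is an assumption either way, so both readings are computed. The helper name `time_cost` is just for illustration.

```python
def time_cost(hourly_rate: float, minutes: float) -> float:
    """Dollar value of `minutes` of work at `hourly_rate` dollars per hour."""
    return hourly_rate * minutes / 60.0


RATE = 120.0         # the hourly figure quoted above
CODING_MIN = 20.0    # "~20 minutes worth of coding"
SCRAPING_MIN = 60.0  # "(maybe) an hour of scraping"

# Counting only the hands-on coding time.
coding_only = time_cost(RATE, CODING_MIN)                   # 40.0

# Counting the scraping run as your time too.
with_scraping = time_cost(RATE, CODING_MIN + SCRAPING_MIN)  # 160.0
```

So at that rate the do-it-yourself route costs somewhere between $40 and $160 of time, depending on how you count the scraping run.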

kbrower · on Sept 24, 2009

You can get a zip code database for free: http://www.populardata.com/zipcode_database.html

frig · on Sept 24, 2009

Yes + no.

NO: The kicker is that if you did that they'd sue you for breach of contract.

YES: once you passed it off to third parties (in violation of contract) it's not clear how strong of a remedy AggData actually has.

Some of the notions + folk wisdom about copyrightability of facts are based on assumptions that increasingly don't hold.

EG: you can't really copyright the factual contents of the phonebook; if I want to compete with the existing phonebook publisher by retyping the phone book they don't have a copyright claim against me (provided I only transcribe the facts, and organize the information in an obvious or mechanical fashion, like alphabetical ordering).

If you were instead to compete with an existing phone book company by literally xeroxing their product they could probably take you to trial (on a theory that the underlying facts aren't under copyright but the specific page layouts and so on are; there's also the issue of the ads you'd be xeroxing but let's not muddle things overmuch).

If this has already happened and been litigated I've never heard of it.

What something like AggData is doing shows some of the conceptual limits of the existing framework:

- existing physical instantiations of abstract "databases" (collections of fact)

-- (1) couldn't be economically "xeroxed" (EG: if you do it cheaply it is visibly inferior-looking; if you do a very high fidelity reproduction it's about as costly as just re-doing it from scratch)

-- (2) had enough "wiggle room" in how they might be represented in a human-friendly medium such that:

--- (A) on the one hand there's the possibility of a viable copyright claim against a "xeroxer" (under the theory that the page layout is under copyright even if the facts themselves are not)

--- (B) on the other hand allowing for the possibility of "retypers" to actually take advantage of the not-copyrighted status of the underlying facts and actually produce a different product (b/c it is possible to reproduce the same facts with a format sufficiently-different from the source you drew them from)

- but as the "database" becomes increasingly digital

-- (1) "xeroxing" is very economical (far more so than "retyping")

-- (2) there's increasingly less meaningful "wiggle room" as to how the facts might be represented in a "database"; changes-of-format don't do much, but once the data is shorn of its human-friendly formatting all the useful ways of storing it are essentially isomorphic, meaning (2.B) above is increasingly unlikely (it may no longer be possible to "clone" the abstract data without being too close to the exact format of the source for legal protection).

If someone's actually seen these issues played out or "settled" I'd love to learn more about it.

qeorge · on Sept 24, 2009

This is an excellent explanation, and gels with what I've understood on the topic.

The most relevant case I know is Feist Publications v. Rural Telephone Service:

http://en.wikipedia.org/wiki/Feist_Publications_v._Rural_Tel...

in which it was determined legal for one phonebook to copy the listings of another in order to create an aggregate directory.

cschneid · on Sept 24, 2009

To summarize for my own learning, you can't copyright facts themselves, but you can copyright the representation of facts (page layout, etc).

The question then is if a CSV file counts as "page layout". How customized does something have to be, before it is a creative work? I doubt CSV goes far enough. I wonder if the ordering and presence of columns in the CSV is enough?

I believe I heard a story about a dictionary suing based on the page numbers being copyrightable, but I can't verify that via google. If that is in fact true, then perhaps a CSV is a creative work protected under copyright.

Boy is copyright broken when faced with computers....

frig · on Sept 24, 2009

Yeah basically accurate so far as I know (disclaimer: for a while I worked @ a publisher of reference works; we had rules-of-thumb explanations for what we could get away with but I'm not an expert).

The real story behind the story you might've heard would've been more like a page-layout type defense with the fact that the contents of each page (by #) were identical between two competing works; throw in the usual misreporting and there you go.

If you track it down please do share.

The CSV thing is really where it gets tricky: generally "trivial" obfuscations or rearrangements to get around existing laws don't fare well in court (and sadly even "trivial" isn't really well-defined), so if a CSV "counts" then rearranging the columns (or randomly sorting the rows, etc.) won't help you. But it's not obvious that a CSV doesn't count either.
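To make the "rearranging the columns (or randomly sorting the rows)" point concrete, here is a small sketch. It says nothing about what a court would decide; it only shows that such a rearrangement is purely mechanical and leaves the underlying facts untouched. The function names and the sample rows are made up for illustration.

```python
import csv
import io
import random


def facts(csv_text: str) -> frozenset:
    """Reduce a CSV to its set of rows, each row a set of (column, value) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return frozenset(frozenset(row.items()) for row in reader)


def rearrange(csv_text: str, seed: int = 0) -> str:
    """Reverse the column order and shuffle the rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = list(reversed(reader.fieldnames))
    random.Random(seed).shuffle(rows)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up sample data, not taken from any real dataset.
original = "name,city,zip\nStore A,Springfield,00001\nStore B,Shelbyville,00002\n"
shuffled = rearrange(original)

# The text of the two files differs, but the facts they carry are the same.
same_facts = facts(shuffled) == facts(original)
```

The two strings differ byte for byte, yet a dozen lines of code are enough to show they hold the same facts, which is why this kind of rearrangement looks "trivial".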

Basically the model is supposed to be:

- you see printed pages containing representations of facts

- you "learn" the facts (eg: "copied" them to your brain)

- you produce printed pages containing representations of facts

...and of course there may be incidental and/or unavoidable resemblances between the two representations but insofar as what you "copied" was just the facts, it was fair game.

It's sort-of tacitly assumed that if you did make an exact copy it'd be pretty obvious and that if you went through the above process what you did would look different enough to be pretty obvious also (unless you deliberately cloned something the hard way, which is stupid enough to be rare).

With straight-digital "database" dumps (like the CSV) you have a situation in which if you went through the full process you'd create something that's pretty much indistinguishable from what you'd get if you just hit ctrl-d; this pretty much breaks the rules of thumb / intuition behind the rules around "facts".

cschneid · on Sept 24, 2009

I think I found it, it was a law book.

http://lists.essential.org/1995/info-policy-notes/msg00019.h...

frig · on Sept 24, 2009

YUP, and it's not quite what I guessed but the situation is pretty similar: non-copyrightable content (legal opinions), successful use of copyright claims over the arrangement.

Incidentally a good historical anecdote; in addition to many aspects in-and-around copyright I tend to think the thinking around public records laws is woefully antiquated (to our general detriment).

Thanks for finding that.

wallflower · on Sept 24, 2009

I imagine that you have to sign something/checkbox an agreement when you purchase a dataset that makes you legally and/or financially at-fault if you are discovered to be the leak/source of the leak.