> To your point, I have wondered whatever became of that massive initiative from...

glenstein · on Dec 23, 2024

I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and already was, aware of the wiki article and the role of reCAPTCHA.

Maybe I wasn't clear, but I was interested in the consequences of the legal stuff. It's not clear from the wiki article what any of this means with respect to the suitability of scans for AI training.

ben_w · on Dec 23, 2024

> I don't know what you mean by timing (relative to what?) or "simple indexing" (they scanned the complete contents of books), but I am, and already was, aware of the wiki article and the role of reCAPTCHA.

Timing as in: it started in 2004, when the most advanced AI most people used was a spam filter, so it wasn't seen as a training issue (in the way that LLMs are) *at the time*.

As for training rights, I agree with you: there's no clarity on how such data could be used *today* by the people who have it, especially as the arguments in favour of LLM training often rest on a comparison to search engine indexing.

fragmede · on Dec 23, 2024

Until such time as a lawsuit declares otherwise, Google's position is obviously that scanning books, OCRing them, saving that text in a database, and using that to allow searching is no different, legally, than scanning books, OCRing them, saving that text in a database, and using that to train LLMs. Book publishers already went up against Google over the practice of scanning in the first place; we'll see if they try again with LLM training.