Nobody has really talked about what I think is an advantage just as powerful as the custom chips: Google Books. They already won a landmark fair-use lawsuit against book publishers, digitized more books than anyone on earth, and used their reCAPTCHA service to crowdsource the OCR. They've got the best* legal cover and all of the best sources of human knowledge already in hand. Then YouTube for video.

The chips of course push them over the top. I don't know how much Deep Research is costing them, but it's by far the best experience with AI I've had so far, with a generous 20/day rate limit. At this point I must be using up at least 5-10 compute hours a day. Until about a week ago I had almost completely written off Google.

* For what it's worth, I don't know. IANAL



The amount of text in books is surprisingly finite. My best estimate was that there are ~10¹³ tokens available in all books (https://dynomight.net/scaling/#scaling-data), which is less than frontier models are already being trained on. On the other hand, book tokens are probably much "better" than random internet tokens. Wikipedia for example seems to get much higher weight than other sources, and it's only ~3×10¹⁰ tokens.
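
A rough back-of-envelope consistent with that figure (the book count and per-book length here are my assumptions, not numbers from the linked post):

    # Back-of-envelope: total tokens in all published books.
    # Assumptions (mine, not from the linked post): ~130M unique
    # books ever published (Google's 2010 estimate), ~75k words
    # per book, ~1.3 tokens per word for a typical BPE tokenizer.
    books = 130e6
    words_per_book = 75_000
    tokens_per_word = 1.3

    total_tokens = books * words_per_book * tokens_per_word
    print(f"{total_tokens:.1e}")  # ~1.3e+13, i.e. on the order of 10^13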


We need more books! On it…


> And further, by these, my son, be admonished: of making many books there is no end; and much study is a weariness of the flesh.

Ecclesiastes 12:12 ;)


*opens up his favorite chat*


LibGen already exists, and all the top LLM developers train on it. I don't know if Google's own book index provides a big technical or legal advantage.


I'd be very surprised if the Google Books index wasn't much bigger and more diverse than LibGen.


Anna's Archive is at 43M books and 98M papers [1]. That book total is nearly double what Google has.

Google's scanning project basically stalled after the legal battle; the story behind it is a fascinating read [2].

[1] https://annas-archive.org/

[2] https://web.archive.org/web/20170719004247/https://www.theat...


Something that isn't specifically called out but is also super relevant: the transcription of YouTube videos.

Every video is machine-transcribed and stored, and for larger videos the authors will often transcribe them themselves.

Google already has all of this; it doesn't need any more "work" to get, unlike for a competitor.
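
For illustration, here's a minimal sketch of pulling both kinds of transcripts with the third-party youtube-transcript-api package (not an official Google API; the interface shown is the classic pre-1.0 one, and the video ID is just a placeholder):

    # Sketch: fetch YouTube transcripts via the third-party
    # youtube-transcript-api package (pip install youtube-transcript-api).
    # Pre-1.0 interface assumed; VIDEO_ID is a placeholder.
    from youtube_transcript_api import YouTubeTranscriptApi

    VIDEO_ID = "dQw4w9WgXcQ"  # placeholder
    transcripts = YouTubeTranscriptApi.list_transcripts(VIDEO_ID)

    # Machine-generated captions exist for nearly every video...
    auto = transcripts.find_generated_transcript(["en"])

    # ...while some uploaders also provide their own transcript.
    try:
        manual = transcripts.find_manually_created_transcript(["en"])
    except Exception:
        manual = None  # no uploader-provided transcript

    for entry in auto.fetch():  # list of {"text", "start", "duration"}
        print(entry["text"])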


I would think the biggest advantage is YouTube. There's a lot of modern content for analysis that's uncontaminated by LLMs.



