Nobody has really talked about what I think is an advantage just as powerful as the custom chips: Google Books. They already won a landmark fair-use lawsuit against book publishers, digitized more books than anyone on earth, and used their reCAPTCHA service to crowdsource the OCR. They've got the best* legal cover and all of the best sources of human knowledge already in hand. Then YouTube for video.

The chips of course push them over the top. I don't know how much Deep Research is costing them, but it's by far the best experience with AI I've had so far, with a generous 20/day rate limit. At this point I must be using up at least 5-10 compute hours a day. Until about a week ago I had almost completely written off Google.

* For what it's worth, I don't know. IANAL



The amount of text in books is surprisingly finite. My best estimate was that there are ~10¹³ tokens available in all books (https://dynomight.net/scaling/#scaling-data), which is less than frontier models are already being trained on. On the other hand, book tokens are probably much "better" than random internet tokens. Wikipedia for example seems to get much higher weight than other sources, and it's only ~3×10¹⁰ tokens.
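
A rough back-of-envelope consistent with that figure (the book count and per-book length here are my assumptions, not numbers from the linked post):

    # Back-of-envelope: total tokens in all published books.
    # Assumptions (mine, not from the linked post): ~130M unique
    # books ever published (Google's 2010 estimate), ~75k words
    # per book, ~1.3 tokens per word for a typical BPE tokenizer.
    books = 130e6
    words_per_book = 75_000
    tokens_per_word = 1.3

    total_tokens = books * words_per_book * tokens_per_word
    print(f"{total_tokens:.1e}")  # ~1.3e+13, i.e. on the order of 10^13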


We need more books! On it…


> And further, by these, my son, be admonished: of making many books there is no end; and much study is a weariness of the flesh.

Ecclesiastes 12:12 ;)


*opens up his favorite chat*


LibGen already exists, and all the top LLM developers train on it. I don't know if Google's own book index provides a big technical or legal advantage.


I'd be very surprised if the Google Books index wasn't much bigger and more diverse than LibGen.


Anna's Archive is at 43M books and 98M papers [1]. That book total is nearly double what Google has.

Google's scanning project basically stalled after the legal battle; the story behind it is a fascinating read [2].

[1] https://annas-archive.org/

[2] https://web.archive.org/web/20170719004247/https://www.theat...


Something that isn't specifically called out but is also super relevant: the transcription of YouTube videos.

Every video is machine-transcribed and stored, and for larger videos the authors will often transcribe them themselves.

Google already has all of this; it doesn't need any more "work" to get, unlike for a competitor.
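
For illustration, here's a minimal sketch of pulling both kinds of transcripts with the third-party youtube-transcript-api package (not an official Google API; the interface shown is the classic pre-1.0 one, and the video ID is just a placeholder):

    # Sketch: fetch YouTube transcripts via the third-party
    # youtube-transcript-api package (pip install youtube-transcript-api).
    # Pre-1.0 interface assumed; VIDEO_ID is a placeholder.
    from youtube_transcript_api import YouTubeTranscriptApi

    VIDEO_ID = "dQw4w9WgXcQ"  # placeholder
    transcripts = YouTubeTranscriptApi.list_transcripts(VIDEO_ID)

    # Machine-generated captions exist for nearly every video...
    auto = transcripts.find_generated_transcript(["en"])

    # ...while some uploaders also provide their own transcript.
    try:
        manual = transcripts.find_manually_created_transcript(["en"])
    except Exception:
        manual = None  # no uploader-provided transcript

    for entry in auto.fetch():  # list of {"text", "start", "duration"}
        print(entry["text"])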


I would think the biggest advantage is YouTube. There's a lot of modern content for analysis that's uncontaminated by LLMs.



