i think plausibly being able to use youtube video as training data was the major reason for google to buy youtube in the first place. i'd be very surprised if youtube terms of service actually prohibit google from doing this
also, while a lot of tim's thoughts are excellent, i strongly disagree with this part
> When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid
reading books, watching videos, and attending trainings or other performances are not rights reserved to copyright holders, and indeed the history of copyright law carefully and specifically excludes such activities from requiring copyright licenses. consequently copyright holders do not in fact get paid for them. the first sale doctrine means that, in the usa (where the nyt has filed their lawsuit), not only can copyright holders not charge people for reading books and watching videos, they can't even charge them for reselling used books and videos
this is fundamental to the freedom of thought and inquiry that underlie liberal civilization; it's not a minor detail
> i think plausibly being able to use youtube video as training data was the major reason for google to buy youtube in the first place
When I asked Eric Schmidt the "why did we spend so much money on buying youtube?" question his answer was "if it's the future of television, it was a bargain; if not, we overpaid."
There didn't seem to be any expectation among senior management at the time that it was anything other than "televisions carry advertisements, we want in on that market."
> plausibly ... training data was the major reason for google to buy youtube
I'd agree, and I'd also argue people were totally cool with that until LLM/GenAI happened.
Somehow it's cool and exciting if you feed YouTube data into reconstructing historical artifacts, prototyping self-driving car software, training super-resolution algorithms, and so on, but not GenAI. It's a different thing altogether. It's a double standard, or at least a set of criteria with a hidden decisive criterion.
Just IMO, I think that "double" standard has to be discussed more. It's supposedly about copyright but something is off, and it's definitely not about monetary compensation (neither for individual works of art nor as collective income support). There's something else with GenAI/LLM that makes people want it gone.
e: anecdotal datapoint that people were cool about AI until LLM/GenAI/OpenAI[1] - no talks of safety, training data provenance, societal harm, nothing negative whatsoever from a digital camera news-blog - and it's about a Diffusion model:
Enhance! Google researchers detail new method for upscaling low-resolution images with impressive results
Published Aug 30, 2021 | Gannon Burgett
[...] Or is it? A new blog post on the Google AI Blog showcases a new technology its developed to upscale low-resolution images with incredible results.
Could it be a property of the transformative nature of those non-GenAI models? Using the data to create self-driving systems or enhance existing works is adding value to the pool of work. It takes the copyrighted data and creates something new. GenAI, by comparison, seems to devalue existing works. It takes the same data and creates competing works at best, straight up copies at worst.
> There's something else with GenAI/LLM that makes people want it gone.
Generative AI is in the news a lot right now because clicks. "AI will take your job" is out there a lot, possibly because "writer of low resolution news articles" is what it probably could replace, and so the writers have it on their minds.
I doubt using it as training data was specifically the goal, but Google has always believed more data = more profit over the long term. This is why Gmail launched with unlimited storage.
I remember that Gmail was released on April 1st (on purpose), and many people thought it was a joke because it came with 1GB of storage, while places like Yahoo had like 20MB.
It technically had a limit, but I remember thinking of it as unlimited because the amount of storage available counted up at a faster rate than I was saving emails.
Maybe you're right. It's been a while. Either way, the philosophy was always to make money off the data somehow even if we didn't know how at the time.
“And counting”, famously! I remember watching that counter crawl up and up in disbelief, when Hotmail and local alternatives were offering at most 10MB on their free plans.
At EFF we were arguing with Google about their permanent collection of user data early (I joined in 2005, and we already were putting pressure on them then). Whenever we asked, they said they were sure it would come in useful for improving their services. Google just institutionally strongly believed in the value of data.
I sometimes wonder if the form of our current machine-learning boom is actually based on that conviction and the determined search for applications, rather than modern AI being a vindication of that strategy. A bit like Moore's Law: is it an iron rule of technology, or just a way to coordinate a huge amount of resources across an industry?
It was a long time ago that I read a history on this, and I might be missing a detail, but the gist was Google's investors were clamoring for profits following the .com crash and Google realized the data they had was a gold mine if they could just figure out how to apply ML to it.
They tried really hard and did okay for a while using it for advertising, but DoubleClick did it better, so they bought it in 2008.
google has been an ai company since way before ai was fashionable. they hired norvig before i first met him in 02001. you can find dekhn comments here about larry and sergey talking with him about the central importance of ai back last millennium. i've also heard it myself from other early googlers (though not larry and sergey)
Occam's razor. Google has always been an ads company. YouTube had big ad potential. Saying it was for the training data for AI that wouldn't exist until over a decade later ignores the obvious.