I'd invite you to read "Foundation Models and Fair Use"[1], a paper written as a collaboration between Stanford's law school and computer science department.
It talks at length about this specific problem and mitigation techniques for it:
> Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law.
Sounds like you're agreeing they are legally in murky territory.
Further, new laws get made in reaction to new things whenever they push an existing doctrine beyond its original scope, and these models are certainly in that territory.
> Sounds like you're agreeing they are legally in murky territory.
Of course. As I said originally "This is clearly not a given". It's very unclear how this will be decided, but anyone who thinks that just because models contain copyrighted data they don't have a leg to stand on is very wrong. There are multiple good arguments and precedents to show that they do, depending on the circumstances.
Huh, I still interpret “this is clearly not a given” as being a contradictory reply to me saying it’s murky territory.
I think they contain massive amounts of copyrighted data and can reproduce it exactly, and that’s why they don’t have a leg to stand on. It’s a personal opinion, but I think it's backed by your citation. Thanks for the reference, and glad to chat.
(Additionally, newer LLM-based products like Perplexity.AI correctly cite their content sources, which makes them even more similar to search engines.)