At some point in the scenario you outline, there will be so much ML-generated content referencing generated content that the (already muddled) primary sources will be lost in the mix, and ML will be forced to synthesize sensible, meaningful content out of conflicting "truth".
What I'm trying to get at is that ML is currently terrible at contextualizing information, but in the future the successful knowledge-query engine will be the one that does the best job of wading through the explosion of digital content and pulling out the parts that are coherent and more connected to reality.
This program will need to be able to contextualize information to form a coherent model where knowledge is interconnected into a larger model of reality.
Easier said than done, we'll see if that is even possible. This is getting towards AGI.
My sense is a whole chunk of the internet is going to just get washed out with the tide as it gets demonetized by search, and that's the stuff most likely to be ML-generated. Meanwhile search engines like Google will start to laser in on fewer and fewer higher quality sources that have some signal, both through manual tuning and automation.
I think it's important to distinguish between ML-assisted and whole-cloth ML generation too. I imagine many high-quality content sources are going to be using ML-assisted writing tools; for instance, at my old news company (edsurge) the journalists spent most of their time gathering info and constructing the story, which was the real meat of the work. The writing and wordsmithing was not operating at the top of their skillset; it was pretty inefficient manual labor, especially the drafts. So from a content production perspective, ML could automate a lot of that away while letting the journalists tune the narrative and edit the output for better nuance - in other words, editing vs. producing.
This also assumes a single output, but if you take the edtech lens and couple it with the power of ML, you could easily have any given news article come in nearly infinite varieties based on reading level, language, and even past knowledge (by including or removing context). As someone interested in new mediums and innovating news, this is absolutely exciting to me. A really interesting question here is whether there's an opportunity for collaboration between expert sources and search engines, such that for a given set of domain-specific queries the expert sources are in charge of fine-tuning the generated outputs.
Completely agree about contextualization. The automated detection of ML content (the defense part) needs to be tuned toward extracting and analyzing the claims being made, since a lot of ML content is nonsensical, or introduces wild, novel claims that lack evidence and alignment with other sources. Some big questions for the analysis are whether those claims are corroborated by other higher-quality sources, whether relevant context is included, and whether it's bringing anything new to the table.
In this kind of setup redundant sites will be ruthlessly demonetized by search. As you say, this also drives us in a direction toward expert systems with some probabilistic measurement of the truth of claims and completeness of context, which may also have actual experts at the controls of some of this (like above). This of course raises a lot of questions where unorthodox ideas may fit in this model.
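The corroboration idea above can be sketched in code. This is a toy illustration only: it uses token-overlap (Jaccard) similarity as a crude stand-in for real claim matching, and every name in it (`corroboration_score`, the threshold value, the example sentences) is hypothetical rather than part of any actual detection system.

```python
def jaccard(a: str, b: str) -> float:
    """Crude token-overlap similarity between two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def corroboration_score(claims, trusted_sources, threshold=0.5):
    """Fraction of claims with at least one sufficiently similar sentence
    in a trusted corpus. Uncorroborated claims are returned separately for
    review -- they may be genuinely novel, or may be generated nonsense."""
    supported, uncorroborated = [], []
    for claim in claims:
        best = max((jaccard(claim, s) for s in trusted_sources), default=0.0)
        (supported if best >= threshold else uncorroborated).append(claim)
    score = len(supported) / len(claims) if claims else 0.0
    return score, uncorroborated

# Illustrative example: one claim aligns with the trusted corpus, one doesn't.
claims = ["the moon orbits the earth", "the moon is made of cheese"]
trusted = ["the moon orbits the earth every 27 days"]
score, novel = corroboration_score(claims, trusted)
# score → 0.5; novel → ["the moon is made of cheese"]
```

A real system would need semantic claim extraction and entailment checking rather than word overlap, but the shape of the pipeline (extract claims, score against trusted sources, flag the unsupported remainder) is the same.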
I'm not an ML or search expert though; my experience is more on the production side.
This sounds scarily similar to descriptions of brains dealing with incomplete information. Good thing brains aren't keen on rationalizing prior beliefs in the face of new evidence or believing spurious things.
I’m not as convinced that “coherent” or “more connected to reality” will be how the value of mass market content will be measured in the future, which makes some of these issues less of a problem for the content generator.
I flip flop between seeing it as inevitable, and overly techno-optimistic. Can definitely see it both ways.
In the arms race between ML-copy generation and its detection, eventually the edge becomes "does this make sense in a coherent world-view", so there should be a natural pressure towards that outcome.
But maybe we're just not good enough at building these tools, and the incentives will align with an endless onslaught of digital sludge.
There's also the problem of: how does legitimate fiction, poetry, love stories, and art fit into a www where only the most logical is allowed to be seen? It could be a logically puritanical nightmare.