If you mean "exactly as architected currently", then yes, current Transformer-based generative models can't possibly be anything other than a dead end. The architecture will need to change at least a little for progress to continue.
---
1. No matter how smart they get, current models are "only" pre-trained. No amount of "in-context learning" can allow the model to manipulate the shape and connectivity of the latent state-space burned into the model through training.
What is "in-context learning", if not real learning? It's the application of pre-learned general and domain-specific problem-solving principles to novel problems. "Fluid intelligence", you might call it. The context that "teaches" a model to solve a specific problem, is just 1. reminding the model that it has certain general skills; and then 2. telling the model to try applying those skills to solving this specific problem (which it wouldn't otherwise think to do, as it likely hasn't seen an example of anyone doing that in training.)
Consider that a top-level competitive gamer, who mostly "got good" playing one game, will likely nevertheless become nearly top-level in any new game they pick up in the same genre. How? Because many of the skills they picked up while playing their favored game weren't just applicable to that game, but were general strategic skills transferable to other games. This is their "fluid intelligence."
Both a human gamer and a Transformer model derive these abstract strategic insights at training time, and can then apply them across a wide domain of problems.
However, the human gamer can do something that a Transformer model fundamentally cannot. If you introduce the human to a game that they mostly understand, but which sits in a novel genre where playing it well requires one key insight the human has never encountered... then you would expect the human to learn that insight during play. They'll see the evidence of it, they'll derive it, and they'll start using it. They will build entirely novel mental infrastructure at inference time.
A feed-forward network cannot do this.
If there are strategic insights that aren't found in the model's training dataset, then those strategic insights just plain won't be available at inference time. Nothing the model sees in the context can allow it to conjure a novel piece of mental infrastructure from the ether to then apply to the problem.
Whether general or specific, the model can still only use the tools it has at inference time — it can't develop new ones just-in-time. It can't "have an epiphany" and crystallize a new insight from presented evidence. The process that would allow that simply isn't running at inference time; currently, it occurs exclusively at training time.
And this is very limiting, insofar as we want models to do anything domain-specific without first assembling billion-interaction corpora in those domains to feed them. We want models to work like people, training-wise: to "learn on the job."
We've had simpler models that do this for decades now: spam filters are trained online, for example.
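For contrast, here's roughly what that decades-old style of online learning looks like: a toy Naive-Bayes-style spam filter (the class and method names are illustrative, not any particular library's API) whose parameters change with every labeled message it sees, at "inference time":

```python
from collections import defaultdict
import math

class OnlineSpamFilter:
    """Toy Naive-Bayes-style filter trained online, one message at a time."""

    def __init__(self):
        self.counts = {True: defaultdict(int), False: defaultdict(int)}
        self.totals = {True: 0, False: 0}
        self.messages = {True: 0, False: 0}

    def update(self, tokens, is_spam):
        # "Learning on the job": every labeled message changes the model.
        self.messages[is_spam] += 1
        for t in tokens:
            self.counts[is_spam][t] += 1
            self.totals[is_spam] += 1

    def spam_score(self, tokens):
        # Log-odds of spam vs ham, with add-one smoothing; > 0 means "looks like spam".
        score = math.log((self.messages[True] + 1) / (self.messages[False] + 1))
        vocab = len(set(self.counts[True]) | set(self.counts[False])) + 1
        for t in tokens:
            p_spam = (self.counts[True].get(t, 0) + 1) / (self.totals[True] + vocab)
            p_ham = (self.counts[False].get(t, 0) + 1) / (self.totals[False] + vocab)
            score += math.log(p_spam / p_ham)
        return score

f = OnlineSpamFilter()
f.update("cheap pills click now".split(), is_spam=True)
f.update("meeting notes attached".split(), is_spam=False)
print(f.spam_score("click for cheap pills".split()) > 0)  # True: flagged as spam
```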
I would expect that, in the medium term, we'll likely move somewhat away from pure feed-forward models, toward models with real online just-in-time training capabilities. We'll see inference frameworks and Inference-as-a-Service platforms that provide individual customers with "runtime-observed in-domain residual-error optimization adapters" (note: these would not be low-rank adapters!) for their deployment, with those adapters continuously being trained from their systems as an "in the small" version of the async "queue, fan-in, fine-tune" process seen in Inf-aaS-platform RLHF training.
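One way to read that proposal (purely my sketch; `ResidualAdapter` and `online_update` are hypothetical names, and PyTorch is just convenient notation): the base model stays frozen, while a small full-width (deliberately not low-rank) correction module is continuously nudged toward the residual between what the base model produced and what the customer's deployment later observed to be correct:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Per-customer correction applied on top of a frozen base model's hidden state."""
    def __init__(self, d_model: int):
        super().__init__()
        # Full-rank correction (the "not low-rank" point above).
        self.correction = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.correction.weight)   # starts as a no-op
        nn.init.zeros_(self.correction.bias)

    def forward(self, base_hidden: torch.Tensor) -> torch.Tensor:
        return base_hidden + self.correction(base_hidden)

def online_update(adapter, optimizer, base_hidden, target_hidden):
    """One 'in the small' fine-tune step, run asynchronously as feedback
    trickles in from the deployment (queue -> fan-in -> update)."""
    optimizer.zero_grad()
    corrected = adapter(base_hidden.detach())     # base model stays frozen
    loss = nn.functional.mse_loss(corrected, target_hidden)
    loss.backward()
    optimizer.step()
    return loss.item()

d_model = 512
adapter = ResidualAdapter(d_model)
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
# base_hidden / target_hidden would come from logged interactions; random here:
online_update(adapter, opt, torch.randn(8, d_model), torch.randn(8, d_model))
```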
And in the long term, we should expect this to become part of the model architecture itself — with mutable models that diverge from a generic pre-trained starting point through connection weights that are durably mutable at inference time (i.e. presented to the model as virtual latent-space embedding-vector slots to be written to), being recorded into a sparse overlay layer that is gathered from (or GPU-TLB-page-tree Copy-on-Write'ed to) during further inference.
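A minimal sketch of that overlay idea, under my own assumptions about how the "write" would be exposed (nothing here corresponds to an existing API): the pre-trained weights stay immutable and shared, inference-time writes land in a per-session sparse delta map, and the forward pass folds those deltas in, copy-on-write style:

```python
import torch

class OverlaidLinear:
    """Frozen base weights plus a per-session sparse overlay of row deltas."""
    def __init__(self, base_weight: torch.Tensor):
        self.base_weight = base_weight               # shared, read-only pre-trained weights
        self.overlay: dict[int, torch.Tensor] = {}   # per-session sparse deltas

    def write(self, row: int, delta: torch.Tensor):
        # The model (or framework) "writes" to a virtual weight slot at inference time.
        self.overlay[row] = self.overlay.get(row, torch.zeros_like(delta)) + delta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.base_weight
        if self.overlay:
            w = w.clone()                            # copy-on-write only when an overlay exists
            for row, delta in self.overlay.items():
                w[row] += delta                      # gather the session's deltas in
        return x @ w.T

base = torch.randn(64, 32)                           # stands in for a pre-trained layer
layer = OverlaidLinear(base)
layer.write(3, 0.01 * torch.randn(32))               # durable, session-local weight edit
y = layer.forward(torch.randn(1, 32))                # shape (1, 64)
```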
---
2. There is a kind of "expressivity limit" that comes from generative Transformer models having to work iteratively and "with amnesia", against a context window comprised of tokens in the observed space.
Pure feed-forward networks (and all Transformer models are pure feed-forward networks) only seem as intelligent as they do because, outside of the model itself, we break the problem down from "generate an image" or "generate a paragraph" into "generate a single convolution transform for a canvas" or "generate the next word in the sentence", and then loop the model over and over on that one-step problem, with its own previous output as the input.
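That outer loop is worth seeing in full, because it's where all the apparent "long-form" intelligence comes from. Everything below lives outside the model, which only ever answers "given this whole sequence, what's one plausible next token?" (`model` here is a stand-in for any feed-forward next-token predictor):

```python
def generate(model, prompt_tokens, max_new_tokens, end_token):
    """The loop outside the model that turns a one-step next-token predictor
    into an apparent paragraph (or image) generator."""
    context = list(prompt_tokens)                  # input + output-so-far
    for _ in range(max_new_tokens):
        next_token = model(context)                # one constant-bounded inference step
        context.append(next_token)                 # the model's own output becomes its input
        if next_token == end_token:
            break
    return context[len(prompt_tokens):]            # just the generated continuation

# Toy "model" that just counts, to show the control flow:
print(generate(lambda ctx: len(ctx), [10, 20, 30], max_new_tokens=4, end_token=None))
# -> [3, 4, 5, 6]
```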
Now, this approach — using a pure feed-forward model (i.e. one that has constant-bounded processing time per output token, with no ability to "think longer" about anything), and feeding it the entire context (input + output-so-far) on each step, then having it infer one new "next" token at a time rather than entire output sequences at a time — isn't fundamentally limiting.
After all, models could just amortize any kind of superlinear-in-compute-time processing across the inference of several tokens. (And if this was how we architected our models, then we'd expect them to behave a lot like humans: they'd be "gradually thinking the problem through" while saying something, and would sometimes stop themselves mid-sentence and walk back what they said, because their asynchronous long-thinking process arrived at a conclusion that invalidated previous outputs of their surface-level predict-the-next-word process.)
There's nothing that says that a pure feed-forward model needs to be stateless between steps. "Feed-forward" just means that, unlike in a Recurrent Neural Network, there's no step where data is passed "upstream" to be processed again by nodes of the network that have already done work. Each vertex of a feed-forward network is only visited (at most) once per inference step.
But there's nothing stopping you from designing a feed-forward network that, say, keeps an additional embedding vector alongside each latent layer: one that isn't overwritten or dropped between layer activations, but persists across inference steps, getting reused by the same layer in the next step, so that the outputs of layer N-1 from inference-step T-1 are combined with the outputs of layer N-1 from inference-step T to form (part of) the input to layer N at inference-step T. (To get a model to learn to do something with this "tool", you just need to ensure its training measures predictive error over multi-token sequences generated using this multi-step working-memory persistence.)
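Here's a minimal sketch of that wiring (my construction, not an existing architecture; PyTorch used for notation only). Layer N consumes both the current output of layer N-1 and layer N-1's output from the previous inference step, which the framework persisted between steps. It's still strictly feed-forward within a step, and as noted above it would only become useful if training measured error over multi-step generations:

```python
import torch
import torch.nn as nn

class MemoryAugmentedLayer(nn.Module):
    """Layer N: sees layer N-1's output at step T *and* its persisted output from step T-1."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)
        self.memory = None                        # layer (N-1)'s output from the previous step

    def forward(self, prev_layer_out: torch.Tensor) -> torch.Tensor:
        if self.memory is None:
            self.memory = torch.zeros_like(prev_layer_out)
        # Combine the step-T and step-(T-1) views of the layer below.
        combined = torch.cat([prev_layer_out, self.memory], dim=-1)
        out = torch.relu(self.proj(combined))
        # Persist for the next inference step; no data flows back upstream within this step.
        self.memory = prev_layer_out.detach()
        return out

layer = MemoryAugmentedLayer(d_model=64)
step1 = layer(torch.randn(1, 64))                 # memory starts empty
step2 = layer(torch.randn(1, 64))                 # now also sees last step's latent state
```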
...but we aren't currently allowing models to do that. Models currently "have amnesia" between steps. In order to do any kind of asynchronous multi-step thinking, everything they know about "what they're currently thinking about" has to somehow be encoded — compressed — into the observed-space sequence, so that it can be recovered and reverse-engineered into latent context on the next step. And that compression is very lossy.
And this is why ChatGPT isn't automatically a better WolframAlpha. It can tell you how all the "mental algorithms" involved in higher-level maths work — and it can try to follow them itself — but it has nowhere to keep the large amount of "deep" [i.e. latent-space-level] working-memory context required to "carry forward" these multi-step processes between inference steps.
You can get a model (e.g. o1) to limp along by dedicating much of the context to "showing its work" in incredibly-minute detail — essentially trying to force serialization of the most "surprising" output in the latent layers as the predicted token — but this fights against the model's nature, especially as the model still needs to dedicate many of the feed-forward layers to deciding how to encode the chosen "surprising" embedding into the same observed-space vocabulary used to communicate the final output product to the user.
Even if context-window costs were only linear, the cost of this approach to working-memory serialization grows superlinearly relative to the intelligence achieved. It's untenable as a long-term strategy.
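To put rough numbers on that (a back-of-the-envelope sketch; $k$, $s$, and $c$ are illustrative parameters, not anything measured): suppose each "real" reasoning step has to serialize roughly $k$ tokens of scratch work into the context, and per-token inference cost is merely linear in context length, at $c \cdot n$ for a context of $n$ tokens. After $s$ steps the context holds $n \approx ks$ tokens, so step $i$ costs about $c \cdot k \cdot (ki)$, and the total cost of reaching step $s$ is

$$\sum_{i=1}^{s} c\,k\,(k i) \;=\; \Theta\!\left(k^2 s^2\right),$$

i.e. quadratic in reasoning depth, while the useful "depth" of thought only grows linearly in $s$.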
Obviously, my prediction here is that we'll build models with real inference-framework-level working memory.
---
At that point, if you're adding mutable weights and working memory, why not just admit defeat on the Transformer architecture and go back to RNNs?
Predictability, mostly.
The "constant-bounded compute per output token" property of Transformer models, is the key guarantee that has enabled "AI" to be a commercial product right now, rather than a toy in a lab. Any further advancements must preserve that guarantee.
Write-once-per-layer long-term-durable mutable weights preserve that guarantee. Write-once-per-layer retained-between-inference-steps session memory cells preserve that guarantee. But anything with real recurrence does not. Allowing recurrence in a neural network is like allowing backward-branching jumps in a CPU program: it moves you from the domain of guaranteed-to-halt co-programs to the domain of unbounded Turing-machine software.
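To make the distinction concrete (toy Python with trivial stand-in "layers"; the point is the control flow, not the math): the additions proposed above keep the "exactly one bounded pass per token" structure, while true recurrence reintroduces an unbounded loop.

```python
def feed_forward_token(layers, x, memory):
    """One output token: a fixed-length pass over the layers. A write-once
    memory cell per layer changes *what* each layer computes, not *how many
    times* anything runs; compute per token is exactly len(layers) calls."""
    for i, layer in enumerate(layers):
        x, memory[i] = layer(x, memory[i])   # read + a single write to its slot
    return x

def recurrent_token(cell, x, state, halted):
    """True recurrence: the network keeps cycling its own latent state, and
    nothing in the architecture bounds how many cycles one token costs."""
    while not halted(state):                 # backward jump: no guaranteed halt
        state = cell(x, state)
    return state

# Toy stand-ins just to make the control flow concrete:
layers = [lambda x, m: (x + 1, x)] * 4       # always exactly 4 calls per token
print(feed_forward_token(layers, 0, [0] * 4))          # -> 4

cell = lambda x, s: s + x                    # happens to halt quickly here
print(recurrent_token(cell, 1, 0, lambda s: s >= 10))  # -> 10, after 10 cycles
```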