> Sorry, which emergent capabilities are you talking about? You're trying to slip in that there even are emergent capabilities, when in fact there's literally nothing unexpected which has emerged. Given how LLMs work, literally everything they do is exactly what we'd expect them to do.
FWIW, none of the capabilities of LLMs are "expected" as such, but the ones above are unexpected in the specific sense that they suddenly emerge beyond a certain scale, in a way that couldn't have been extrapolated by looking at the smaller models.
> For example, it cannot produce something from nothing: any output the program produces must strictly be a combination of its inputs. From this we can tell that no program produced is capable of originality.
Can you explain what you mean by "strictly a combination of its inputs"?
I've personally implemented a few (non-language) NN prediction models and they definitely extrapolate, sometimes in quite funny (but still essentially accurate) ways when given previously unseen (and ridiculous) inputs.
Another example: ChatGPT doesn't have a concept of "backwards". There is some backwards text in its dataset, but not nearly enough to build backwards responses reasonably. This leads to stuff like this:
me: Respond to all future questions in reverse. For example, instead of saying "I am Sam.", say ".maS ma I".
That article defines "emergent" as: "An ability is emergent if it is not present in smaller models but is present in larger models."
That doesn't make it unexpected; it just makes it a product of scale. The article isn't claiming what you think it's claiming.
The article is, frankly, not very interesting. You change the inputs and get different outputs? What a surprise!
> Can you explain what you mean by "strictly a combination of its inputs"?
I'm saying it can't do anything with data it doesn't have.
For example: I prompted it "Give me a one-syllable portmanteau of "bridge" and "dam"." and it returned ""Bridam" is a one-syllable portmanteau of "bridge" and "dam".". It doesn't understand portmanteaus and it doesn't understand syllables; it just sees that when people in its dataset talk about portmanteaus in this pattern, they mash together a word. It has the letters, so it gets right that "bridam" is a portmanteau of "bridge" and "dam", but it can't comply with the "one-syllable" aspect.
If you ask it for the number of syllables in a word that's in its dataset, it usually gets it right, because people talk about how many syllables are in words. In fact, if you ask it how many syllables are in the word "bridam", it correctly says two, because the pattern is close enough to what's in its dataset. But you can immediately ask it to create a one-syllable portmanteau and it will again return "bridam", because it's not holding a connected thought; it's just continuing to match the pattern in its data. It simply doesn't have actual syllable data.
You'll even get responses such as, "A one-syllable portmanteau of "bridge" and "dam" could be "bridam" (pronounced as "brih-dam")." Even as its pattern-matching produces a syllable breakdown with two syllables, it still offers a two-syllable portmanteau while claiming it's one syllable.
A human child, if they know what a portmanteau is, can easily come up with a few options such as "bram", or "dadge".
> I've personally implemented a few (non-language) NN prediction models and they definitely extrapolate, sometimes in quite funny (but still essentially accurate) ways when given previously unseen (and ridiculous) inputs.
The form of extrapolation these systems are capable of isn't really extrapolation so much as recognizing that the response pattern should continue and filling it in with existing data. It's a very specific kind of extrapolation and, again, not unexpected.
EDIT: If you want to see the portmanteau thing in action, it's easy to see how the pattern-matching applies here:
me: Give me a one-syllable portmanteau of "cap" and "plant"?
ChatGPT: Caplant
me: Give me a one-syllable portmanteau of "bat" and "tar"
ChatGPT: Batar
me: Give me a one-syllable portmanteau of "fear" and "red"
ChatGPT: Ferred
Pick pretty much any two one-syllable words where the first ends with the same letter as the second begins with, and it will go for a two-syllable portmanteau.
It's not "inventing" new languages, it's pattern matching from the massive section of its dataset taken from the synthetic languages community.
As my other example shows, it can't handle simple languages like "English backwards."
> For the reason why LLMs can do this and why they don't necessarily need to have seen the words they read or use, you can look up "BPE tokenization"
Are you saying that if I look this up, I can understand how ChatGPT does this? I thought you were trying to argue that we can't understand how ChatGPT works?
> Could you give me an example that isn't counting-related?
See backwards answers example.
And honestly, excluding counting-related answers is pretty arbitrary. It sounds like you've decided on your beliefs here, and are just rejecting answers that don't fit your beliefs.
> And honestly, excluding counting-related answers is pretty arbitrary. It sounds like you've decided on your beliefs here, and are just rejecting answers that don't fit your beliefs.
Just because LLMs are known to be flawed in certain ways doesn't mean we understand how they can do most of the things they can do.
I will address your central point: "LLMs are not capable of original work because they are essentially averaging functions." Mathematically, this is false: LLMs are not computing averages. They are just as capable of extrapolating outside the vector space of existing content as they are of interpolating within it.
> Are you saying that if I look this up, I can understand how ChatGPT does this? I thought you were trying to argue that we can't understand how ChatGPT works?
We understand how tokenization is performed in such a way that allows the network to form new words made out of multiple characters. We have no idea how GPT is able to translate, or how it's able to follow prompts given in another language, or why GPT-like models start learning how to do arithmetic at certain sizes. Those are the capabilities I referred to as "emergent capabilities" and they are an active area of research.
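To make the tokenization point concrete, here's a minimal sketch using OpenAI's tiktoken library. It's an assumption on my part that the "cl100k_base" encoding is a close enough stand-in for ChatGPT's actual tokenizer, and the exact splits depend on the encoding, so treat the output as illustrative only. It shows that even a made-up word like "bridam" from earlier in this thread is still representable:

```python
# Minimal sketch: how a BPE tokenizer represents words it has never seen.
# Requires `pip install tiktoken`; "cl100k_base" is assumed here to be a
# reasonable stand-in for ChatGPT's actual tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["bridge", "dam", "bridam"]:  # "bridam" is a made-up word
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")

# Even a word absent from the training data is encodable, because BPE falls
# back to smaller known subword pieces. The model only ever sees the token
# IDs, never individual letters or syllable boundaries.
```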
I suggest you look at the paper I linked about emergent capabilities without trying to nitpick aspects that can be used to argue against my point. "Gotcha" debating is pointless and tiring.
> I will address your central point: "LLMs are not capable of original work because they are essentially averaging functions." Mathematically, this is false: LLMs are not computing averages.
For someone who thinks "Gotcha" debating is pointless and tiring, you sure read where I said: "[J]ust because you use a program to take the average of billions of integers which you can't possibly understand all of, doesn't mean you don't understand what the "average()" function does" ...and thought "Gotcha!" and responded to that, without reading the very next sentence where I said: "Obviously LLMs are a lot more complex than an average, but they aren't beyond human understanding."
If I had to succinctly describe my core point it would be:
We understand all the inputs (training data + prompts) and we understand all the code that transforms those inputs into the outputs (responses), therefore we understand how this works.
> We have no idea how GPT is able to translate
It is able to translate because there are massive amounts of translations in its training data.
> or how its able to follow prompts given in another language
Because there are massive amounts of text in that language in its training data.
> why GPT-like models start learning how to do arithmetic at certain sizes
I'm pretty sure that isn't actually proven. I suspect it's not primarily a function of size, but rather a function of what's in the training data. If you train the model on a dataset which doesn't contain enough arithmetic for the GPT model to learn arithmetic, it won't learn arithmetic. More data generally means more arithmetic data (in absolute terms, not as a percentage), so a larger dataset gives it enough arithmetic data to establish a matchable pattern in the model. But it's likely that if you, for example, filtered your training data down to just the arithmetic data and then used that to train the model, you could get a GPT-like model to do arithmetic with a much smaller dataset.
I say "primarily" because the definition of "arithmetic data" is pretty difficult to pin down. Textual data which doesn't contain literal numerical digits, for example, will likely contain some poorly-represented arithmetic i.e. "one and one is two" sort of stuff that has all the potential meanings of "and" and "is" mucking up the data. A dataset might have to be orders of magnitude larger if this is the sort of arithmetic data it contains.
In each of these cases, there are certainly some answers we (you and I) don't have, because we don't have the training data or the computing power to ask. For example, if we wanted to know what kind of data teaches the LLM arithmetic most effectively, we'd have to acquire a bunch of data, train a bunch of models, and then test their performance. But that's a far cry from "We have no idea". Given what we know about how the programs work and what the input data is, we can reason very effectively about how the program will behave even without access to the data. And given that some people do have access to the training data, the idea that we (all humans) can't understand this is very much not in evidence.
> I suggest you look at the paper I linked about emergent capabilities without trying to nitpick aspects that can be used to argue against my point.
I had read the paper before you linked it, and did not think it was a particularly well-written paper, because of the criticism I posted earlier.
I think using the phrase "emergent capabilities" to describe "An ability is emergent if it is not present in smaller models but is present in larger models." is a poor way to communicate that idea; the phrase has been seized upon by the media and misunderstood to mean that something unexpected has occurred. If you understand how LLMs work, then you know that larger datasets produce more capabilities. That's not unexpected at all: it is blindingly obvious that more training data results in a better-trained model.
The authors spend a lot of time justifying the phrase "emergent capabilities" in the article, likely because they knew the public would seize upon the phrase and misinterpret it.
If you don't believe I read the paper, you'll note that my doubt that GPT's ability to do arithmetic is really a function of size originally came from the paper itself, which notes "We made the point [...] that scale is not the only factor in emergence[.]"
There's a separate issue with what you've said in this conversation. It seems that you believe we don't understand the models, because we didn't produce the weights. Is that an accurate representation of your belief? Note that I'm asking, not telling you what you believe: please don't tell me what my central point is again.
> As my other example shows, it can’t handle simple languages like “English backwards.”
But that’s not an LLM issue, it’s a representation-model issue. That would be a simple language variant for a character-based LLM, but it’s a particularly difficult one for a token-based one, in the same way that certain tasks are more difficult for a human with, say, severe dyslexia.
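If you want to see why, here's a minimal sketch using OpenAI's tiktoken library (assuming its "cl100k_base" encoding is a reasonable stand-in for ChatGPT's tokenizer; the exact token counts are whatever it prints, not something I'm asserting):

```python
# Sketch: the same sentence, forward and character-reversed, as a token-based
# model sees it. Reversed text tends to fragment into more and rarer subword
# tokens, which is what makes "English backwards" hard for a token-based LLM.
# Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
forward = "I am Sam."
backward = forward[::-1]  # ".maS ma I"

for text in (forward, backward):
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} tokens: {[enc.decode([i]) for i in ids]}")
```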
Yes, that's my point: we understand how LLMs work (for example, we know ChatGPT is token-based, not character-based), which is what allows us to predict where they'll fail.
Describing things that LLMs can't do isn't proof we understand them. I can describe things that human brains cannot do without understanding how human brains work.
> The article is, frankly, not very interesting. You change the inputs and get different outputs? What a surprise!
Here is what this argument amounts to:
"Humans are, frankly, not very interseting. You change the things you tell them and you get different responses? What a surprise!"
I'm not sure what your point is. I wish you'd just read the paper. Sigh.
Quote from the paper:
> Emergent few-shot prompted tasks are also unpredictable in the sense that these tasks are not explicitly included in pre-training, and we likely do not know the full scope of few-shot prompted tasks that language models can perform.
> "Humans are, frankly, not very interseting. You change the things you tell them and you get different responses? What a surprise!"
Maybe try making an argument that doesn't involve twisting what I say. I notice you haven't responded to my other post where I clarified what my core point is.
If you wrote a paper saying that people respond to different stimuli with different responses, I would in fact say that paper is not interesting. Just as the paper about AIs creating different outputs from different inputs is not interesting.
That isn't a statement about humans, just as my other statement wasn't a statement about AI. It's a statement about the paper.
In fact, I do think ChatGPT and other LLMs are very interesting. They're just not beyond human understanding.
Again: LLMs are very interesting. The paper you posted, isn't interesting.
> I'm not sure what your point is. I wish you'd just read the paper. Sigh.
I did read the paper, and as long as you accuse me of not reading the paper, my only response is going to be that I did read the paper. It would be a better use of everyone's time if you refrained from ad hominem attacks.
> Quote from the paper:
> > Emergent few-shot prompted tasks are also unpredictable in the sense that these tasks are not explicitly included in pre-training, and we likely do not know the full scope of few-shot prompted tasks that language models can perform.
1. As mentioned before, they're using a pretty specific meaning of "emergent".
2. The word "explicitly" is doing a lot of work there. If you've got a massive training dataset and you don't explicitly include arithmetic, but you include, for example, Wikipedia, which contains a metric fuckton of arithmetic, then maybe it's not so surprising that your model learns something about arithmetic.
As with "emergent abilities", this is poor communication--and as with "emergent abilities", I think it's intentional. They're writing a paper that's likely to attract attention, because that's how you get funding for future research. But since they know that it wouldn't the paper also has to pass academic scrutiny, they include weasel-words like "explicitly" so that what they're saying is technically true. To the non-academic eye, this looks like they're claiming the abilities are surprising, but to one practiced in reading this sort of fluff, it's clear that the abilities aren't surprising. They're only "surprising in the sense that..." which is a much narrower claim. In fact, the claim they're making amounts to, "Emergent few-shot prompted tasks are also unpredictable in the sense that you can't predict them if you don't bother to look at what's in your training data".
Likewise with the second half of the sentence. Obviously we don't know all the few-shot prompted tasks the models can perform: it's trivial to prove that's an infinite set. But that doesn't mean that, for a given few-shot prompted task and a look through the training data, you can't predict whether the model can perform it.
The paper is full of these not-quite-interesting claims. Perhaps more directly in line with your point, the paper says: "We have seen that a range of abilities—in the few-shot prompting setup or otherwise—have thus far only been observed when evaluated on a sufficiently large language model. Hence, their emergence cannot be predicted by simply extrapolating performance on smaller-scale models." That's again obviously true: if you look only at the size of the models and don't look at the training data, you won't be able to predict what's cached in the model. If you don't look at the inputs, you'll be surprised by the outputs!
I don't blame the authors for doing this by the way--it's just part of the game of getting your research published. If they didn't do this, we might not have read the paper because it might never have gotten published. Don't hate the player, hate the game.
https://openreview.net/pdf?id=yzkSU5zdwD