> GPT-3 struggles with large numbers, decimal numbers, and negative numbers. When used it returns answers that are close but often incorrect.
Regarding GPT-3's "guesstimates," intuitively it feels like the network has to guess because it hasn't been given a way to do exact computation--a neural network is built out of nonlinear functions--even if it "understands" the prompt (for whatever value you want to give to "understand").
Are there any techniques that involve giving the model access to an oracle and allowing it to control it? To continue the analogy, this would be the equivalent of giving GPT-3 a desk calculator.
If this is a thing, I have other questions. How do you train against it? Would the oracle have to be differentiable? (There are multiple ways to operate a desk calculator to evaluate the same expression.) Also, what control interface would the model need so that it can learn to use the oracle? (Would GPT-3 emit a sequence of one-hot vectors representing operations to perform, and would the calculator have "registers" that can be fed directly from the input text? Some way of indirectly referring to operands so the model doesn't have to handle them lossily.)
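One non-differentiable route people take is to keep the oracle outside the model entirely: the model emits a textual "calculator call", a wrapper intercepts it, evaluates it exactly, and splices the result back into the text. A minimal sketch, where the CALC(...) marker, the stub model, and the use of eval as a stand-in calculator are all made up for illustration:

```python
import re

def run_with_calculator(model, prompt):
    """Generate text, and whenever the model emits CALC(expr),
    evaluate expr exactly and splice the result back in."""
    text = model(prompt)
    # Look for a hypothetical CALC(...) marker in the model's output.
    match = re.search(r"CALC\(([0-9+\-*/. ()]+)\)", text)
    if match:
        # eval here is a stand-in for a real, sandboxed calculator oracle.
        exact = eval(match.group(1))
        text = text.replace(match.group(0), str(exact))
    return text

# Stub "model" that has learned to defer arithmetic to the oracle.
def stub_model(prompt):
    return "The answer is CALC(3 + 6)."
```

With this design the oracle never needs to be differentiable, because the model is only trained to produce the calling syntax, not to backpropagate through the calculator.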
There are many papers trying to couple language models with external modules.
In the Retrieval-Enhanced Transformer (RETRO) paper, a large language model was coupled with a similarity-based text index. It can populate the prompt with relevant information from the index, making it more grounded and easier to update.
In another paper (AlphaCode) the language model was coupled with a compiler: it could run programs and check whether they matched the expected outputs for a few test cases. The model was able to solve competition-style coding problems at above the average human score.
In another paper (Language Models as Zero-Shot Planners) a language model generates commands to navigate a virtual home environment and perform tasks. The knowledge in the LM helps it learn tasks quickly.
A recent one can learn new concepts through simple conversation, then apply them where necessary. You can talk-train your model. ("Memory-assisted prompt editing to improve GPT-3 after deployment")
So the trend is to add "toys" to language models: a simulator, a compiler, a search engine, a long-term memory module.
I'd like to see a recursive language model, that can sub-call itself to decompose problems.
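A toy sketch of what that control flow might look like, with a hypothetical stub standing in for the model (the DECOMPOSE/ANSWER/COMBINE prompts, the ATOMIC marker, and the ";" separator are all made up for illustration):

```python
def solve(problem, model, depth=0, max_depth=3):
    """Ask the model either to answer directly or to decompose the
    problem into subproblems, which are solved by recursive sub-calls."""
    if depth >= max_depth:
        return model("ANSWER: " + problem)
    plan = model("DECOMPOSE: " + problem)
    if plan.startswith("ATOMIC"):
        return model("ANSWER: " + problem)
    subproblems = plan.split(";")
    partials = [solve(p, model, depth + 1, max_depth) for p in subproblems]
    return model("COMBINE: " + " | ".join(partials))

# Deterministic stub model so the recursion is visible end to end.
def stub(prompt):
    if prompt.startswith("DECOMPOSE: big"):
        return "a;b"          # "big" splits into two subproblems
    if prompt.startswith("DECOMPOSE"):
        return "ATOMIC"       # everything else is answered directly
    if prompt.startswith("ANSWER: "):
        return prompt[8:].upper()
    return prompt[9:]         # COMBINE: just echo the partials
```

The depth cap matters: without it, a model that keeps decomposing would recurse forever, which is the same termination worry raised further down the thread.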
Yeah, but I didn't bring it up because I wasn't sure how much is really the model choosing and how much is the human workflow: they emphasize the interactive part heavily.
Anyway, today another great paper dropped on self-distillation: "STaR: Bootstrapping Reasoning With Reasoning" https://arxiv.org/abs/2203.14465 , Zelikman et al 2022.
> I'd like to see a recursive language model, that can sub-call itself to decompose problems.
I tried a very simple and specific version of this a few years ago (Recursive Application of Recurrent Neural Networks) and it worked great for intent parsing: https://github.com/spro/RARNN
Would like to see what "real" researchers with more modern models could do with the concept.
> The model was able to solve competition style coding problems above average human score.
I am not sure if I am thinking of the right study, but as far as I remember the pipeline included a human wading through and filtering solutions, and while there may have been a compiler attached, they also scored themselves. The marketing blurb of course tried to make it sound as if they had competed.
The model generates a large number of solutions; they then filter for those that actually compile and produce the right output when executed, then cluster them to select a few (<10) solutions and submit those. They are not allowed to submit too many attempts.
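The filter-then-cluster step described above can be sketched roughly like this. Note this simplifies AlphaCode considerably: here candidates are grouped by their behaviour on a single probe input, standing in for the paper's clustering over many generated inputs, and all the names are made up:

```python
def select_submissions(candidates, run, example_in, example_out, probe_in, k=10):
    """Keep candidates that pass the example test, group the survivors
    by their behaviour on a probe input, and submit one per group."""
    passing = [c for c in candidates if run(c, example_in) == example_out]
    clusters = {}
    for c in passing:
        # Programs that behave identically on the probe are likely duplicates.
        clusters.setdefault(run(c, probe_in), []).append(c)
    # One representative per behavioural cluster, capped at k submissions.
    return [group[0] for group in clusters.values()][:k]
```

The point of clustering is that many of the generated programs are near-duplicates, so submitting one representative per behaviour class spends the limited attempts on genuinely different strategies.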
Ah, the paper describes a fixed method for the last selection step, and also AI-generated tests to narrow the results further before that. Quite a bit better, even if the participation is still only simulated.
I believe the dominant thinking is that GPT-3 has trouble with math because it doesn't see individual digits. It obviously has no trouble working on words, which are much more discrete than numbers. I wouldn't be surprised if it had trouble carrying out a long equation, though. When writing, it can reconsider the whole context with each new word, externalizing that memory, but with most computations it would have to carry out the whole thing in one go. That's a lot of dedicated parameters for a single subtask.
Even the tokenization is wonky. Imagine if you had no concept of math characters and instead had a lookup table of common n-grams (BPE encoding). For example, the binary addition "3+b" may be tokenized as a single "unary" token because "3+b" occurs commonly. That tokenization is vastly different from the one for "3.00000001+b". GPT has to invert this tokenization artifact with finite training data.
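A toy illustration of the effect, using greedy longest-match as a crude stand-in for real BPE and a made-up merge vocabulary (these are not GPT's actual merges):

```python
def greedy_bpe(text, vocab):
    """Greedy longest-match tokenization, a crude stand-in for BPE."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first; single chars always match.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens

# A made-up vocabulary where the common string "3+b" got merged.
vocab = {"3+b", "00", "000"}
```

With this vocabulary, "3+b" becomes a single opaque token, while "3.00000001+b" shatters into pieces like "000" and "0" that share nothing with it, so the model never sees a stable representation of the "+" structure.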
Yeah, I think that's the most accepted explanation. Everything after my first sentence was total speculation, the tokenization is usually cited as the issue.
> with most computations it would have to carry out the whole thing in one go
Is there a way to allow models to say "let me think about this some more"? With language models like GPT-3 you emit one token per inference iteration, with its previous output fed back in as input/state. Can models opt out of providing a token, but still update state? That would allow it to break up the computation into discrete steps.
The RNN outputs a "confidence" bit which can guide the computation to perform more steps and obtain more confidence in the result. Essentially, the RNN asks "let me think about that some more".
But a separate ablation study found that if you just drop the confidence bit altogether and let the RNN compute more every time (e.g., always perform 4 computation steps on a single input for 1 output), you get the same or better results without the extra complexity in training.
There is also a Microsoft Research paper I can't find right now about variable computation for image classification, where there is a "confidence" bit at some of the later layers: if a lower layer is confident enough, its output is used for classification; otherwise the output of that layer is passed on for further transformation by the upper layers.
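The fixed-steps variant from that ablation is easy to sketch. This is a generic toy tanh cell in numpy, not the architecture from any specific paper:

```python
import numpy as np

def rnn_cell(state, x, W_h, W_x):
    """One step of a toy tanh RNN cell."""
    return np.tanh(state @ W_h + x @ W_x)

def step_with_pondering(state, x, W_h, W_x, n_steps=4):
    """Instead of a learned 'confidence' bit deciding when to stop,
    always apply the cell a fixed n_steps times per input token."""
    for _ in range(n_steps):
        state = rnn_cell(state, x, W_h, W_x)
    return state
```

The appeal is exactly what the comment says: no halting signal to train, no extra loss term, just a constant factor more compute per input.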
> But, separate ablation study found that if you just drop confidence bit altogether and allow RNN to compute some more every time (e.g., always perform 4 computations on single input for 1 output), you get same or better results without extra complexity of training.
Did they say what happens if you do both? Perhaps the "benefit from more computation per cycle" phenomenon and the "benefit from signalling relative computation resource allocation" one are different.
I guess I’ll have to try and read the paper, but I’m new to the literature and am clueless about the current state of research.
I believe GPT-3 has a transformer-based architecture, so it doesn't recursively ingest its own output in each iteration. I believe attention-based transformer models have enough complexity to be able to learn what you are talking about on their own.
GPT-3's transformers only recur some finite amount. Attention does a lot compared to a bog standard RNN, and probably if the numbers were tokenized it would be enough for most reasonable computations, but eventually you definitely would hit a cap. That's probably a good thing, of course. The network and training are Turing complete together, but it would suck if the network itself could fail to terminate.
Thank you for pointing out the difference. I went and reread about transformers; previously I thought they were a kind of RNN. (I am not an ML engineer.)
That would be neat. You could give it backspace and "let me think more" tokens that would signal the inference program to run it again on the prompt plus its own output. That way it could generate "thoughts thoughts thoughts [THINKMORE] thoughts thoughts thoughts [THINKMORE] [BACKSPACE]x8 (the real output would go here)".
It would of course have to be penalized in some way for [THINKMORE]ing, to avoid infinite processing time. It would have to learn to reason about the point at which diminishing returns kick in from continuing to [THINKMORE] vs. recording its best answer. The penalization function would also have to take into account the remaining tokens that would fit in the transformer prompt.
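The inference-side half of that scheme can be sketched as a decoding loop. The [THINKMORE] token, the per-round penalty, and the stub model are all hypothetical:

```python
def decode_with_pondering(model, prompt, max_rounds=5, penalty=0.1):
    """Re-run the model on prompt + its own scratch output for as long
    as it ends a round with [THINKMORE], charging a small score penalty
    per extra round so pondering has a cost."""
    context, score = prompt, 0.0
    out = ""
    for _ in range(max_rounds):
        out = model(context)
        if out.endswith("[THINKMORE]"):
            context += out          # keep the scratch thoughts around
            score -= penalty        # discourage unbounded pondering
        else:
            return out, score
    return out, score               # hard cap: forced to answer

# Stub model: "thinks" twice, then answers.
def stub(context):
    if context.count("[THINKMORE]") < 2:
        return "hmm [THINKMORE]"
    return "42"
```

The max_rounds cap plays the role of the prompt-length limit: eventually the model runs out of room to think and must commit to an answer.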
I think it would work, but backprop would be computed in a different way every time. I'm not an expert, so there may be sneaky ways around it, but I'm pretty sure you'd lose out on a long history of little efficiency improvements when you could just make it more recurrent instead.
Hardcoding a tokenization tweak that keeps individual digits separate would be a trivial change to the preprocessing that would not affect the rest of the model training process.
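For instance, a pre-tokenization pass that forces every number apart into single digits could be as small as one regex (this version is illustrative, not what any particular model ships):

```python
import re

def split_digits(text):
    """Insert a space between every pair of adjacent digits so BPE
    can never merge '123' into a single multi-digit token."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)
```

Everything downstream of the preprocessor, including the model architecture and training loop, is untouched.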
Not super well in the GPT-2 based models I have access to. It falls into different error modes though, diving into prose rather than even making an attempt. Makes sense in retrospect!
Yeah for sure. With energy prices soaring, Moore's law being morally over since 2010, wages being so completely destroyed by the hatred Democrats have for them, and the sneaky little misconceptions and errors the golem's makers did not fight hard enough to let in, AI will be supplanted by plain I.
Check out my project https://github.com/Thopliterce/transformer-arithmetic. This is a concrete implementation based on the GPT-2 model that does multiplication accurately, digit by digit. It does so by generating a dataset that teaches the model how to do multiplication step by step. Doing arithmetic actually works with just GPT-2, without an oracle.
That's actually pretty straightforward: (Tested with EleutherAI GPT-J-6B because why use a closed model when an open one exists?)
Prompt:
"Question: Solve three plus six.
Answer:
a=3
b=6
a+b
Question: Solve twelve times fifteen.
Answer:
a="
And the model dutifully answered:
"a=12
b=15
a*b"
Which you could feed directly to a python console.
This kind of approach, where you craft a long prompt to make the model understand the kind of result you want, is called "prompt engineering", and I find it crazy how close we are getting to robopsychology.
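Actually piping the completion into Python deserves a little care, since eval'ing raw model output is unsafe in general. A minimal sketch that only accepts simple assignments and arithmetic over lowercase names (the allow-list regex is my own, illustrative choice):

```python
import re

def execute_completion(completion):
    """Run lines like 'a=12', 'b=15', 'a*b' and return the last value.
    Only lowercase names, digits, and arithmetic operators are allowed."""
    env = {}
    result = None
    for line in completion.strip().splitlines():
        line = line.strip()
        if not re.fullmatch(r"[a-z0-9=+\-*/. ()]+", line):
            raise ValueError("refusing to run: " + line)
        if "=" in line:
            name, expr = line.split("=", 1)
            env[name.strip()] = eval(expr, {}, env)
        else:
            result = eval(line, {}, env)
    return result
```

Fed the model's completion from above ("a=12", "b=15", "a*b"), this returns the exact product, with the language model acting only as a translator from words to expressions.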
Well, the theory around neural nets strongly suggests that enough nonlinear activation functions combined in the right way should be able to learn any function, including basic arithmetic. Now, whether or not you have the right approach to training the network to get the right set of weights is a different story...
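Addition in particular is even linear, so the "right set of weights" exists and is trivially reachable: a single linear neuron fits y = a + b exactly. A minimal numpy sketch of learning it by gradient descent on random examples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(256, 2))   # training pairs (a, b)
y = X[:, 0] + X[:, 1]                     # target: a + b

w = np.zeros(2)                           # one linear neuron, no bias
for _ in range(200):
    # Gradient of mean squared error for the linear model X @ w.
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= 0.01 * grad

# The weights converge to [1, 1], i.e. exact addition on any inputs.
```

The hard part GPT-3 faces isn't representational capacity for the arithmetic itself; it's that nothing in next-token training on BPE-mangled text pushes it toward this clean solution.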