Will a prompt that enables GPT-4 to solve easy Sudoku puzzles be found? (manifold.markets)
37 points by kilobit on July 21, 2023 | 66 comments


The title is funny to me. We should consider a new computational complexity class for LLMs. Let's call the ones that can be solved with a prompt, Promptable. For the problems that we cannot reliably solve with a single prompt yet, let's call them non-deterministic promptable, or NP.

Question is, for most of these hard problems, is there a prompt that can solve them? Better yet, is there a prompt good enough that we collapse all of the hardest problems in NP with a single prompt?

Will we ever know if NP can be reduced to P???


Or prompt(n) time, where n = the known minimum # of prompts required to solve a given class of problems.

From there we can define various classes of problems:

1) those with an absolute floor minimum # of required prompts

2) those with a known ceiling

Etc.

This should be combined with traditional Big-O notation to provide a more specific classification, e.g. a constant-time complexity task with a known ceiling of two prompts would be prompt(2)-O(1)

A problem known to, in some specific cases but not all, be solvable with some minimum number of prompts with no ceiling known might be Prompt(n(np)) where n = minimum prompts known to solve at least some problems in that class.

Classifications would be applied to specific systems but the best performing system would set the general classification for a problem. So the general classification for a problem might be prompt(1(5)) to denote a min 1 max 5 prompts required based on the best performance seen to date, a specific system might only rate a prompt(3(np)) classification.

I’m overthinking this, but I think I like it.


From what I can tell, experts currently project the problem "TURN HUMANS TO PAPERCLIPS" is in complexity class "Promptable," but that the Boolean Satisfiability problem is not in "Promptable."


Are large language models even Turing complete? Or more specifically, is there something we can say about LLMs as a class with respect to this question?

For any fixed architecture like GPT (built on Vaswani et al.'s Transformer), or a bigger iteration of it, eventually you run out of attention heads and layers.

If the answer is categorically no, then any sufficiently sophisticated code is not “promptable”. However, I don’t think there’s anything in principle which prevents LLMs from being Turing complete.


> Are large language models even Turing complete?

Idealized deterministic computing systems are the only thing that can be Turing complete; actual systems cannot be, because Turing completeness requires infinite space. LLMs are actual systems, and they are also not limited-space approximations of idealized deterministic systems (they are, I suppose, deterministic if you know all the relevant parameters, including potentially some that are hardware-dependent, but they are generally a deterministic approximation of a nondeterministic system). You can, of course, prompt an LLM to predict the output of a deterministic system and to do direct computation, but, absent an interface to external tools that actually do the computation, the results are notoriously unreliable.


> Idealized deterministic computing systems are the only thing that can be Turing complete

That’s not true. My computer is for all practical purposes Turing complete - its tape is not the RAM but, through side effects like being connected to the internet, effectively the whole universe. So while the universe itself is finite and nothing material can be mathematically infinite, Turing completeness only fails “lazily”: unless you hit the limits, it is as good as infinite.

As for the current topic, prompting problems are after a while just memoization to some limit with some strange encoding.


> My computer is for all practical purposes Turing complete

“For all practical purposes” is a long way of saying “not”; a large-but-finite tape is not infinite, and the key properties of Turing completeness (both universal computation and the consequent equivalence with all other Turing complete systems) do not hold with “finite but large tapes”, no matter if large is 640 kilobytes or 640 quettabytes. Particularly, differently structured “Turing complete but for finite size” systems of similar actual capacity in bytes are not guaranteed to be able to compute the same subset of all computable results. (Actually, Turing machines with the same size tape would be, but “Turing complete but for size” does not imply a consistent ratio of material storage space to equivalent Turing machine tape size across problems.)


Can you give me any way that can differentiate between my practical machine with memory sized n, and a real infinite Turing machine in a finite amount of time t?

If not, then for all practical purposes the two are exactly the same, which is my point. This is not the case with LLMs.


Sure, then a trivial example of a non-“promptable” input would be an input containing a problem whose solution requires more memory than any computer currently has. But I don’t think that’s what they were looking for.


With memory they are. https://arxiv.org/abs/2301.04589


You are mixing up LLMs with Transformers. Transformers with memory are Turing complete, but AFAIK, current state of the art LLMs aren't trained with any kind of memory.


I'm not. The paper deals with language models specifically. Memory is just dynamically inserted context.


So as I see it, it's possible to teach a SOTA LLM with prompts to emulate the state-machine part of a Turing machine. Am I right?


Ok. I see. Sorry.


> Are large language models even Turing complete?

If you want to see where they have problems, ask them to do something with deeply hierarchical objects. For example, consider this prompt: "Draw me a complete binary tree with numbers from 1 to 128 using pseudographics"

In my experience, the deeper the structure, the more problematic it is for the current generation of LLMs.
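
For reference, roughly what a correct answer to that prompt entails (a hedged sketch, assuming heap-style numbering where node n has children 2n and 2n+1, and 127 nodes for a perfectly complete tree; the function name is made up):

    # Rough reference for the prompt above (not an LLM answer): print a
    # complete binary tree using heap numbering; node n has children 2n and 2n+1.
    def draw(node, last_node, prefix="", is_last=True):
        if node > last_node:
            return
        print(prefix + ("└── " if is_last else "├── ") + str(node))
        child_prefix = prefix + ("    " if is_last else "│   ")
        draw(2 * node, last_node, child_prefix, 2 * node + 1 > last_node)
        draw(2 * node + 1, last_node, child_prefix, True)

    draw(1, 127)  # 127 nodes = a complete tree of depth 7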


Attention is Turing complete: https://news.ycombinator.com/item?id=36332033

I'd guess it's not as efficient as a native algorithm in many cases though.


It's not realistically Turing complete. It assumes infinite precision.


Right, but a Turing machine assumes infinite storage space, which is itself impossible. You cannot have infinite precision without infinite storage, and all real computers that we colloquially say are Turing complete have finite everything.


It's possible to argue that there's no such thing as infinite-precision floats. I.e., all objects have limited "size", and real numbers are just a nice abstraction to simplify a discrete world. (https://en.wikipedia.org/wiki/Finitism)

The Turing machine is not such a thing. At each moment in time, only finitely many cells of the tape are used. (The same applies to natural numbers: there are infinitely many of them, but each one of them has a finite description.)


It lazily uses the infinite tape though, so it is Turing complete for an actual, finite number of steps. You can’t have infinite precision even for a single step.


> Will we ever know if NP can be reduced to P???

My opinion is that it cannot, due to the unbounded nature of NP problems[0]. Regarding sudoku specifically, the question is a bit more nuanced (as described here[1]).

As for the NP nature of sudoku in its general form, a short but very informative description can be found here[2].

HTH

0 - https://en.wikipedia.org/wiki/NP_(complexity)

1 - https://stackoverflow.com/questions/50703174/is-sudoku-np-co...

2 - http://www.cs.ox.ac.uk/people/paul.goldberg/FCS/sudoku.html


> The title is funny to me. We should consider a new computational complexity class for LLMs. Let's call the ones that can be solved with a prompt, Promptable. For the problems that we cannot reliably solve with a single prompt yet, let's call them non-deterministic promptable, or NP.

Promptable is old hat now. I propose an entirely new class of problems which is whether you can prompt GPT to construct a prompt for GPT that can solve a problem. I call it Deep-Promptability™ (patent pending). The "order" is defined by how many levels of prompting it takes to solve the problem, so if a problem is order-3 deep-promptable then you can prompt GPT to construct a prompt for GPT that will construct a prompt that allows GPT to solve it.


I honestly can't tell if any of the other replies to this comment got the joke.


An LLM is a tool. It is a very versatile tool. It can be used in many situations. It does not therefore follow that it should be used in all situations. Even if you wanted to use an AI to solve sudoku, there is no particular reason to begin with a model trained for language modeling instead of a model better suited to the task.


I would, uh, bet that you're right.

But given that there has been a lot of discussion of the possibility that an LLM has "general intelligence", it seems worthwhile to figure out whether the solving of a random problem is possible.


They don't possess general intelligence, end of discussion. Thanks for attending my TED talk.


Seriously. Will someone make general intelligence by gluing together an LLM and some other AI stuff? I dunno, maybe. But currently existing LLMs don’t have GI and it’s really easy to show this by chatting with them and asking them GI questions not in the training data.


I don't get it. There are so many ways to solve sudokus; why does anyone care about this anyway?


Well, it's not really about finding a way to solve sudokus.

Nobody involved in this cares for that as a goal in itself.

It's about the mystery of why an LLM can't do it well.

It's about the challenge of finding a way (prompt) to get it to.

It's about what this reveals about the inner workings and limitations of an LLM.


So maybe I think about things a little differently, but is there a theoretical reason why we should expect a large language model to be good at sudokus? I remember not long ago they often struggled with adding two numbers.


>is there a theoretical reason why we should expect a large language model to be good at sudokus

Because LLMs have shown the ability to be good at many tasks not directly related to language, and even exhibited some crude "general intelligence" traits.

So, some people would like to find how far this can be pushed, and why it works for e.g. a lot of tasks involving abstract manipulation of symbols and logical analysis, but not for a basic enough and clear goal like solving a simple sudoku.


What tasks would you say LLMs are good at that are not related to language?


It's very hard to define what is and is not "related to language" and this is kind of a fundamental question that seemed to get a lot of attention in the 20th century. Maybe these language models can help shine some light on that.

According to OpenAI, GPT-4 scores 4 on AP Calculus BC, 5 on AP Statistics, 4 on AP Chemistry, 4 on AP Physics 2. But is mathematical/logical reasoning largely a language task? I don't really know. I feel pretty confident saying that riding a bike is not a language task, but logical reasoning, I'm not so sure.


You also have to recall that these models were trained on the study materials of all of those tasks. That doesn't cheapen the achievement except to say, it's not "emergent behavior". Probably has half a billion weights dedicated to each of those exams.


Exactly what I was trying to imply. Very difficult to classify what is not relevant to language.


LLMs are good at a lot of things we don't have a good reason to expect them to be good at. It's very hard to come up with "theoretical reasons" it should be good at things, in "theory" they should not be nearly as capable as they are. Even NLP researchers have been shocked at how well this has worked.


If there is no theory or expected result, why should anyone care what it's good at or not? You kinda get what you get, and if you don't get what you want, you do what?


It’s just a well known problem case that has a straightforward answer that is easily verifiable.

E.g., can a model play tic-tac-toe or solve chess puzzles?


I feel like it's kind of a weird question because if you change the random seed enough times maybe one of them could be good at chess puzzles but suck at being a chat bot, or be good at sudokus but be a horrible pair programmer. I don't know what value a lot of these questions bring once a model hits a trillion parameters of which none or very very few are understood.


Austin from Manifold here - cool to see this trending! I thought the structure of this prediction market was especially cool, as it forms a collaborative, crowdsourced puzzle challenge to generate the perfect prompt.

(I've personally bet yes, but not sure if that prediction is holding up...)


> Easy-rated Sudoku puzzle means a puzzle classified as easy by any reputable Sudoku site or puzzle generator. This market plans to use the LA Times daily Sudoku (latimes.com) for judging, but I maintain the option to use a different Sudoku generator.

Is there any theoretical reason why an attention-based LLM could or couldn't generate an answer to an NP-hard problem? As I understand it, attention is N^2, but it's not obvious if that's relevant to the complexity of problems that can be solved. It's obviously not relevant to answers that are regurgitated, which may be all answers?

It would be better if "easy" had a mathematical definition.


A single forward pass shouldn't be able to, but remember the format allows it to be iterated. So it should be Turing Complete, if the error rate is low enough and enough iterations are allowed.
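
Roughly, the iteration I mean is just the decode loop; a toy sketch where `next_token` is a hypothetical stand-in for one forward pass (not any real API) and the growing context plays the role of a bounded work tape:

    # Hypothetical sketch: autoregressive decoding as an iterated loop.
    # next_token stands in for a single forward pass of the model.
    def run(next_token, prompt, max_steps=10_000):
        context = prompt
        for _ in range(max_steps):       # "enough iterations"
            tok = next_token(context)    # one forward pass
            if tok == "<halt>":          # assumed stop convention
                break
            context += tok               # output fed back in as state
        return context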


I recommend reading the theoretical work on the computational capabilities of Transformers: https://twitter.com/lambdaviking/status/1630581475425828864 References to other work can probably be found in that article.

Shameless plug to my own blogpost about this: https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/

TL;DR: The theoretical class of problems that Transformers can solve (without Chain-of-Thought style responses) is fairly limited. Generally, universal approximation proofs rely on infinite precision assumptions, which are not practical in reality. Empirical results also show very limited capabilities when tested on certain formal languages.

In the Sudoku case, the problem-length is limited, so one could conceptually make a large enough model that could memorise all solutions to all possible combinations of permissible sudoku boards, which could then just access and read out the solutions.


> The theoretical class of problems that Transformers can solve (without Chain-of-Thought style responses) is fairly limited.

Which is irrelevant because how would a Transformer emit a complete Sudoku solution in a single forward-pass/token in the first place?


I suppose you mean in order to give the answer to a Sudoku puzzle, you'd need a string of tokens anyway: [(x,y) grid coordinates], [digit].

I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.

My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku; you'd need the intermediate CoT "thinking out loud".


> I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.

Yes, and maybe the occasional generation of the complete boardstate to date, because you don't want to leave the boardstate implicit and require it to be reconstructed within each forward pass - that's 'using up serial computations' that a Transformer can't afford. But if you periodically serialize the best-answer-to-date, you are more likely to be able to bite off a chewable chunk.

> My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku

A Transformer is not much different from an unrolled RNN without weight-sharing, so for any specific sudoku size, there should be some depth which does allow the worst-case amount of backtracking or other solution to the problem. (One way to show this would be to use the RASP programming language to program such a solver.) It's just it'd probably be bigger/deeper than you have available now.


Right, I see your point. Since Sudoku is fixed-size, you can always construct a Transformer with the worst-case depth. That makes sense.

I was assuming given a trained Transformer, you wouldn't know how many effective "steps of computation" it contained, and so would probably have to resort to CoT.


> Is there any theoretical reason why an attention-based LLM could or couldn't generate an answer to an NP-hard problem?

Putting aside for the moment that a Large Language Model (LLM) is a predictive statistical model built from, and producing output based on, its training set, answering whether or not any algorithm can solve an NP-hard problem first requires a clarification: is a brute-force exhaustive search allowed?

If it is, I am unsure if an arbitrary LLM could find a solution, due to its dependence on training. If not, I am confident in saying an LLM could not solve arbitrary NP-hard problems in polynomial time, as that has yet to be proven possible AFAIK.


LLMs aren't statistical in the sense of memorizing percentages of words that come after other words. They are modeling a very high dimensional function using a neural net. I suppose they're statistical in the sense of learning how to mimic what they've seen, but this includes some very surprising emergent abilities as well.


> LLMs aren't statistical in the sense of memorizing percentages of words that come after other words.

Agreed, in that LLMs are an improvement beyond Bayesian models[0].

> I suppose they're statistical in the sense of learning how to mimic what they've seen, but this includes some very surprising emergent abilities as well.

Your point of "mimic what they've seen" is what I mean by being predictive statistical models. And yes, there very well can be surprising, even emergent, output depending on the training data set.

But to refocus back onto the original question the article presents: the idea that an LLM could somehow produce solutions to a problem category which has no known mathematically grounded solution is a bit fantastical IMHO.

0 - https://en.wikipedia.org/wiki/Bayesian_statistics



I haven't seen prediction contracts that can be resolved at 50% before. It seems like that throws off the interpretation of the probability.


For most sudoku puzzles, solving each cell is a logic puzzle that can be expressed in words just as current solvers have it in code. It can try to solve each vacant cell in turn using each of its rules until it finds a solution for that cell. And then it can be told to keep trying to solve another cell until it finishes the puzzle. Brute force with a core of logic.
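
As a sketch of what "a core of logic" could mean in code (not anyone's actual prompt or solver; names and structure are my own), the single simplest rule - fill any cell whose row, column and box leave exactly one candidate - already handles most easy-rated puzzles:

    # Minimal sketch of the "one rule per vacant cell" loop described above.
    # grid is a 9x9 list of lists, 0 marks a blank cell.
    def candidates(grid, r, c):
        used = set(grid[r]) | {grid[i][c] for i in range(9)}
        br, bc = 3 * (r // 3), 3 * (c // 3)
        used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
        return set(range(1, 10)) - used

    def solve_easy(grid):
        progress = True
        while progress:                      # keep sweeping until no cell changes
            progress = False
            for r in range(9):
                for c in range(9):
                    if grid[r][c] == 0:
                        cands = candidates(grid, r, c)
                        if len(cands) == 1:  # "naked single": only one legal digit
                            grid[r][c] = cands.pop()
                            progress = True
        return grid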


Unlikely, for reasons explained in this video: https://youtu.be/bEovhfxJsM4?t=2339

However, it apparently can write a program using the Z3 SAT solver to find a solution.


It seems like it might be possible to get around this by letting the model emit moves like `discard last 5`, which would also let it keep a history of its previous branches.
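
Something like this, maybe - a hypothetical harness where the model only emits "place r c d" or "discard last n" and the harness keeps the history (the command format and class name are made up for illustration):

    # Hypothetical harness sketch: the model emits text commands, the harness
    # tracks placements so "discard last n" can undo them.
    class MoveLog:
        def __init__(self, grid):
            self.grid = [row[:] for row in grid]   # 9x9, 0 = blank
            self.history = []                      # (row, col) of each placement

        def apply(self, command):
            parts = command.split()
            if parts[0] == "place":                # e.g. "place 3 7 9"
                r, c, d = map(int, parts[1:4])
                self.grid[r][c] = d
                self.history.append((r, c))
            elif parts[:2] == ["discard", "last"]: # e.g. "discard last 5"
                for _ in range(int(parts[2])):
                    r, c = self.history.pop()
                    self.grid[r][c] = 0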


> However, it apparently can write a program using the Z3 SAT solver to find a solution.

Not that it's unimpressive in general that a LLM can write a program from a prompt like that, but for this particular juxtaposition it doesn't seem like it's especially interesting or impressive that it can write such a program. A SAT program is basically just restating the rules in a particular form. It doesn't even have to be able to apply those rules. The solver does the hard work.
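
For what it's worth, such a program presumably looks something like the standard z3py Sudoku encoding - a sketch assuming the z3-solver Python package, with `puzzle` as a 9x9 list of ints and 0 for blanks - and it really is just the rules restated as constraints:

    # Sketch of a Z3 Sudoku encoding: restate the rules, let the solver search.
    from z3 import Int, Solver, And, Distinct, sat

    def solve_with_z3(puzzle):
        cells = [[Int(f"c_{r}_{c}") for c in range(9)] for r in range(9)]
        s = Solver()
        # every cell holds a digit 1-9
        s.add([And(1 <= cells[r][c], cells[r][c] <= 9)
               for r in range(9) for c in range(9)])
        # rows, columns and 3x3 boxes are all-different
        s.add([Distinct(cells[r]) for r in range(9)])
        s.add([Distinct([cells[r][c] for r in range(9)]) for c in range(9)])
        s.add([Distinct([cells[3 * br + i][3 * bc + j]
                         for i in range(3) for j in range(3)])
               for br in range(3) for bc in range(3)])
        # pin the given clues
        s.add([cells[r][c] == puzzle[r][c]
               for r in range(9) for c in range(9) if puzzle[r][c]])
        if s.check() == sat:
            m = s.model()
            return [[m.evaluate(cells[r][c]).as_long() for c in range(9)]
                    for r in range(9)]
        return None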


Also, if it has enough training data to know how to write such a SAT solver program, it has surely seen sudoku exercises as part of that data set.


Will a prompt that enables a prompt that enables GPT-4 to solve easy Sudoku puzzles be found?

I don't know how much meta-prompting has been explored. Maybe that's where The Singularity is at?


Can we use this technology to find a recognizable pattern in any complex blob of data? Is that how this works?

How about, "Given 100000 readings from a person's body/brain, determine whether they are lying". Can we do that?


I don’t understand how so many people on Hacker News engage in this line of questioning.

If by “this technology” you mean “large neural networks” the answer is yes, and we’ve been doing so for several decades now. That’s very specifically what they’re good at.

If you mean “LLMs like ChatGPT” specifically, then no, they’re extremely large neural networks trained on very specific data sets. To perform a different recognition task, you train with different data sets.

Where does this idea that ChatGPT and friends are general-purpose come from?


>Where does this idea that ChatGPT and friends are general-purpose come from?

Maybe reality?

https://general-pattern-machines.github.io/

Large Language Models are as general purpose as they come especially for Machine Learning. They generalize to any kind of pattern, linguistic or not.


Ya. They've used NN for recognizing faces, airplanes, trolls, songs... all kinds of stuff.

Gleaning useful order from vast dizzying complexity is the name of the game. Or so is my rudimentary understanding.


>Can we use this technology to find a recognizable pattern in any complex blob of data? Is that how this works?

Possibly in general. https://general-pattern-machines.github.io/

As for your example, I don't think so.


Seeing this thread makes me want to bust out some Prolog.


Only allowing 50 generations makes this very hard.



