I suppose you mean that, in order to give the answer to a Sudoku puzzle, you'd need a string of tokens anyway: [(x,y) grid coordinates], [digit].
I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.
My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku; you'd need the intermediate CoT "thinking out loud".
> I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.
Yes, and maybe the occasional generation of the complete boardstate to date, because you don't want to leave the boardstate implicit and require it to be reconstructed within each forward pass - that's 'using up serial computations' that a Transformer can't afford. But if you periodically serialize the best-answer-to-date, you are more likely to be able to bite off a chewable chunk.
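To make that concrete, here is a rough sketch in plain Python (not a Transformer) of what such a trace could look like: a backtracking solver that logs every fill and undo, and writes out the whole board every few steps. The names `solve_with_trace`, `serialize`, and `snapshot_every`, and the `FILL`/`UNDO`/`BOARD` line format, are made up for illustration.

```python
from typing import List, Optional, Tuple

Board = List[List[int]]  # 9x9 grid, 0 marks an empty cell


def serialize(board: Board) -> str:
    """Write out the whole board as text, one row per line, '.' for blanks."""
    return "\n".join("".join(str(d) if d else "." for d in row) for row in board)


def allowed(board: Board, r: int, c: int, d: int) -> bool:
    """Check the row, column and 3x3 box constraints for digit d at (r, c)."""
    if any(board[r][j] == d for j in range(9)):
        return False
    if any(board[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(board[i][j] != d for i in range(br, br + 3) for j in range(bc, bc + 3))


def first_empty(board: Board) -> Optional[Tuple[int, int]]:
    """Return the first empty cell in row-major order, or None if the board is full."""
    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                return r, c
    return None


def solve_with_trace(board: Board, snapshot_every: int = 10) -> Tuple[bool, List[str]]:
    """Solve in place by backtracking; return (solved, trace lines)."""
    trace: List[str] = []
    steps = 0

    def emit(line: str) -> None:
        nonlocal steps
        trace.append(line)
        steps += 1
        # Periodically make the board state explicit instead of leaving it
        # to be reconstructed from the FILL/UNDO history.
        if steps % snapshot_every == 0:
            trace.append("BOARD\n" + serialize(board))

    def search() -> bool:
        cell = first_empty(board)
        if cell is None:
            return True
        r, c = cell
        for d in range(1, 10):
            if allowed(board, r, c, d):
                board[r][c] = d
                emit(f"FILL ({r},{c}) {d}")
                if search():
                    return True
                board[r][c] = 0
                emit(f"UNDO ({r},{c}) {d}")
        return False

    solved = search()
    trace.append("BOARD\n" + serialize(board))  # final answer, fully written out
    return solved, trace
```

The `BOARD` lines are redundant given the `FILL`/`UNDO` history, which is the point: anything reading the trace can pick up from the latest snapshot rather than replaying every step.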
> My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku
A Transformer is not much different from an unrolled RNN without weight-sharing, so for any specific Sudoku size, there should be some depth which does allow the worst-case amount of backtracking or other solution to the problem. (One way to show this would be to use the RASP programming language to program such a solver.) It's just that it'd probably be bigger/deeper than what you have available now.
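A loose analogy for the depth argument, in plain Python rather than RASP: treat one fill-or-undo step as a "layer" and apply it a fixed number of times with no early exit. This is only a sketch of the unrolling idea, not a claim about how a real Transformer would implement it; `layer`, `run_unrolled`, and the state layout are hypothetical names.

```python
from typing import List, Tuple

# State: (flat list of 81 cells, stack of (position, digit) decisions,
#         last digit already tried at the next empty cell)
State = Tuple[List[int], List[Tuple[int, int]], int]


def parse(puzzle: str) -> List[int]:
    """Turn an 81-character string ('.' or '0' for blanks) into a flat list."""
    return [0 if ch in ".0" else int(ch) for ch in puzzle]


def allowed(cells: List[int], pos: int, d: int) -> bool:
    """Check row, column and 3x3 box constraints for digit d at index pos."""
    r, c = divmod(pos, 9)
    br, bc = 3 * (r // 3), 3 * (c // 3)
    for k in range(9):
        if cells[9 * r + k] == d or cells[9 * k + c] == d:
            return False
        if cells[9 * (br + k // 3) + bc + k % 3] == d:
            return False
    return True


def layer(cells: List[int], stack: List[Tuple[int, int]], start: int) -> State:
    """One bounded unit of work: fill one cell, or undo one decision."""
    if 0 not in cells:
        return cells, stack, start  # already solved: act as the identity
    pos = cells.index(0)
    for d in range(start + 1, 10):
        if allowed(cells, pos, d):
            cells[pos] = d
            stack.append((pos, d))
            return cells, stack, 0
    if not stack:
        return cells, stack, 9  # nothing left to undo: puzzle is unsolvable
    prev_pos, prev_d = stack.pop()  # backtrack one decision
    cells[prev_pos] = 0
    return cells, stack, prev_d  # resume at prev_pos from the next digit


def run_unrolled(puzzle: str, depth: int) -> str:
    """Apply `layer` exactly `depth` times, like a fixed stack of layers."""
    cells, stack, start = parse(puzzle), [], 0
    for _ in range(depth):  # fixed depth, no data-dependent early exit
        cells, stack, start = layer(cells, stack, start)
    return "".join(str(d) if d else "." for d in cells)
```

With enough depth the output is the full solution; with too little, it still contains `.` blanks, which is the "bigger/deeper than you have available" case.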
Right, I see your point. Since Sudoku is fixed-size, you can always construct a Transformer with the worst-case depth. That makes sense.
I was assuming that, given a trained Transformer, you wouldn't know how many effective "steps of computation" it contained, and so would probably have to resort to CoT.
Which is irrelevant because how would a Transformer emit a complete Sudoku solution in a single forward-pass/token in the first place?