Right, I see your point. Since Sudoku is fixed-size, you can always construct a Transformer with the worst-case depth. That makes sense.
I was assuming that, given a trained Transformer, you wouldn't know how many effective "steps of computation" it contained, and so would probably have to resort to CoT.