Yeah, understood. I'm excited for the reduction in parameter count that will come when this is taken up in major models.
I meant it rhetorically, in reference to interpretability. I don't see a real difference between training a 100b-parameter model and a (fixed) 4x recurrent 25b-parameter model as far as understanding what the model is `thinking` during next-token prediction goes.
You should be able to use the same interpretability tooling for either. It can only `scheme` so much before it outputs the next token, whether the model is simply fixed-size and quite deep, or recurrent.
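To make the parameter-count and interpretability point concrete, here's a toy sketch (illustrative dimensions and a stand-in "block", not a real transformer): a depth-4 stack of distinct blocks and one shared block applied 4x expose the same sequence of intermediate activations to probing, but the recurrent version carries 1/4 the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (purely illustrative)

def block(x, W):
    # One "layer": a linear map plus nonlinearity, standing in for a
    # transformer block.
    return np.tanh(x @ W)

x = rng.normal(size=(1, d))

# Deep model: 4 distinct blocks -> 4 * d * d parameters.
deep_weights = [rng.normal(size=(d, d)) for _ in range(4)]
deep_acts = [x]
for W in deep_weights:
    deep_acts.append(block(deep_acts[-1], W))

# Recurrent model: one shared block applied 4 times -> d * d parameters.
W_shared = rng.normal(size=(d, d))
rec_acts = [x]
for _ in range(4):
    rec_acts.append(block(rec_acts[-1], W_shared))

deep_params = sum(W.size for W in deep_weights)
rec_params = W_shared.size
print(deep_params, rec_params)        # 1024 256 -> 4x fewer parameters
print(len(deep_acts), len(rec_acts))  # 5 5 -> same activations to inspect
```

Either way there are four intermediate states between input and output token, which is where activation-level tooling would hook in.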