And while it is true that this experiment shows you can reproduce the concept of applying reinforcement learning directly to an existing LLM, in a way that makes it develop reasoning in the same fashion DeepSeek-R1 did, this is very far from a re-creation of R1!
R1 or the R1 finetunes? Not the same thing...
HF is busy recreating R1 itself, but that seems to be a pretty big endeavour, not a $30 thing.