Hey, this is really interesting. What are the features you used to measure the reasoning complexity? In other words, how does one evaluate a query during classification?
We use an adaptive classifier to learn how many tokens the model takes to respond correctly on a known dataset. I used the https://huggingface.co/adaptive-classifier/llm-router for experiments it is based on distilbert.