Fair point. More benchmarks would definitely be good, but I'm optimistic they'll show similar results.
Anecdotally, my personal experience with the model is in line with what the benchmarks claim: it's a bit smarter and a bit faster than R1, and much faster than R1-0528, though not quite as smart (faster meaning fewer output tokens). For me it hits a sweet spot, and I use it as my daily driver.
That said, we'd need about three orders of magnitude more tests to make these numbers statistically meaningful.