The catch is that performance is not actually comparable to o4-mini, never mind o3.
When it comes to LLMs, benchmarks are bullshit. If they sound too good to be true, it's because they are. The only thing benchmarks are useful for is preliminary screening - if a model does especially badly on them, it's probably not good in general. But if it does well on them, that doesn't really tell you anything.
It's definitely interesting how the comments right after the models were released were ecstatic about "SOTA performance" and it being "equivalent to o3", while comments like yours, hours later after actually testing it, keep pointing out that it's garbage compared to even the current batch of open models, let alone proprietary foundation models.
Yet another data point for benchmarks being utterly useless and completely gamed at this point by all the major AI developers.
These companies are clearly all very aware that the initial wave of hype at release is "sticky" and drives buzz/tech news coverage, while real-world tests take much longer before that first impression slowly starts to be undermined by practical usage and comparison to other models. Benchmarks with wildly overconfident names like "Humanity's Last Exam" aren't exactly helping with objectivity either.