Not really. o3-low compute still stomps the benchmarks and isn't anywhere near that expensive, and o3-mini seems better than o1 while being cheaper.
Combine that with the fact that LLM inference costs have dropped by orders of magnitude over the last few years, and harping on the inference costs of a new release seems a bit silly.
If you're talking about the ARC benchmark, then o3-low doesn't look that special once you take into account that plenty of fine-tuned models with far smaller resources have achieved 40-50% on the private set (not the semi-private one like o3-low).
- I'm not just talking about ARC. On Frontier Math, we have 2 scores, one with pass@1 and another with a consensus vote over 64 samples. Both scores are much better than the previous SOTA.
- Also, apparently ARC wasn't a special fine-tune; rather, some of the ARC training set was included in the pre-training corpus.
>that result is not verifiable, not reproducible, unknown if it was leaked and how it was measured. It's kinda hype science.
It will be verifiable when the model is released. OpenAI haven't released any benchmark scores that were later shown to be falsified, so unless you have an actual reason to believe they're outright lying, it's not something to take seriously.
Frontier Math is a private benchmark. Of its highest tier of difficulty, Terence Tao says:
“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”
Unless you have a reason to believe the answers were leaked, then again, I'm not interested in baseless speculation.
>it's private for outsiders, but it was developed in "collaboration" with OAI, and GPT was tested in the past on it, so they have it in logs somewhere.
They probably have logs of the questions, but that's not enough. Frontier Math isn't something that can be fully solved without gathering top experts across multiple disciplines. Even Tao says he only knows who to ask for the most difficult set.
Basically, what you're suggesting, at least with this benchmark in particular, is far more difficult than you're implying.
>If you think this entire conversation is pointless, then why do you continue?
There's no point arguing about how efficient the models are (the original point) if you won't even accept the results of the benchmarks. Why am I continuing? For now, it's only polite to clarify.
> Frontier Math isn't something that can be fully solved without gathering top experts
Tao's quote above referred to the hardest 20% of problems; there are 3 levels of difficulty, and presumably the first level is much easier. Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.
> There's no point arguing
Lol, let me ask again, why are you arguing then? Yes, I have strong, reasonable (imo) doubts that those results are valid.
The lowest set is easier but still incredibly difficult. Top experts are no longer required, sure, but that's it. You'd still need the best of the best undergrads, at the very least, to solve it.
>Also, as I mentioned, OAI collaborated on creating the benchmark, so they could have access to all the solutions too.
OpenAI didn't have any hand in providing the problems; why you assume they have the solutions, I have no idea.
>Lol, let me ask again, why are you arguing then? Yes, I have strong, reasonable (imo) doubts that those results are valid.
Are you just being obtuse or what? I stopped arguing with you a couple of responses ago. You have doubts? Good for you. They don't make much sense, but hey, good for you.
Not necessarily. And this is the problem with ARC that people seem to forget.
- It's just a suite of visual puzzles. It's not like, say, GSM8K, where proficiency gives some indication of math proficiency in general.
- It's specifically a suite of puzzles that LLMs have shown particular difficulty with.
Basically, how much compute it takes to handle a task in this benchmark does not correlate with how much compute LLMs will need for the tasks people actually want to use them for.
If the benchmark is not representative of normal usage,* then the benchmark and the plot being shown are not useful at all from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high in ARC-AGI would be highly misleading. Also, the "representative" point is really moot from the discussion's perspective (i.e. saying o3 stomps benchmarks, but the benchmarks aren't representative).
*I don't think that is the case, as you can at least draw relative conclusions (e.g. o3 vs the o1 series: o3-low is 4x to 20x the cost for ~3x the perf). Even if it is pure marketing, they expect people to draw conclusions from the perf/cost plot from ARC.
PS: I know there are more benchmarks, like SWE-Bench and Frontier Math, but this is the only one showing data about o3-low/high costs, not counting the Codeforces plot that includes o3-mini (that one does look interesting, though right now it's vaporware) but doesn't separate the compute-scale modes.
>If the benchmark is not representative of normal usage,* then the benchmark and the plot being shown are not useful at all from a user/business perspective, and the focus on the breakthrough scores of o3-low and o3-high in ARC-AGI would be highly misleading.
ARC is a very hyped benchmark in the industry, so letting us know the results is something any company would do, whether it has any direct bearing on normal usage or not.
>Even if it is pure marketing, they expect people to draw conclusions from the perf/cost plot from ARC.
Again, people care about ARC; they don't care about doing the things ARC questions ask. That it is uneconomical to pay the price to use o3 for ARC does not mean it would be uneconomical to do so for the tasks people actually want to use LLMs for. What does 3x the performance in, say, coding mean? You really think companies/users wouldn't put up with the increased price for that? You think they have MTurkers to turn to like they do with ARC?
ARC is literally the quintessential 'easy for humans, hard for AI' benchmark. Even if you discard the 'difficulty-to-price won't scale the same' argument, it makes no sense to use it for an economics comparison.
In summary: the "stomps benchmarks" claim means nothing for anyone trying to make decisions based on that announcement (yet they show cost/perf info). It seems, well, hypey.