
The only one it doesn't win is SWE-bench, where it is significantly behind Claude Sonnet. You just can't take down Sonnet.


One percentage point is not significant, neither in the colloquial nor the scientific sense[1].

[1] The binomial formula gives a 95% confidence interval of about ±3.7 percentage points, using p=0.77, N=500.
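
For anyone who wants to check the 3.7% figure in [1], here is a minimal sketch. It assumes the normal (Wald) approximation to the binomial and z = 1.96 for 95% confidence; the function name is made up for illustration.

```python
import math

def binomial_margin(p, n, z=1.96):
    """Half-width of a normal-approximation (Wald) confidence interval
    for a proportion p measured over n independent trials.
    z=1.96 corresponds to 95% confidence (an assumption of this sketch)."""
    return z * math.sqrt(p * (1 - p) / n)

# Numbers from the footnote: p=0.77, N=500
margin = binomial_margin(0.77, 500)
print(f"+/- {margin * 100:.1f} percentage points")
```

Under those assumptions, a one-point gap sits well inside the margin of error, which is the point of the footnote.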


Codex has been much better than Sonnet for me.


On what types of tasks?



