Hacker News | tosh's comments

Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |

Yeah, but I feel like we are over the hill on benchmaxxing. Many times a model has beaten Anthropic on a specific bench, but the 'feel' is that it is still not as good at coding.

When Anthropic beats benchmarks it's somehow earned, but when OpenAI games them, it's somehow about not feeling good at coding.

I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.

Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.

'feel' is no more accurate

not saying there's a better way but both suck


Speak for yourself. I've been insanely productive with Codex 5.2.

With the right scaffolding these models are able to perform serious work at high quality levels.


He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck

..huh?

The variety of tasks they can do and will be asked to do is too wide and dissimilar, so it will be very hard to have a transversal measurement. At most we will have area-specific consensus that model X or Y is better. It is like saying one person is the best coder at everything; that does not exist.

Yeah, we're going to need benchmarks that incorporate a series of development steps for a particular language and measure how good each model is at them.

Like: can the model take your plan and ask the right questions where there appear to be holes?

How much of the architecture and system design around your language does it understand?

How does it choose among the algorithms available in the language or common libraries?

How often does it hallucinate features/libraries that aren't there?

How does it perform as context gets larger?

And that's for one particular language.


The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it feels a lot more informative than a simple benchmark because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.

At the end of the day “feel” is what people rely on to pick which tool they use.

Is 'feel' unscientific and broken? Sure, maybe, why not.

But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.

Benchmarks are a sometimes-useful tool. But we are in prime Goodhart's Law territory.


yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities

I don’t think this is even remotely true in practice.

Honestly, I have no idea what benchmarks are benchmarking. I don't write JavaScript or do anything remotely webdev related.

The idea that all models have very close performance across all domains is a moderately insane take.

At any given moment the best model for my actual projects and my actual work varies.

Quite honestly, Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 was released, no one was particularly excited. It was better, with some slightly larger numbers, but whatever. It took about a month before everyone realized "holy shit, this is a step function improvement in usefulness". Benchmarks being +15% better on SWE-bench didn't mean a damn thing.


Your feeling is not my feeling; Codex is unambiguously the smarter model for me.

Benchmarks are useless compared to real world performance.

Real world performance for these models is a disappointment.


It feels like the gap between open weight and closed weight models is closing though.

More like, open local models are becoming "good enough".

I got stuff done with Sonnet 3.7 just fine; it did need a bunch of babysitting, but it was still a net positive for productivity. Now local models are at that level, closing in on the current SOTA.

When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.


See, the market is investing like _that will never happen_.

I'm just riding the VC powered wave of way-too-cheap online AI services and building tools and scaffolding to prepare for the eventual switch to local models =)


Follow-up to the recent Z-Image Turbo release:

https://news.ycombinator.com/item?id=46095817


I think you can put a bunch of Apple silicon Macs with enough RAM together

e.g. in an office or coworking space

800-1000 GB of RAM perhaps?


afaiu not all of their models are open weight releases, this one so far is not open weight (?)

What would be a good coding model to run on an M3 Pro (18GB) to get a Codex-like workflow and quality? Essentially, I run out of usage quickly with Codex-High in VS Code on the $20 ChatGPT plan and am looking for cheaper/free alternatives (even if a little slower, but with the same quality). Any pointers?

Nothing. This summer I set up a dual 16GB GPU / 64GB RAM system, and nothing I could run was even remotely close. Big models that didn't fit in 32GB of VRAM had marginally better results but were at least an order of magnitude slower than what you'd pay for, and still much worse in quality.

I gave one of the GPUs to my kid to play games on.


Yup, even with 2x 24gb GPUs, it's impossible to get anywhere close to the big models in terms of quality and speed, for a fraction of the cost.

I'm running unsloth/GLM-4.7-Flash-GGUF:UD-Q8_K_XL via llama.cpp on 2x 24G 4090s which fits perfectly with 198k context at 120 tokens/s – the model itself is really good.

I can confirm, running glm-4.7-flash-7e-qx54g-hi-mlx here, a 22gb model @q5 on m4 max pro and 59 tokens/s.
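
For anyone curious what a setup like the ones above looks like in code, here is a minimal sketch using the llama-cpp-python bindings rather than the llama-server CLI. The GGUF filename, the ~198k context, and the even split across two 24 GB cards are assumptions taken from the comments, not a verified config.

```python
# Minimal sketch: load a quantized GGUF across two GPUs with llama-cpp-python.
# The model path, context size, and 50/50 tensor split are assumptions taken
# from the comments above, not a tested configuration.
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.7-Flash-UD-Q8_K_XL.gguf",  # hypothetical local GGUF file
    n_ctx=198_000,            # large context window, as quoted above
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # spread the weights across both 24 GB cards
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what this bash one-liner does."}],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
```

The Apple silicon route in the sibling comment would swap llama.cpp for MLX, but the shape of the setup (model, quant, context size) stays the same.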

Short answer: there is none. You can't get frontier-level performance from any open source model, much less one that would work on an M3 Pro.

If you had more like 200GB ram you might be able to run something like MiniMax M2.1 to get last-gen performance at something resembling usable speed - but it's still a far cry from codex on high.


At the moment, I think the best you can do is qwen3-coder:30b. It works, and it's nice to get some fully local LLM coding up and running, but you'll quickly realize that you've already tasted the sweet forbidden nectar that is hosted LLMs. Unfortunately.

They are spending hundreds of billions of dollars on data centers filled with GPUs that cost more than an average car and then months on training models to serve your current $20/mo plan. Do you legitimately think there's a cheaper or free alternative that is of the same quality?

I guess you could technically run the huge leading open weight models using large disks as RAM and have close to the "same quality" but with "heat death of the universe" speeds.


18 GB of RAM is a bit tight.

With 32 GB of RAM:

qwen3-coder and GLM 4.7 Flash are both impressive ~30B-parameter models

not on the level of GPT-5.2 Codex, but small enough to run locally (with 32 GB RAM, 4-bit quantized) and quite capable

But it is just a matter of time, I think, until we get quite capable coding models that can run with less RAM.
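
As a rough sanity check on the "~30B at 4-bit fits in 32 GB" claim, here is a back-of-the-envelope memory estimate. The 1.2x overhead factor for KV cache and runtime buffers is an assumption, not a measured number.

```python
# Back-of-the-envelope RAM estimate for a quantized model.
# overhead is a rough multiplier for KV cache and runtime buffers (assumption).
def est_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # billions of params * bits, over 8 bits per byte
    return weight_gb * overhead

print(est_memory_gb(30, 4.5))       # ~4-bit quant of a 30B model -> roughly 20 GB, fits in 32 GB
print(est_memory_gb(30, 8.0))       # 8-bit quant -> roughly 36 GB, does not fit
print(est_memory_gb(30, 4.5) < 18)  # False: why 18 GB is tight for models of this size
```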


ahem ... cortex.build

Current test version runs in 8GB @ 60tks. Lmk if you want to join our early tester group!


Z.ai has GLM-4.7. It's almost as good, for about $8/mo.

Not sure if it's just me, but at least for my use cases (software dev, small-to-medium projects), Claude Opus + Claude Code beats OpenCode + GLM 4.7 by quite a margin. At least for me, Claude "gets it" eventually, while GLM will get stuck in a loop not understanding what the problem is or what I expect.

Right, GLM is close, but not close enough. If I have to spend $200 for an Opus fallback, I may as well just use Opus all the time. Still an unbelievable option if $200 is a luxury; the price-per-quality is absurd.

A local model within 18GB of RAM that has the same quality as Codex on high? Yeah, nah mate.

The best could be GLM 4.7 Flash, and I doubt it's close to what you want.


"run" as in run locally? There's not much you can do with that little RAM.

If remote models are OK, you could have a look at MiniMax M2.1 (minimax.io), GLM from z.ai, or Qwen3 Coder. You should be able to use all of these with your local OpenAI-based app.
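
Most of these providers expose OpenAI-compatible endpoints, so a sketch of pointing an existing OpenAI-based tool at one of them looks roughly like this. The base URL and model id below are placeholders; check the provider's docs for the real values.

```python
# Hedged sketch: reuse the standard OpenAI Python SDK against an
# OpenAI-compatible provider. base_url and model are placeholders, not
# the providers' real values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder provider endpoint
    api_key="YOUR_PROVIDER_KEY",
)

resp = client.chat.completions.create(
    model="glm-4.7",  # provider-specific model id (assumption)
    messages=[{"role": "user", "content": "Refactor this function to remove the global state."}],
)
print(resp.choices[0].message.content)
```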


antigravity is solid and has a generous free tier.

I'm just speculating (please chime in if you know more!).

Afaiu Protocol Buffers are very popular for over-the-wire communication, and in that use case Cap'n Proto does not win you much on performance (?).

And if you don't go over the wire, the performance difference might not matter for many use cases (e.g. there is plenty of JSON sent around as well). When it does matter, Cap'n Proto or something custom with the desired performance characteristics is chosen instead of Protocol Buffers, so Cap'n Proto is left covering a slim middle ground (?).

Plus the usual adoption dynamics: Protocol Buffers have more mindshare, and if another solution is not clearly way better for the use case, switching/trialing doesn't happen as much. (?)


I like how it (afaiu) leaves my codex config alone (meaning when I start codex the usual way I have the usual settings but when I launch it via ollama I get the model from ollama).

That said: many open weight models, while quite capable, are not a great 1:1 fit for every agent harness, and it's not easy to figure out whether an issue is with the model, the model size, the quantization, the inference parameters, something else, or a combination.

So for me it is a bit frustrating to pinpoint if a 'bad' experience is just 1 simple change away from a great experience.

For ollama specifically: make sure in settings you have set the context window size to something around 64k or 128k.

The default is 4k, which is not enough for agent workflows. I initially set it to 256k, but that was apparently too much (maybe I need more RAM).

Also it is possible that you have to restart ollama once you have changed the context window size or once you have downloaded the model (?).

Also check if the model you want to use is a good fit for your RAM, if in doubt pick a smaller model first.

I had good results out of the box with qwen3 coder and gpt 20b.
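
If you would rather pin the context window per request than rely on the settings UI, the ollama Python client accepts it as an option. A minimal sketch, assuming the qwen3-coder tag mentioned above is already pulled:

```python
# Minimal sketch: override the context window per request via the ollama
# Python client instead of the global settings. Model tag and the 64k value
# mirror the advice above; tune num_ctx to what your RAM can hold.
import ollama

resp = ollama.chat(
    model="qwen3-coder",  # assumes this tag is already pulled locally
    messages=[{"role": "user", "content": "Summarize the layout of this repo."}],
    options={"num_ctx": 65536},  # ~64k context instead of the small default
)
print(resp["message"]["content"])
```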


generative ai increases ambition, lowers barriers

more open source, better open source

perhaps also more forking (not only absolute but also relative)

contribution dynamics are also changing

I'm fairly optimistic that generative ai is good for open source and the commons

what I'm also seeing is that open source projects that had not-so-great ergonomics or user interfaces in general are now getting better thanks to generative ai

this might be the most directly noticeable change for users of niche open source


What do you think of the paper's research claims that the returns for maintainers are reduced and sharing is decreasing?

Without real innovation in financing models, what returns?

The same kind of returns that power research academia, where the amount of money you make is determined by the number of citations on your papers.

Except it's on GitHub, and it's forks and stars.


I upvoted your comment.

Also, it's a scarcity mindset.

I don't agree with the sibling to my comment ("make money by getting papers cited"): it is not a long-term solution, much as ad revenue is also a broken model for free software.

I'm hopeful that we see some vibe-coders get some products out that make money, and then pay to support the system they rely on for creating/maintaining their code.

Not sure what else to hope for, in terms of maintaining the public goods.


Any "exposure" economy has real money somewhere else turning the wheel. If that money isn't sensitive to healthy signal, neither is the downstream. Well-wishers are just alternative perpetual motion machine enthusiasts.

coding agents are fantastic for learning more about computers

they can not only generate code but also explain code, concepts, architecture and show you stuff

great learning tool

