Yeah, but I feel like we're over the hill on benchmaxxing. Many times a model has beaten Anthropic on a specific bench, but the 'feel' is that it's still not as good at coding.
I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.
Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.
The variety of tasks these models can do and will be asked to do is too wide and dissimilar, so it will be very hard to have a cross-cutting measurement. At most we will have area-specific consensus that model X or Y is better. It's like saying one person is the best coder at everything; that person does not exist.
The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, that feels a lot more informative than a simple benchmark, because it can shift as people individually discover the strong and weak points of what they're using and get better at it.
I don’t think this is even remotely true in practice.
Honestly, I have no idea what benchmarks are benchmarking. I don't write JavaScript or do anything remotely webdev-related.
The idea that all models have very close performance across all domains is a moderately insane take.
At any given moment the best model for my actual projects and my actual work varies.
Quite honestly, Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 was released, no one was particularly excited: the numbers were slightly bigger, but whatever. It took about a month before everyone realized "holy shit, this is a step-function improvement in usefulness." Benchmarks being +15% better on SWE-bench didn't mean a damn thing.
More like: open local models are becoming "good enough".
I got stuff done with Sonnet 3.7 just fine; it needed a bunch of babysitting, but it was still a net positive for productivity. Now local models are at that level and closing in on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
I'm just riding the VC powered wave of way-too-cheap online AI services and building tools and scaffolding to prepare for the eventual switch to local models =)
What would be a good coding model to run on an M3 Pro (18GB) to get a Codex-like workflow and quality? Essentially, I'm burning through my usage quickly when using Codex-High in VS Code on the $20 ChatGPT plan, and I'm looking for cheaper/free alternatives (even if a little slower, but the same quality). Any pointers?
Nothing. This summer I set up a dual 16GB GPU / 64GB RAM system, and nothing I could run was even remotely close. Big models that didn't fit in 32GB of VRAM had marginally better results but were at least an order of magnitude slower than what you'd pay for, and still much worse in quality.
I gave one of the GPUs to my kid to play games on.
I'm running unsloth/GLM-4.7-Flash-GGUF:UD-Q8_K_XL via llama.cpp on 2x 24G 4090s which fits perfectly with 198k context at 120 tokens/s – the model itself is really good.
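In case anyone wants to try something similar, here's a minimal sketch of that kind of multi-GPU setup via the llama-cpp-python bindings. The model path, context size, and GPU split below are my placeholders, not the exact flags from the setup above:

```python
# Rough sketch of a two-GPU llama.cpp setup via llama-cpp-python.
# Model path, context size, and tensor split are placeholders; tune to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.7-Flash-UD-Q8_K_XL.gguf",  # hypothetical local path to the GGUF
    n_ctx=131_072,            # 128k context as a starting point; larger needs more VRAM
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # spread the weights across the two 24GB cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```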
Short answer: there is none. You can't get frontier-level performance from any open source model, much less one that would work on an M3 Pro.
If you had more like 200GB of RAM, you might be able to run something like MiniMax M2.1 and get last-gen performance at something resembling usable speed - but it's still a far cry from Codex on high.
at the moment, I think the best you can do is qwen3-coder:30b -- it works, and it's nice to get some fully-local llm coding up and running, but you'll quickly realize that you've long tasted the sweet forbidden nectar that is hosted llms. unfortunately.
They are spending hundreds of billions of dollars on data centers filled with GPUs that cost more than an average car and then months on training models to serve your current $20/mo plan. Do you legitimately think there's a cheaper or free alternative that is of the same quality?
I guess you could technically run the huge leading open weight models using large disks as RAM and have close to the "same quality" but with "heat death of the universe" speeds.
Not sure if it's just me, but at least for my use cases (software dev, small-to-medium projects), Claude Opus + Claude Code beats OpenCode + GLM 4.7 by quite a margin. At least for me, Claude "gets it" eventually, while GLM will get stuck in a loop, not understanding what the problem is or what I expect.
Right, GLM is close, but not close enough. If I have to spend $200 on Opus as a fallback, I may as well just use it all the time. Still an unbelievable option if $200 is a luxury for you; the price-per-quality is absurd.
"run" as in run locally? There's not much you can do with that little RAM.
If remote models are OK, you could have a look at MiniMax M2.1 (minimax.io), GLM from z.ai, or Qwen3 Coder. You should be able to use all of these with your local OpenAI app.
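As far as I know, all three of those providers expose OpenAI-compatible endpoints, so wiring one up looks roughly like the sketch below; the base URL, env var, and model id are placeholders you'd swap for the provider's actual values:

```python
# Minimal sketch of pointing the OpenAI Python client at an
# OpenAI-compatible provider (GLM, MiniMax, Qwen, etc.).
# base_url, env var, and model name are placeholders; check the provider's docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # placeholder env var
)

resp = client.chat.completions.create(
    model="glm-4.7",  # placeholder model id
    messages=[{"role": "user", "content": "Refactor this function to be iterative."}],
)
print(resp.choices[0].message.content)
```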
I'm just speculating (please chime in if you know more!).
Afaiu protocol buffers are very popular for over-the-wire communication, and in that use case Cap'n Proto doesn't win you much on performance (?).
And if you don't go over the wire, the performance difference might not matter for many use cases (e.g. there is plenty of JSON sent around as well). When it does matter, Cap'n Proto or something custom with the desired performance characteristics gets chosen instead of protocol buffers, but Cap'n Proto only covers a slim middle ground there (?).
Plus the usual adoption dynamics: protocol buffers have more mindshare, and if another solution isn't clearly way better for the use case, switching or even trialing doesn't happen as much (?).
I like how it (afaiu) leaves my codex config alone (meaning when I start codex the usual way I get the usual settings, but when I launch it via ollama I get the model from ollama).
That said, many open-weight models, while quite capable, are not a great 1:1 fit for every agent harness, and it's not easy to figure out whether an issue is with the model, the model size, the quantization, the inference parameters, something else, or a combination.
So for me it is a bit frustrating trying to pinpoint whether a 'bad' experience is just one simple change away from a great experience.
For ollama specifically: make sure in settings you have set the context window size to something around 64k or 128k.
The default is 4k, which is not enough for agent workflows. I initially set it to 256k, but that was apparently too much (maybe I need more RAM).
Also, it's possible that you have to restart ollama after changing the context window size or after downloading the model (?).
Also check that the model you want to use is a good fit for your RAM; if in doubt, pick a smaller model first.
I had good results out of the box with qwen3 coder and gpt 20b.
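For what it's worth, the context-window knob can also be set per request. Here's a minimal sketch using the ollama Python client, with the model name and the ~64k value taken from the suggestions above (treat it as an illustration, not a tuned config):

```python
# Sketch of requesting a larger context window per call via the ollama
# Python client; 65536 matches the ~64k suggestion above, tune it to
# your RAM and model.
import ollama

resp = ollama.chat(
    model="qwen3-coder:30b",  # one of the models mentioned above
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    options={"num_ctx": 65536},  # context window; the default is much smaller
)
print(resp["message"]["content"])
```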
Perhaps also more forking (not only in absolute numbers but also in relative terms).
contribution dynamics are also changing
I'm fairly optimistic that generative ai is good for open source and the commons
What I'm also seeing is that open source projects that had not-so-great ergonomics or user interfaces in general are now getting better thanks to generative AI.
this might be the most directly noticeable change for users of niche open source
I don't agree with the sibling to my comment: "make money by getting papers cited". It is not a long-term solution, much as ad revenue is also a broken model for free software.
I'm hopeful that we see some vibe-coders get some products out that make money, and then pay to support the system they rely on for creating/maintaining their code.
Not sure what else to hope for, in terms of maintaining the public goods.
Any "exposure" economy has real money somewhere else turning the wheel. If that money isn't sensitive to healthy signal, neither is the downstream. Well-wishers are just alternative perpetual motion machine enthusiasts.