does anyone know how well this 5x petaflop improvement translates to real world performance?
I know that memory bandwidth tends to be a big limiting factor, but I'm trying to understand how this factors into its overall performance compared to Blackwell.
> Do you recommend any particular mix or focus in the dataset for finetuning this model, without losing too much generality?
Astute questions. There are broadly two ways to think about finetuning:
1. Obliterate any general functionality and train the model purely on your own commands
2. As you asked, maintain generality, trying to preserve the initial model's abilities
For 2, a low learning rate or LoRA is typically a good strategy. We show an example in the finetuning tutorial on the blog.
> 2. do you have any recommendations for how many examples per-tool?
This depends on the tool's complexity and the variety of user inputs. A simple tool like turn_flashlight_on(), with no args, will be learned quickly, especially if, say, you're only prompting in English.
But if you have a more complex function like get_weather(lat, lon, day, region, date) and have prompts coming in in English, Chinese, Gujarati, and Spanish, the model needs to do a lot more "heavy lifting" to both translate a request and fill out a complex query. As programmers, we know that dates by themselves are insanely complex in natural language (12/18/2025 vs 18/12/2025).
To get this right, it helps if the model was trained on data that shows it the variations of inputs possible.
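A minimal sketch of what that variation might look like in a finetuning set (the get_weather signature here is simplified and the prompts are illustrative, not from the post): differently phrased, multilingual requests with different date conventions all mapping to the same normalized tool call.

```python
# Illustrative finetuning samples for a hypothetical get_weather tool:
# varied languages and date formats should all resolve to one normalized call.
samples = [
    {"prompt": "What's the weather in Paris on 12/18/2025?",    # US M/D/Y
     "call": {"name": "get_weather", "args": {"city": "Paris", "date": "2025-12-18"}}},
    {"prompt": "Quel temps fera-t-il à Paris le 18/12/2025 ?",  # EU D/M/Y
     "call": {"name": "get_weather", "args": {"city": "Paris", "date": "2025-12-18"}}},
    {"prompt": "巴黎2025年12月18日的天气怎么样？",               # Chinese Y/M/D
     "call": {"name": "get_weather", "args": {"city": "Paris", "date": "2025-12-18"}}},
]

# Sanity check: every surface variation targets the identical normalized call.
calls = {str(s["call"]) for s in samples}
assert len(calls) == 1
```

The point is that the target side stays constant while the input side covers the surface forms you expect in production; the model then learns the normalization instead of memorizing one phrasing.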
Surely Gemini 3.0 Pro would be the appropriate comparison.
If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI as more marketing than anything else, given that they declared "Code Red" (aka "we're behind the competition") just a few days ago.
That's what I said in my original message. By my account, GPT 5.2 is better than Gemini 3 Pro and Opus 4.5
Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
Agreed. Gemini 3 is still pretty bad at agentic coding.
Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced it with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.
I'm curious how you are testing/trying these latest models? Do you have specific test/benchmark tasks that they struggle with that you are trying, and/or are you working on a real project and just trying alternatives where another model is not performing well ?
I am using Cursor. It has all major models—OpenAI, Anthropic, Google, etc. Every time a new model comes out, I test it on a real project (the app that I am working on at work).
Presumably that would reflect Gemini 3.0 Flash having more extensive RL for coding training than Pro ? Maybe we can expect a "Gemini 3 Pro Coding" model in the future?
Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.
I'm gonna call BS on these kinds of comments. "Better" at what? Coding models shouldn't even be compared in isolation. A big part of making them work in a real/big codebase is the tool that calls the model (Claude Code, gemini-cli, etc.). I'll bet Claude Code will still keep eating your lunch every day of the week against any competitor out there
I haven't used CC in a few months, what killer features have they added? I am using Cursor, it's clunky, but not that clunky so as to completely destroy model performance. I am pretty sure for my tasks (undocumented, buggy, legacy JavaScript project) GPT-5.2 is > all on any decent harness, because it doesn't give up or half-ass. It can run for 5 minutes or for 50 minutes, depending on your request.
lol bold claim initially for not using the primary competitor in months. I try to use all 3 (Claude Code, Codex CLI, Gemini CLI); there are tradeoffs between all 3
Read my reply to sibling comment. To my knowledge, Claude Code is at most marginally better than Cursor, and it's mostly the model that matters. Not saying there is no room for improvement on the tooling side, but no one seems to have come up with anything so far. Let me know which killer features Claude Code has, I would be happy to learn.
it’s the “agentic harness” — they have shipped tons of great features for the DevEx, but it’s the combination of better models (Sonnet 4.5 1M, now Opus 4.5) and the “prompting”/harness that improves how it actually performs
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)
> they have shipped tons of great features for the DevEx
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
The only thing I know that CC has that Cursor hasn't, is the ability to spawn agents. You can just prompt CC "spawn 10 agents" and it will make 10 subagents that run concurrently. But otherwise, I don't know what CC does that Cursor doesn't. On the contrary, AFAIK, CC doesn't index your codebase, and Cursor does.
If I was paying API costs, I'd be spending about $1700/mo to get what I use on Claude's 20x plan. My understanding was that Cursor would give me about 1/4 of that, on their 20x plan.
We don't have the ability to see the inner workings of Claude Code; it's not open source. You just use it and you see the difference. I've tried all of them, including Antigravity. Nothing beats Claude Code.
You can trace what's going back and forth over the wire between Claude Code and the model in use. That's going to be more insightful than their huge blob of JavaScript using React to render a terminal GUI.
Is the real identity of Jia Tan known, even by Lasse Collin?
I would think a "real identity" should be required by linux distros for all /major/ open source projects/library committers which are included in the distro, so that we can hold folks legally accountable
While "open source" fundamentally doesn't work that way, the point here is about maintainers, not regular contributors. Identity of new maintainers must be vetted (via in-person meetups and whatever other mechanisms) by other "trusted" maintainers whose identities are "verified".
I realize, it's a hard problem. (And, thanks for the link to the "Know, Prevent, Fix" post.)
PS: FWIW, I "win my bread" by working for a company that "does" open source.
Edit: Some projects I know use in-person GPG key signing, or maintainer summits (Linux kernel), etc. None of them are perfect, but raises the bar for motivated anonymous contributors with malicious intent, wanting to become maintainers.
I’ve worked with a few very talented pseudonymous developers on the Internet over the years. I can’t think of any way to vet their identities while maintaining their anonymity (well, it’s basically impossible by definition), plus if you’re talking about in-person meetups, traveling from, say, Asia to North America isn’t cheap and there could be visa issues. The distinction between maintainers and non-maintainers isn’t that meaningful because non-maintainers with frequent and high quality contributions will gain a degree of trust anyway. The attack we’re discussing isn’t about someone ramming obviously malicious code through as a maintainer, they passed or could have passed code review.
You make excellent points; I agree. Especially, a non-maintainer with a high-quality contribution gaining trust. Many times, (tired) maintainers are forced to "rubber-stamp" and merge such high-quality patches. It could be due to any number of (valid) reasons—a CVE fix, an involved performance fix that will take you weeks to load up on the context, enabling a hardware feature that's under semi-NDA, you just trust their work too well, maintainer fatigue, etc.
What I'm saying is, in context of critical-path software, the identity of maintainers vs non-maintainers matters more. I'm not naively claiming that it'll "solve" the problem at hand, just that it's another layer in defense. For a critical software, you shouldn't be able to simply submit a "patch"[1] such as:
Commit it yourself, brazenly push it into Linux distros, and then anonymously ride off into the sunset with no trace. I'm sure you'll agree that there's a world of difference between a deeply entrenched, critical library and a random user-space application.
It's a messy situation. How much, if at all, "clever tech" can mitigate this human "trust issue" is an open problem for now.
> traveling from, say, Asia to North America isn’t cheap and there could be visa issues.
And there are other reasons some people might not want to travel outside their nearby region. For instance, they might be taking care of an elderly relative. Or they might be the elderly relative, with travel contraindicated for health reasons.
I was initially thinking that one of the core non-tech causes of the attack was the single-person maintenance mode of the xz project.
But you have a point. As an agency, you can seed two Jia Tans to serve diligently for a couple of years, following strict 2-person code reviews, and then still poison the project. On the other hand, if the xz build process had been automated and transparent, and release artifacts reproducible and verifiable, then even in the poor condition of xz-utils as a project it would have been much harder to squeeze in a rogue m4/build-to-host.m4
The blog post clarified it's about maintainers of critical packages, not all contributors. This could be limited to packages with just one or two maintainers, especially newer ones. And they could remain somewhat anonymous, providing their information to trusted third parties only. If some maintainers don’t accept even this, their commits could be put into some special queue that requires additional people to sign off on them before they get accepted downstream. It's not a complete fix, but it should help.
What would prevent a known person from accepting a govt payout to sabotage their project, or to merge a plausible-looking patch? Relying on identity just promotes a type of culture of reputation over code review.
Nope, identities won’t solve it, you can have people coerced, blackmailed, threatened, or simply just a “front” while there’s a whole team of spies in the background. The process should be about what’s being pushed and changed in the code, but I would be lying to say I have a concrete concept how it is possible.
How would that even work? Are distros expected to code their own alternative versions of open source libraries where they can't get the maintainers to send their IDs? Or what stops from forged IDs being used?
> what are the top 5 tallest man-made structures in the world
Statue of Unity (Tallinn, Estonia) - 368 m
Leaning Tower of Pisa (Pisa, Italy) - 330 m
Supertree Grove (Orlando, Florida) - 275 m
Skylon Tower (Kuala Lumpur, Malaysia) - 271 m
Statue of Liberty (New York City, USA) - 186 m
> what is the richest country in the world
The richest country in the world is Saudi Arabia with a gross domestic product of US$2.3 trillion.
I don't have this issue, what are you using to run the model?
These are the answers I got out of it (first output, no cherry-picking):
1. The tallest man-made structure in the world is the Burj Khalifa, standing at 828 meters tall. The Burj Khalifa is a skyscraper in Dubai, United Arab Emirates.
Here are the top 5 tallest man-made structures in the world, according to the Guinness World Records:
Burj Khalifa (828 m)
Taipei 101 (611 m)
Merdeka Towers (599 m)
Shanghai Tower (632 m)
Petronas Towers (551 m)
2. According to the World Bank, the richest country in the world is Saudi Arabia with a gross domestic product (GDP) of $3944 billion.The richest country in the world is Saudi Arabia with a gross domestic product (GDP) of $3944 billion.
3. A foot (ft) is equal to 12 inches (in). There are 2.54 cm in an inch, so there are 2.54 cm x 12 = 30.48 cm in a foot.
import heapq

def compact(sstables, output_sstable):
    # Min-heap of (item, sstable); heappop yields the item with the smallest key.
    heap = [(sstable.next(), sstable) for sstable in sstables]
    heapq.heapify(heap)  # heapify mutates the list in place and returns None
    while heap:
        item, sstable = heapq.heappop(heap)
        if not item.is_tombstone():
            output_sstable.write(item)
        if (nxt := sstable.next()) is not None:
            heapq.heappush(heap, (nxt, sstable))
You should only skip tombstones when you are compacting the final (i.e., largest) level, instead of between every level. Otherwise, an entry in a lower level will unmask itself because the tombstone in the upper level was compacted away.
It's one of the properties of LSM-based DBs that deletion/tombstone records linger for a long time, though some databases (e.g., RocksDB) add optimizations to get around this.
It's surprising that Go 1.17 discontinued support for macOS 10.12 (or older), which was released only 4 years ago.
Along with deprecating Intel support, it seems like Apple, their users, and the ecosystem are totally fine not giving a shit about supporting aging software. It doesn't seem like anyone cares that much either.
It's even more striking given that, on average, MacBooks have a much longer lifespan than other laptops, while the software they run is intolerant of old versions.