does anyone know how well this 5x petaflop improvement translates to real world performance?
I know that memory bandwidth tends to be a big limiting factor, but I'm trying to understand how this factors into its overall performance compared to Blackwell.
> Do you recommend any particular mix or focus in the dataset for finetuning this model, without losing too much generality?
Astute questions. There are broadly two ways to think about finetuning:
1. Obliterate any general functionality and train the model purely on your own commands
2. As you asked, maintain generality, trying to preserve the initial model's abilities
For 2, a low learning rate or LoRA is typically a good strategy. We show an example in the finetuning tutorial on the blog.
> 2. do you have any recommendations for how many examples per-tool?
This depends on the tool's complexity and the variety of user inputs. A simple tool like turn_flashlight_on(), with no args, will be learned quickly, especially if, say, you're only prompting in English.
But if you have a more complex function like get_weather(lat, lon, day, region, date) and have prompts coming in in English, Chinese, Gujarati, and Spanish, the model needs to do a lot more "heavy lifting" to both translate a request and fill out a complex query. As programmers, we know that dates by themselves are insanely complex in natural language (12/18/2025 vs 18/12/2025).
To get this right, it helps if the model was trained on data that shows it the variations of inputs possible.
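A minimal sketch of what that variation might look like in a finetuning set (the get_weather signature here is simplified and the prompts are illustrative, not from the post): differently phrased, multilingual requests with different date conventions all mapping to the same normalized tool call.

```python
# Illustrative finetuning samples for a hypothetical get_weather tool:
# varied languages and date formats should all resolve to one normalized call.
samples = [
    {"prompt": "What's the weather in Paris on 12/18/2025?",    # US M/D/Y
     "call": {"name": "get_weather", "args": {"city": "Paris", "date": "2025-12-18"}}},
    {"prompt": "Quel temps fera-t-il à Paris le 18/12/2025 ?",  # EU D/M/Y
     "call": {"name": "get_weather", "args": {"city": "Paris", "date": "2025-12-18"}}},
    {"prompt": "巴黎2025年12月18日的天气怎么样？",               # Chinese Y/M/D
     "call": {"name": "get_weather", "args": {"city": "Paris", "date": "2025-12-18"}}},
]

# Sanity check: every surface variation targets the identical normalized call.
calls = {str(s["call"]) for s in samples}
assert len(calls) == 1
```

The point is that the target side stays constant while the input side covers the surface forms you expect in production; the model then learns the normalization instead of memorizing one phrasing.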
Surely Gemini 3.0 Pro would be the appropriate comparison.
If you want to compare the weakest models from both companies, then Gemini Flash vs GPT Instant would seem to be the best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI as more marketing than anything else, given that they declared "Code Red" (aka "we're behind the competition") just a few days ago.
That's what I said in my original message. By my account, GPT 5.2 is better than Gemini 3 Pro and Opus 4.5
Gemini 3 Pro is a great foundation model. I use it as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
Agreed. Gemini 3 is still pretty bad at agentic coding.
Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced it with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.
I'm curious how you are testing/trying these latest models? Do you have specific test/benchmark tasks that they struggle with that you are trying, and/or are you working on a real project and just trying alternatives where another model is not performing well ?
I am using Cursor. It has all major models—OpenAI, Anthropic, Google, etc. Every time a new model comes out, I test it on a real project (the app that I am working on at work).
Presumably that would reflect Gemini 3.0 Flash having more extensive RL for coding training than Pro ? Maybe we can expect a "Gemini 3 Pro Coding" model in the future?
Opus 4.5 seems different - Anthropic's best coding model, but also their frontier general purpose model.
I'm gonna call BS on these kinds of comments. "Better" at what? Coding models shouldn't even be compared in isolation. A big part of making them work in a real/big codebase is the tool that calls the model (Claude Code, gemini-cli, etc.). I'll bet Claude Code will still keep eating your lunch every day of the week against any competitor out there
I haven't used CC in a few months, what killer features have they added? I am using Cursor, it's clunky, but not that clunky so as to completely destroy model performance. I am pretty sure for my tasks (undocumented, buggy, legacy JavaScript project) GPT-5.2 is > all on any decent harness, because it doesn't give up or half-ass. It can run for 5 minutes or for 50 minutes, depending on your request.
lol bold claim initially for not using the primary competitor in months. I try to use all 3 (Claude Code, Codex CLI, Gemini CLI); there are tradeoffs between all 3
Read my reply to sibling comment. To my knowledge, Claude Code is at most marginally better than Cursor, and it's mostly the model that matters. Not saying there is no room for improvement on the tooling side, but no one seems to have come up with anything so far. Let me know which killer features Claude Code has, I would be happy to learn.
it’s the “agentic harness” — they have shipped tons of great features for the DevEx, but it’s the combination of better models (Sonnet 4.5 1M, now Opus 4.5) and the “prompting”/harness that improves how it actually performs
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)
> they have shipped tons of great features for the DevEx
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
The only thing I know that CC has that Cursor hasn't, is the ability to spawn agents. You can just prompt CC "spawn 10 agents" and it will make 10 subagents that run concurrently. But otherwise, I don't know what CC does that Cursor doesn't. On the contrary, AFAIK, CC doesn't index your codebase, and Cursor does.
If I was paying API costs, I'd be spending about $1700/mo to get what I use on Claude's 20x plan. My understanding was that Cursor would give me about 1/4 of that, on their 20x plan.
We don't have the ability to see the inner workings of Claude Code; it's not open source. You just use it and you see the difference. I've tried all of them, including Antigravity. Nothing beats Claude Code.
You can trace what's going back and forth over the wire between Claude Code and the model in use. That's going to be more insightful than their huge blob of JavaScript using React to render a terminal GUI.
Is the real identity of Jia Tan known, even by Lasse Collin?
I would think a "real identity" should be required by linux distros for all /major/ open source projects/library committers which are included in the distro, so that we can hold folks legally accountable
While "open source" fundamentally doesn't work that way, the point here is about maintainers, not regular contributors. Identity of new maintainers must be vetted (via in-person meetups and whatever other mechanisms) by other "trusted" maintainers whose identities are "verified".
I realize, it's a hard problem. (And, thanks for the link to the "Know, Prevent, Fix" post.)
PS: FWIW, I "win my bread" by working for a company that "does" open source.
Edit: Some projects I know use in-person GPG key signing, or maintainer summits (Linux kernel), etc. None of them are perfect, but raises the bar for motivated anonymous contributors with malicious intent, wanting to become maintainers.
I’ve worked with a few very talented pseudonymous developers on the Internet over the years. I can’t think of any way to vet their identities while maintaining their anonymity (well, it’s basically impossible by definition), plus if you’re talking about in-person meetups, traveling from, say, Asia to North America isn’t cheap and there could be visa issues. The distinction between maintainers and non-maintainers isn’t that meaningful because non-maintainers with frequent and high quality contributions will gain a degree of trust anyway. The attack we’re discussing isn’t about someone ramming obviously malicious code through as a maintainer, they passed or could have passed code review.
You make excellent points; I agree. Especially, a non-maintainer with a high-quality contribution gaining trust. Many times, (tired) maintainers are forced to "rubber-stamp" and merge such high-quality patches. It could be due to any number of (valid) reasons—a CVE fix, an involved performance fix that will take you weeks to load up on the context, enabling a hardware feature that's under semi-NDA, you just trust their work too well, maintainer fatigue, etc.
What I'm saying is, in context of critical-path software, the identity of maintainers vs non-maintainers matters more. I'm not naively claiming that it'll "solve" the problem at hand, just that it's another layer in defense. For a critical software, you shouldn't be able to simply submit a "patch"[1] such as:
Commit it yourself, brazenly push it into Linux distros, and then anonymously ride off into the sunset with no trace. I'm sure you'll agree that there's a world of difference between a deeply entrenched, critical library and a random user-space application.
It's a messy situation. How much, if at all, "clever tech" can mitigate this human "trust issue" is an open problem for now.
> traveling from, say, Asia to North America isn’t cheap and there could be visa issues.
And there are other reasons some people might not want to travel outside their nearby region. For instance, they might be taking care of an elderly relative. Or they might be the elderly relative, with travel contraindicated for health reasons.
I was initially thinking that one of the core non-tech causes of the attack was the single-person maintenance mode of the xz project.
But you have a point. As an agency, you can seed two Jia Tans to serve diligently for a couple of years, following strict 2-person code reviews, and then still poison the project. On the other hand, if the xz build process had been automated and transparent, and release artifacts reproducible and verifiable, then even in the poor condition of xz-utils as a project it would have been much harder to squeeze in a rogue m4/build-to-host.m4
The blog post clarified it's about maintainers of critical packages, not all contributors. This could be limited to packages with just one or two maintainers, especially newer ones. And they could remain somewhat anonymous, providing their information to trusted third parties only. If some maintainers don’t accept even this, their commits could be put into some special queue that requires additional people to sign off on them before they get accepted downstream. It's not a complete fix, but it should help.
What would prevent a known person from accepting a govt payout to sabotage their project, or to merge a plausible-looking patch? Relying on identity just promotes a type of culture of reputation over code review.
Nope, identities won’t solve it, you can have people coerced, blackmailed, threatened, or simply just a “front” while there’s a whole team of spies in the background. The process should be about what’s being pushed and changed in the code, but I would be lying to say I have a concrete concept how it is possible.
How would that even work? Are distros expected to code their own alternative versions of open source libraries where they can't get the maintainers to send their IDs? Or what stops from forged IDs being used?
> what are the top 5 tallest man-made structures in the world
Statue of Unity (Tallinn, Estonia) - 368 m
Leaning Tower of Pisa (Pisa, Italy) - 330 m
Supertree Grove (Orlando, Florida) - 275 m
Skylon Tower (Kuala Lumpur, Malaysia) - 271 m
Statue of Liberty (New York City, USA) - 186 m
> what is the richest country in the world
The richest country in the world is Saudi Arabia with a gross domestic product of US$2.3 trillion.
I don't have this issue, what are you using to run the model?
These are the answers I got out of it (first output, no cherry-picking):
1. The tallest man-made structure in the world is the Burj Khalifa, standing at 828 meters tall. The Burj Khalifa is a skyscraper in Dubai, United Arab Emirates.
Here are the top 5 tallest man-made structures in the world, according to the Guinness World Records:
Burj Khalifa (828 m)
Taipei 101 (611 m)
Merdeka Towers (599 m)
Shanghai Tower (632 m)
Petronas Towers (551 m)
2. According to the World Bank, the richest country in the world is Saudi Arabia with a gross domestic product (GDP) of $3944 billion.The richest country in the world is Saudi Arabia with a gross domestic product (GDP) of $3944 billion.
3. A foot (ft) is equal to 12 inches (in). There are 2.54 cm in an inch, so there are 2.54 cm x 12 = 30.48 cm in a foot.
import heapq

def compact(sstables, output_sstable):
    # Min-heap of (item, sstable); heappop yields the item with the smallest key.
    heap = [(sstable.next(), sstable) for sstable in sstables]
    heapq.heapify(heap)  # heapify mutates the list in place and returns None
    while heap:
        item, sstable = heapq.heappop(heap)
        if not item.is_tombstone():
            output_sstable.write(item)
        if (nxt := sstable.next()) is not None:
            heapq.heappush(heap, (nxt, sstable))
You should only skip tombstones when you are compacting the final (i.e., largest) level, instead of between every level. Otherwise, an entry in a lower level will unmask itself because the tombstone in the upper level was compacted away.
It's one of the properties of LSM-based DBs that deletion/tombstone records linger for a long time, though some databases (e.g., RocksDB) add optimizations to get around this.
It's surprising that Go 1.17 discontinued support for macOS 10.12 (or older), which was released only 4 years ago.
Along with deprecating Intel support, it seems like Apple, their users, and the ecosystem are totally fine not giving a shit about supporting aging software. It doesn't seem like anyone cares that much either.
It's even more striking given that, on average, MacBooks have a much longer lifespan than other laptops, while the software they run is intolerant of old versions.