Hacker News | Imanari's comments

How well do the Gemma 4 models perform on agentic coding? What are your impressions?

I will also leave this here

https://github.com/shareAI-lab/learn-claude-code/tree/main/a...

I found it excellent in explaining a CC-like coding agent in layers.


Isn’t this just kicking the can down the road?

> but the LLM is rediscovering knowledge from scratch on every question

Unless the wiki stays fully in context, the LLM now has to re-read the wiki instead of re-reading the source files. Also, this will introduce and accumulate subtle errors as we start to regurgitate second-order information.

I totally get the idea but I think next gen models with 10M context and/or 1000tps will make this obsolete.


> I totally get the idea but I think next gen models with 10M context and/or 1000tps will make this obsolete.

We've already got 1m context, 800k context, and they still start "forgetting" things around the 200k - 300k mark.

What use is 10M context if degradation starts at 200k - 300k?


I use a home-baked system built on Obsidian that is essentially just “Obsidian, but with a structured format and schemas on top,” and I deploy this in multiple places with a range of end users. It is more valuable than you think. The intermediary layer is great for capturing the intent of a design and determining when the implementation diverges from that. There will always be a divergence between the intent of a system and how it actually behaves, and the code itself doesn’t capture that. The intermediate layer is lossy, it’s messy, it goes out of date, but it’s highly effective.

It’s not what this person is describing though. A self referential layer like this that’s entirely autonomous does feel completely valueless - because what is it actually solving? Making itself more efficient? The frontier model providers will be here in 3 weeks doing it better than you on that front. The real value is having a system that supports a human coming in and saying “this is how the system should actually behave”, and having the system be reasonably responsive to that.

I feel like a lot of exercises like the OP's are interesting but ultimately futile. You will not have the money these frontier providers do, and you do not have remotely the amount of information that they do on how to squeeze out the most efficiency in how they work. Best bet is to just stick with the vanilla shit until the firehose of innovation slows down to something manageable, because otherwise the abstraction you build is gonna be completely irrelevant in two months.


Interesting, I'd love to know more. Are parts of it public?

Indeed, I have it open source, but want to preserve my anonymity here. The main gist of it is Quartz as a static site frontend bundle, backed by Decap as an editor, so that non-technical users can edit documents. The validation is twofold: frontmatter is validated by a typical YAML validator library, and then I created markdown body validation using some popular markdown AST libraries, so there are two sets of schemas, one for the frontmatter and one for the body, and documents must conform via CI. I ship it with a basic CLI that essentially does validation and has a few other utilities. Not really that much magic, maybe 500 lines of code or so in the CLI and another few hundred lines doing validation and the other utilities. It's all in TypeScript, so I use the same validation in Decap when people do edits.
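The two-layer validation idea (frontmatter schema plus body-structure rules, enforced in CI) can be sketched without any libraries. The commenter's real system is TypeScript with YAML and markdown AST libraries; this is just a stdlib-only Python sketch of the concept, and the schema keys and body rule here are invented for illustration:

```python
import re

# Hypothetical schema: required frontmatter keys (real systems would also
# check types, enums, date formats, etc.).
FRONTMATTER_SCHEMA = {"title", "status"}

def split_frontmatter(doc: str):
    """Split a markdown document into (frontmatter dict, body)."""
    m = re.match(r"^---\n(.*?)\n---\n(.*)$", doc, re.DOTALL)
    if not m:
        return None, doc
    fm = {}
    for line in m.group(1).splitlines():
        key, _, value = line.partition(":")
        fm[key.strip()] = value.strip()
    return fm, m.group(2)

def validate(doc: str) -> list[str]:
    """Return a list of validation errors; an empty list means conformance."""
    errors = []
    fm, body = split_frontmatter(doc)
    if fm is None:
        return ["missing frontmatter block"]
    # Schema set 1: frontmatter.
    for key in sorted(FRONTMATTER_SCHEMA):
        if key not in fm:
            errors.append(f"frontmatter missing required key: {key}")
    # Schema set 2: body structure (toy rule: must open with an H1).
    if not body.lstrip().startswith("# "):
        errors.append("body must start with an H1 heading")
    return errors

doc = """---
title: Design notes
status: draft
---
# Overview
The intent of the system is...
"""
print(validate(doc))  # → []
```

In CI, a non-empty error list would fail the build, which is what forces documents to stay in conformance.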

The “next gen of models” argument is a valid one, and one I think of often, but if you truly buy it, you would stop creating anything, since the next gen of models could make it obsolete.

The goal isn’t to keep everything in context every time, it’s to make the memory queryable. Like a data lake, but for your ideas and decisions.

This solves for now, and it solves for the future.

Right now, it lets you condense the findings that interest you from a handful of papers.

In the future, it solves for condensing your interest in a whole field down to a handful of papers or fewer.


It is how I feel when I do it. And it certainly shows over time.

Maybe to be better able to restart the process and not lose track.

Fascinating! I wonder if new training techniques could emerge from this. If we say layer 1 = translator, layers 2-5 = reasoner, layer 6 = re-translator, could we train small 6-layer models but evaluate their performance in a 1 > n*(2-5) > 6 setup, to directly train towards optimal middle layers that can be looped? You'd only have to train 6 layers but get the duplication benefit of the middle layers for free.
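The 1 > n*(2-5) > 6 wiring can be illustrated with a toy numeric sketch. The three functions below are stand-ins (not a real model): in an actual experiment each would be a transformer block, and the loop count n would be varied during training so the shared middle block learns to be iterable:

```python
# Toy stand-ins for the three roles in the proposed setup.
def translate_in(x):   # "layer 1": map input into the latent space
    return x * 2.0

def reason_block(h):   # "layers 2-5": the shared, loopable middle block
    return h + 1.0

def translate_out(h):  # "layer 6": map the latent back to output space
    return h / 2.0

def forward(x, n_loops):
    """Run inference with the middle block applied n_loops times."""
    h = translate_in(x)
    for _ in range(n_loops):
        h = reason_block(h)
    return translate_out(h)

# Same "six trained layers", but variable inference-time compute:
print(forward(3.0, 1))  # → 3.5
print(forward(3.0, 4))  # → 5.0
```

The point of the sketch is that depth becomes an inference-time knob: training would evaluate the same small model at several loop counts so the middle block is optimized to compose with itself.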


Yes, training directly for a diverse mix of "looped" inference procedures makes a lot of sense as a way of allowing for increased inference-time compute. It would likely be complementary to the usual thinking approach, which essentially runs the "loop" LLM-wide - and, critically, yields interpretable output which lets us see what the LLM is thinking about.


I don't know who you are or how you are so sure about 'what top labs are actually doing', but I have a similar feeling about the issue. The models don't have to 'actually learn'; the setup just has to approximate 'actual learning' well enough to be useful.

> AND it can inherit all the accumulated memories/docs from its predecessor.

So we are talking about a whole system, not just the model? Reminds me of something I heard a while back: 'AGI will be a product, not a model.'


It reminds me of the standard counter to the Chinese Room thought experiment: the person inside doesn’t understand Chinese, but the system _does_. The person, the rules, and the lookup tables together form the thing doing the understanding.


How is gemini 3.1 doing in agentic harnesses? Did they catch up?


"I want to implement <XYZ; oftentimes I use the mic and just ramble[0]>. Please explore the codebase and figure out how things work. Write down any questions you have. Then write an implementation plan. Do all of this in a dedicated markdown file."

The questions are usually 80% useless, but the other 20% often do point me to stuff I have not considered.

Then I edit the markdown manually or discuss some parts with the agent.

"go ahead and implement"

[0] - https://github.com/cjpais/Handy


I have been very impressed with this model and also with the Kimi CLI. I have been using it on the 'Moderato' plan (7 days free, then $19). A true competitor to Claude Code with Opus.


How does it fare against CC?


Anecdotally, I've cancelled my Claude Code subscription after using Kimi K2.5 and Kimi CLI for the last few days. It's handled everything I've thrown at it. It is slower at the moment, but I expect that will improve.

