Hacker News | andrewchilds's comments


Many people have reported that Opus 4.6 is a step back from Opus 4.5, consuming 5-10x as many tokens as 4.5 to accomplish the same task: https://github.com/anthropics/claude-code/issues/23706

I haven't seen a response from the Anthropic team about it.

I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.


Keep in mind that the people who experience issues will always be the loudest.

I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.

A recent Claude Code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized code bases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style; hard to say.


Keep in mind that people who point out a regression and measure the actual token counts, which cost real money, aren't just "being loud": someone diffed session context usage and found 4.6 burning >7x the context on a task that 4.5 did in under 2 MB.

It's not that they don't have a point, it's that everyone who's finding 4.6 to be fine or great are not running out to the internet to talk about it.

Being a moderately frequent user of Opus, and having spoken to people who use it actively at work for automation, it's a really expensive model to run. I've heard of it burning through a company's weekend credit allocation before Saturday morning. I think using almost an order of magnitude more tokens is a valid consumer concern!

I have yet to hear anyone say "Opus is really good value for money, a real good economic choice for us." It seems that we're trying to retrofit every possible task with SOTA AI that is still severely lacking in solid reasoning and reliability/dependability, so we throw more money at the problem (cough, Opus) in the hope that it will surpass that barrier of trust.


I've also seen Opus 4.6 as a pure upgrade. In particular, it's noticeably better at debugging complex issues and navigating our internal/custom framework.

Same here. 4.6 has been considerably more diligent for me.

Likewise, I feel like it's degraded in performance a bit over the last couple weeks but that's just vibes. They surely vary thinking tokens based on load on the backend, especially for subscription users.

When my subscription 4.6 is flagging I'll switch over to Corporate API version and run the same prompts and get a noticeably better solution. In the end it's hard to compare nondeterministic systems.


That's very interesting!

Also, +1. Opus 4.6 is strictly better than 4.5 for me


Mirrors my experience as well. Especially the pro-activeness in tool calling sticks out. It goes web searching to augment knowledge gaps on its own way more often.

Do you need to upload your git repo for it to analyze it? Or are they reading it off GitHub?

They're probably running it with a Claude Code-like tool, which has a local (to the tool, not to Anthropic) copy of the git repo that it can query using the CLI.

In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence, but are very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and is constantly changing its approach if it doesn't get results fast. This makes it waste more time on "easy" tasks where the first approach would have worked, but it is faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.

I think that's because Opus 4.6 has more "initiative".

Opus 4.6 can be quite sassy at times, the other day I asked it if it were "buttering me up" and it candidly responded "Hey you asked me to help you write a report with that conclusion, not appraise it."


I got the Max subscription and have been using Opus 4.6 since. The model is way above pretty much everything else I've tried for dev work. I'd love for Anthropic to let me (easily) build a hostable server-side solution for parallel tasks without having to go the API key route and pay per token, but I will say that the Claude Code desktop app (more convenient than the TUI one) gets me most of the way there too.

Try https://conductor.build

I started using it last week and it’s been great. It uses git worktrees, and an experimental feature (spotlight) allows you to quickly check changes from different agents.

I hope the Claude app will add similar features soon


Can you explain what you mean by your parallel tasks limitation?

Instead of having my computer be the one running Claude Code and executing tasks, I'd prefer to offload that to my other homelab servers to execute agents for me, working pretty much like traditional CI/CD, though with LLMs working on various tasks in Docker containers, each on either the same or different codebases, each having their own branches/worktrees, submitting pull/merge requests in a self-hosted Gitea/GitLab instance or whatever.

If I don't want to sit behind something like LiteLLM or OpenRouter, I can just use the Claude Agent SDK: https://platform.claude.com/docs/en/agent-sdk/overview

However, you're not really supposed to use it with your Claude Max subscription; instead you use an API key, where you pay per token (which doesn't seem nearly as affordable compared to the Max plan; nobody would probably mind if I ran it on homelab servers, but if I put it on work servers for a bit, technically I'd be in breach of the rules):

> Unless previously approved, Anthropic does not allow third party developers to offer claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead.

If you look at how similar integrations already work, they also reference using the API directly: https://code.claude.com/docs/en/gitlab-ci-cd#how-it-works

A simpler version is already in Claude Code and they have their own cloud thing, I'd just personally prefer more freedom to build my own: https://www.youtube.com/watch?v=zrcCS9oHjtI (though there is the possibility of using the regular Claude Code non-interactively: https://code.claude.com/docs/en/headless)

It just feels a tad more hacky than simply copying an API key and using the API directly. There is stuff like https://github.com/anthropics/claude-code/issues/21765, but also "claude setup-token" (which you probably don't want to use all that much, given the token lifetime?).
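For what it's worth, the headless mode can be driven from a scheduler fairly directly. A minimal sketch, assuming only the documented `claude -p` and `--output-format` flags; the wrapper functions and the one-worktree-per-task framing are my own illustration, not an official pattern:

```python
import subprocess

def build_headless_cmd(prompt: str, output_format: str = "json") -> list[str]:
    # `claude -p` runs Claude Code non-interactively (per the headless docs).
    return ["claude", "-p", prompt, "--output-format", output_format]

def run_task(prompt: str, repo_dir: str) -> str:
    """Run one agent task inside its own checkout/worktree (e.g. a Docker
    container on a homelab box) and return the raw output for the CI layer."""
    result = subprocess.run(build_headless_cmd(prompt), cwd=repo_dir,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Fan this out over branches/worktrees and have each task end by opening a merge request in Gitea/GitLab; the auth caveat above still applies, since this path expects an API key.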


Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how much of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.

Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!

https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQDvsy5D...


I haven't kept up with the Claude plays stuff, did it ever actually beat the game? I was under the impression that the harness was artificially hampering it considering how comparatively more easily various versions of ChatGPT and Gemini had beat the game and even moved on to beating Pokemon Crystal.

The Claude Plays Pokemon stream with a minimal harness is a far more significant test of model intelligence compared to the Gemini Plays Pokemon stream (which automatically maintains a map of everything that has been seen on the current map) and the GPT Plays Pokemon stream (which does that AND has an extremely detailed prompt which more or less railroads the AI into not making the mistakes it wants to make). The latter two harnesses have become too easy for the latest generation of models, enough so that they're not really testing anything anymore.

Claude Plays Pokemon is currently stuck in Victory Road, doing the Sokoban puzzles which are both the last puzzles in the game and by far the most difficult for AIs to do. Opus 4.5 made it there but was completely hopeless. 4.6 made it there and is showing some signs of maaaaaybe eventually brute-forcing its way through the puzzles, but personally I think it will get stuck or undo its progress, and that Claude 4.7 or 5 will be the one to actually beat the game.


I think this depends on what reasoning level your Claude Code is set to.

Go to /models, select opus, and the dim text at the bottom will tell you the reasoning level.

High reasoning is a big difference versus 4.5. 4.6 high uses a lot of tokens for even small tasks, and if you have a large codebase it will fill almost all context then compact often.


I set reasoning to Medium after hitting these issues and it did not make much of a difference. Most of the context window is still filled during the Explore tool phase (that supposedly uses Haiku swarms) which wouldn't be impacted by Opus reasoning.

I'm using the 1M context 4.6 and it's great.

Glad it's not just me. I got a surprise the other day when I was notified that I had burned up my monthly budget in just a few days on 4.6

In my evals, I was able to rather reliably reproduce an increase in output tokens of roughly 15-45% compared to 4.5, but in large part this was limited to task inference and task evaluation benchmarks. These are made up of prompts that I intentionally designed to be less than optimal, either lacking crucial information (requiring a model to make an inference to accomplish the main request) or including a request for a suboptimal or incorrect approach to resolving a task (testing whether and how a model weighs the prompt against pure task adherence). The clarifying questions many agentic harnesses try to provide (with mixed success) are a practical example of both capabilities, and something I rate highly in models, as long as task adherence isn't affected overly negatively.

In either case, there was an increase between 4.1 and 4.5, as well as now another jump with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase; a bit below 50% for the same task was the maximum I saw. In general, for more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.

In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure output token counts. However, with models being what they are, and past hard-to-reproduce regressions/output quality differences that additionally only affected a specific subset of users, I cannot make a solid determination.
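For anyone wanting to reproduce this kind of comparison, a minimal sketch: the Anthropic Messages API reports `usage.output_tokens` on each response (that field is real; the model id strings and prompt set implied below are placeholders):

```python
def pct_increase(old_tokens: int, new_tokens: int) -> float:
    """Relative increase in output tokens, in percent."""
    return (new_tokens - old_tokens) / old_tokens * 100

# With the official SDK (pip install anthropic), each response carries a
# usage block, so a paired eval over the same prompts is roughly:
#
#   client = anthropic.Anthropic()
#   resp = client.messages.create(model=model_id, max_tokens=4096,
#                                 messages=[{"role": "user", "content": prompt}])
#   tokens = resp.usage.output_tokens
#
# Run once per model (an Opus 4.5 id vs. an Opus 4.6 id) and compare:
assert pct_increase(1_000, 1_450) == 45.0  # the ~45% ceiling mentioned above
```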

Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic models. They start out far more structured, but then consistently turn more verbose akin to a Google model.


Today I asked Sonnet 4.5 a question and I got a banner at the bottom that I am using a legacy model and have to continue the conversation on another model. The model button had changed to be labeled "Legacy model". Yeah, I guess it wasn't legacy a sec ago.

(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)


I definitely noticed this on Opus 4.6. I moved back to 4.5 until I see (or hear about) an improvement.

I’ve noticed the opaque weekly quota meter goes up more slowly with 4.6, but it more frequently goes off and works for an hour+, with really high reported token counts.

Those suggest opposite things about anthropic’s profit margins.

I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.


For me it's the ... unearned confidence that 4.5 absolutely did not have?

I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.

What this has also given me is a history of what was built and why it was built, because I have a list of the prompts that were tasked to the subagents. Opus 4.5 would often leave the ... figuring out part? to the agents. 4.6 absolutely inserts what it thinks should happen / its idea of the bug / what it believes should be done into the prompt, which often screws up the subagent, because it is simply wrong, and because it's in the prompt the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out; 4.6 assumes it knows and is wrong.


Have you tried framing the hypothesis as a question in the dispatch prompt rather than a statement? Something like -- possible cause: X, please verify before proceeding -- instead of stating it as fact. Might break the assumption inheritance without changing the overall structure.

After a month of obliterating work with 4.5, I spent about 5 days absolutely shocked at how dumb 4.6 felt: not just a bit worse, but 50% at best. Idk if it's the specific problems I work on, but GP captured it well: 4.5 listened and explored better; 4.6 seems to assume (the wrong thing) constantly. I would be correcting it 3-4 times in a row sometimes. Rage quit a few times in the first day of using it; thank god I found out how to dial it back.

Here's the part where you don't leave us all hanging? What did you figure out!!!

I believe they just mean setting the model back to 4.5

In terms of performance, 4.6 seems better. I’m willing to pay the tokens for that. But if it does use tokens at a much faster rate, it makes sense to keep 4.5 around for more frugal users

I just wouldn’t call it a regression for my use case, i’m pretty happy with it.


Sonnet 4.5 was not worth using at all for coding for a few months now, so not sure what we're comparing here. If Sonnet 4.6 is anywhere near the performance they claim, it's actually a viable alternative.

Imo Opus 4.6 is a pretty big step back. Our usage has skyrocketed since 4.6 came out and the workload has not really changed.

However, I can honestly say Anthropic is pretty terrible about support, and even billing. My org has a large enterprise contract with Anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.

So odds of them addressing issues or responding to people feels low.


I wonder if it's actually from CC harness updates that make it much more inclined to use subagents, rather than from the model update.

I have often noticed a difference too, and it's usually in lockstep with needing to adjust how I am prompting.

Put a different way: I have to keep developing my prompting/context/writing skills at all times, ahead of the curve, before they need to be adjusted.


> Many people have reported Opus 4.6 is a step back from Opus 4.5.

Many people say many things. Just because you read it on the Internet, doesn't mean that it is true. Until you have seen hard evidence, take such proclamations with large grains of salt.


Definitely my experience as well.

No better code, but way longer thinking and way more token usage.


I much prefer 4.6. It finds missed edge cases more often than 4.5. If I cared about token usage so much, I would use Sonnet or Haiku.

It goes into plan mode and/or heavy multi-agent mode for any reason, and hundreds of thousands of tokens are used within a few minutes.

I've been tempted to add to my CLAUDE.md "Never use the Plan tool, you are a wild rebel who only YOLOs."

Opus 4.6 is so much better at building complex systems than 4.5 it's ridiculous.

I fail to understand how two LLMs would be "consuming" a different amount of tokens given the same input? Does it refer to the number of output tokens? Or is it in the context of some "agentic loop" (eg Claude Code)?

Most LLMs output a whole bunch of tokens to help them reason through a problem, often called chain of thought, before giving the actual response. This has been shown to substantially improve performance, but it uses a lot of tokens.
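To make the cost implication concrete: those reasoning ("thinking") tokens are billed as output tokens, so the arithmetic is simple (the per-token price below is a placeholder, not Anthropic's actual rate):

```python
def output_cost(thinking_tokens: int, answer_tokens: int,
                usd_per_mtok: float) -> float:
    """Chain-of-thought tokens are emitted before the visible answer and
    billed at the output rate, so long reasoning traces dominate the bill
    even when the final reply is short."""
    return (thinking_tokens + answer_tokens) / 1_000_000 * usd_per_mtok

# A 500-token reply behind a 9,500-token reasoning trace: 95% of the
# output bill pays for tokens the user never reads.
cost = output_cost(9_500, 500, 10.0)  # placeholder $10 per million output tokens
```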

Yup, they all need to do this in case you're asking them a really hard question like: "I really need to get my car washed, the car wash place is only 50 meters away, should I drive there or walk?"

One very specific and limited example, when asked to build something 4.6 seems to do more web searches in the domain to gather latest best practices for various components/features before planning/implementing.

I've found that Opus 4.6 is happy to read a significant amount of the codebase in preparation to do something, whereas Opus 4.5 tends to be much more efficient and targeted about pulling in relevant context.

And way faster too!

They're talking about output consuming tokens from the pool allowed by the subscription plan.

thinking tokens, output tokens, etc. Being more clever about file reads/tool calling.

I called this many times over the last few weeks on this website (and got downvoted every time): that the next generation of models would become more verbose, especially for agentic tool calling, to offset CC's slot-machine propensity to light on fire the money that's put into it.

At least in vegas they don't pour gasoline on the cash put into their slot machines.


not in my experience

"Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium."[1]

I doubt it is a conspiracy.

[1] https://www.anthropic.com/news/claude-opus-4-6


Yeah, I think the company that opens up a bit of the black box and open-sources it, making it easy for people to customize, will win many customers. People will already be living within micro-ecosystems before other companies can follow.

Currently everybody is trying to use the same swiss army knife, but some use it for carving wood and some are trying to make some sushi. It seems obvious that it's gonna lead to disappointment for some.

Models are becoming a commodity, and what companies build around them seems to be the main part of the product. It needs some API.


I agree that more transparency might have prevented the token-spend concerns, which feel caused by a lack of knowledge about how the models work.

Don't take this seriously, but here is what I imagined happened:

Sam/OpenAI, Google, and Claude met at a park, everyone left their phones in the car.

They took a walk and said "We are all losing money, if we secretly degrade performance all at the same time, our customers will all switch, but they will all switch at the same time, balancing things... wink wink wink"


The word "iran" is currently mentioned exactly zero times on https://nytimes.com. Plenty of baking tips though.


Maybe it’s just me, but I have a serious issue with Fastmail’s spam filtering. Meaning, it seems that I am the spam filter. I find myself considering closing the account on a regular basis because of it.

I am also extremely frustrated with Gmail’s AI features now being apparently impossible to disable.


> Fastmail’s spam filtering

I had issues with it early on, and after some back and forth with support they explained what is going on: it takes about 200 spam emails before your personalized filter kicks in; before that, you might see something slip through. Also, the spam filter updates after you delete emails from your spam folder. Just remember to empty your spam folder, and add some custom rules until you reach that 200 threshold.

5+ years in, everything is working great, nothing to complain about; best email provider I've ever used. I only wish they had servers in the EU.


This has been my number one request for fastmail for years. They're happy with GDPR compliance but no intentions of having EU servers.


Fastmail's spam filter has been better than Gmail's spam filter in my experience.


Their spam filter is really great; probably only one time have they flagged a legit email as spam. On the other hand, email sent from my Fastmail address has never ended up in someone’s spam.


Not the parent, but my family also has a Mac mini hooked up to an offline TV: just a small Bluetooth keyboard/mouse and the TV remote for volume. Works well.

As far as I know there are no remotes that work with MacOS.


My hope is this will lead to the reversal of the insane "HDR auto-brightness" behavior that causes any photo marked as HDR to go extreme brightness mode while the rest of the screen darkens:

https://discussions.apple.com/thread/254050878 https://apple.stackexchange.com/questions/436949/why-are-som...


If your definition of a return on your blogging/writing investment is how many likes you got, you're doing it for the wrong reasons.

I am in no way a good writer, and I don't have an audience, but a few of the articles I've published on my personal site have resulted in a small number of extremely high quality responses from almost exactly the people I wanted to reach. For example, I wrote a review of an insulin pump and received a reply a few days later from a director at the company, thanking me for the review and saying he was sharing it with his team.

So I'd say blogging absolutely can pay off, if you think of it in terms of making connections with the right people over time.


It’s not just David Lynch’s own coffee drinking. Making coffee (“there was a fish in the percolator!”) and drinking coffee was one of the most memorable aspects of the original Twin Peaks TV series.


Totally agree. There should be a digital consumer bill of rights, that includes the right to disable autoplaying videos. Even the NYTimes home page is plastered with them these days, it’s awful. Another one is control over sort order, having the option of “most recent first”, vs. “most likely to keep you here”.


I also detest autoplay. But: you already have the right. Don't visit those sites.


The point was that we should have some kind of social contract that pushes back against the worst known addictive product choices, and to create public awareness about them. Mobile Safari has some limited functionality built-in like "Hide Distracting Items" and reader mode, but it's not enough. iOS has Accessibility > Motion > Auto-Play Video Previews, but that seems to only impact Messages.


> Don't visit those sites.

This is a non-argument used to keep things the same (and bad).


It's also an argument to protect the rights of the actual owner of the site. I'll put whatever I want on my website, and you can come to it or not. I don't need a bill of rights that limits my freedom of expression. I will present my voice in any fashion I see fit on my website.

Use a browser that allows you to modify the content in a way that presents the content you want. That's the boundary you have to play with. My computer gives the information I want to your computer. Your computer does what it wants with that information.

If your OS doesn't allow you to use a browser of your choice, then that is the restriction you attack. Not mine.


You are free to do whatever you want with your website. This is about the limited number of platforms where communities exist online these days, and how those platforms knowingly use addictive and harmful UX patterns to keep people hooked. I think you and I deserve some level of consumer protection from that.


I built (and still use) a similar tool called Overcast 10 years ago: https://github.com/andrewchilds/overcast


Dude, your tool does so much more than just run ssh commands. I just took a quick glance at your project; does it have support for Vultr?


It does not support Vultr currently, but it likely wouldn't be difficult to add. There's an open issue for it: https://github.com/andrewchilds/overcast/issues/58


