Hacker News | d4rkp4ttern's comments

You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:

https://pchalasani.github.io/claude-code-tools/integrations/...

The 26B-A4B variant is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5-35B-A3B. However, the tau2-bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don't expect the former to do well on heavy, tool-calling agentic tasks:

[1] https://news.ycombinator.com/item?id=47616761


Did you have any Anthropic-vs-OpenAI API specification issues with Claude Code? I have been using mlx_vlm and vMLX, and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server?

Correct, no issues: for at least a few months now, llama.cpp's server has exposed an Anthropic Messages API at v1/messages, in addition to the OpenAI-compatible API at v1/chat/completions. Claude Code uses the former.
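For anyone wanting to sanity-check the endpoint before pointing Claude Code at it, here's a minimal sketch of an Anthropic-style Messages request against a local llama-server. It assumes the server is on its default port 8080; the model name is largely ignored by a single-model local server:

```python
import json
import urllib.request

# Assumption: llama-server running locally on its default port.
BASE = "http://localhost:8080"

def anthropic_request(prompt: str, model: str = "local", max_tokens: int = 512):
    """Build a POST against the Anthropic-style /v1/messages endpoint.
    Unlike chat/completions, the Messages API requires max_tokens."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE}/v1/messages",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )

req = anthropic_request("hello")
# To actually send it: urllib.request.urlopen(req)
```

If this returns a valid response but Claude Code still 400s, the problem is likely in the server's Messages-API translation layer rather than in Claude Code itself.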

I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.

At least for Gemma4-26B-A4B, token-gen speed with oMLX is far worse than llama-server on my M1 Max 64GB MacBook:

  Quick benchmark on M1 Max 64GB, Gemma 4 26B-A4B (MoE), comparing matched dynamic 4-bit quants. Workload
  was Claude Code, which sends ~35K tokens of input context per request (system prompt + tools + user
  message):

  llama.cpp (unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, llama-server -fa on -c 131072 --jinja --temp 1.0
  --top-p 0.95 --top-k 64):
  - pp ≈ 395 tok/s
  - tg ≈ 40 tok/s

  oMLX (unsloth/gemma-4-26b-a4b-it-UD-MLX-4bit, omlx serve --model-dir ~/models/omlx, with
  sampling.max_context_window and max_tokens bumped to 131072 in ~/.omlx/settings.json):
  - pp ≈ 350 tok/s
  - tg ≈ 5–13 tok/s

  Same model family and quant tier. Prompt processing is comparable, but oMLX's token generation is 3–7x
  slower than llama.cpp's Metal backend. Counter-intuitive given MLX is Apple's native ML framework.

Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max with 128GB is the sweet spot for me locally. Prompt-decode caching keeps things coherent and fast even when contexts get north of 100K tokens.

Have you been using `omlx serve`? If so, how are you bumping up the max context size? I'm not seeing a param to go above 32k?

You can set it in ~/.omlx/settings.json. Ask a code-agent to figure out the exact keys by pointing it at the oMLX repo.
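Based on the key names mentioned earlier in the thread (sampling.max_context_window and max_tokens in ~/.omlx/settings.json), here is a sketch of bumping the limits programmatically. The exact oMLX settings schema is an assumption, so verify it against the repo:

```python
import json
from pathlib import Path

def bump_context(settings_path: str, new_limit: int = 131072) -> dict:
    """Raise oMLX context limits in settings.json.
    Key names follow the thread above; the real schema may differ."""
    path = Path(settings_path).expanduser()
    settings = json.loads(path.read_text()) if path.exists() else {}
    sampling = settings.setdefault("sampling", {})
    sampling["max_context_window"] = new_limit
    sampling["max_tokens"] = new_limit
    path.write_text(json.dumps(settings, indent=2))
    return settings

# e.g. bump_context("~/.omlx/settings.json")
```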

For token-generation speed, a challenging test is to see how a model performs in a code-agent harness like Claude Code, which carries anywhere between 15K and 40K tokens from the system prompt alone (+ tools/skills etc.).

Here the 26B-A4B variant is head and shoulders above recent open-weight models, at least on my trusty M1 Max 64GB MacBook.

I set up Claude Code to use this variant via llama-server, with 37K tokens initial context, and it performs very well: ~40 tokens/sec, far better than Qwen3.5-35B-A3B, though I don't know yet about the intelligence or tool-calling consistency. Prompt processing speed is comparable to the Qwen variant at ~400 tok/s.

My informal tests, all with roughly 30K-37K tokens initial context:

    ┌────────────────────┬───────────────┬────────────┐
    │       Model        │ Active Params │ tg (tok/s) │
    ├────────────────────┼───────────────┼────────────┤
    │ Gemma-4-26B-A4B    │ 4B            │ ~40        │
    ├────────────────────┼───────────────┼────────────┤
    │ GPT-OSS-20B        │ 3.6B          │ ~17-38     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-30B-A3B      │ 3B            │ ~15-27     │
    ├────────────────────┼───────────────┼────────────┤
    │ GLM-4.7-Flash      │ 3B            │ ~12-13     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3.5-35B-A3B    │ 3B            │ ~12        │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-Next-80B-A3B │ 3B            │ ~3-5       │
    └────────────────────┴───────────────┴────────────┘

Full instructions for running this and other open-weight models with Claude Code are here:

https://pchalasani.github.io/claude-code-tools/integrations/...


gpt oss 20b is not dense

Thanks, fixed

For me, one of the most interesting aspects is how compaction works. It turns out compaction still preserves the full original pre-compaction conversation in the session JSONL file, with those messages marked as "not to be sent to the API". This means that even after compaction, if you think something was lost, you can tell CC to "look in the session log files to find details about what we did with XYZ". I knew this before the leak, since it can be seen in the session logs. Some more details:

  The full conversation is preserved in the JSONL file, and messages
  are filtered before being sent to the API.

  Key mechanisms:

  1. JSONL is append-only — old pre-compaction messages are never deleted. New messages (boundary
  marker, summary, attachments) are appended after compaction.
  2. Messages have flags controlling API visibility:
    - isCompactSummary: true — marks the AI-generated summary message
    - isVisibleInTranscriptOnly: true — prevents a message from being sent to the API
    - isMeta — another filter for non-API messages
    - getMessagesAfterCompactBoundary() returns only post-compaction messages for API calls
  3. After compaction, the API sees only:
    - The compact boundary marker
    - The summary message
    - Attachments (file refs, plan, skills)
    - Any new messages after compaction
  4. Three compaction types exist:
    - Full compaction — API summarizes all old messages
    - Session memory compaction — uses extracted session memory as summary (cheaper)
    - Microcompaction — clears old tool result content when cache is cold (>1h idle)

What is microcompaction? I didn't realize there was anything time-based in CC. If I go eat dinner and come back, will it have compacted while I was gone?

I dug into this more. It's disabled by default, and it's a cost/token-usage optimization.

  The logic is:

  1. Anthropic's API has a server-side prompt cache with a 1-hour TTL
  2. When you're actively using a session, each API call reuses the cached prefix — you only pay
  for new tokens
  3. After 1 hour idle, that cache is guaranteed expired
  4. Your next message will re-send and re-process the entire conversation from scratch — every
  token, full price
  5. So if you have 150K tokens of old Grep/Read/Bash outputs sitting in the conversation, you're
  paying to re-ingest all of that even though it's stale context the model probably doesn't need

  The microcompact says: "since we're paying full price anyway, let's shrink the bill by clearing
  the bulky stuff."

  What's preserved vs lost:
  - The tool_use blocks (what tool was called, with what arguments) — kept
  - The tool_result content (the actual output) — replaced with [Old tool result content cleared]
  - The most recent 5 tool results — kept

  So Claude can still see "I ran Grep for foo in src/" but not the 500-line grep output from 2
  hours ago.

  Does it affect quality? Yes, somewhat — but the tradeoff is that without it, you're paying
  potentially tens of thousands of tokens to re-ingest stale tool outputs that the model already
  acted on. And remember, if the conversation is long enough, full compaction would have summarized
   those messages anyway.

  And critically: this is disabled by default (enabled: false in timeBasedMCConfig.ts:31). It's
  behind a GrowthBook feature flag that Anthropic controls server-side. So unless they've flipped
  it on for your account, it's not happening to you.
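The "clear all but the most recent 5 tool results" behavior described above can be sketched like this. The message shape is a simplified stand-in (the real messages are Anthropic-format content blocks), but the replacement string matches the one quoted above:

```python
CLEARED = "[Old tool result content cleared]"

def microcompact(messages, keep_last=5):
    """Replace the content of all but the most recent `keep_last`
    tool results; tool_use blocks (what was called, with what args)
    are untouched. Returns a new list, leaving the input as-is."""
    idxs = [i for i, m in enumerate(messages) if m.get("type") == "tool_result"]
    stale = idxs[:-keep_last] if keep_last else idxs
    out = [dict(m) for m in messages]
    for i in stale:
        out[i]["content"] = CLEARED
    return out

msgs = [{"type": "tool_result", "content": f"output {i}"} for i in range(8)]
compacted = microcompact(msgs)
# The first 3 results are cleared; the last 5 keep their content.
```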

[flagged]


> it's basically a cost optimization masquerading as a feature

Cost optimization in the user's favor.

Remember that every time you send a new message, you are actually re-sending the entire conversation, with that last message appended, to the LLM.

Remember that LLMs are fixed functions, the only variable is the context input (and temperature, sure).

Naively, this would lead to quadratic consumption of your token quota, which would get ridiculously expensive as conversations stretch into current 100k-1M context windows.

To solve this, AI providers cache the context on the GPU, and only charge you for the delta in the conversation/context. But they're not going to keep that GPU cache warm for you forever, so it'll time out after some inactivity.

So the microcompaction-on-idle happens to soften the token consumption blow after you've stepped away for lunch, your context cache has been flushed by the AI provider, and you basically have to spend tokens to restart your conversation from scratch.
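The quadratic-vs-linear difference is easy to see with rough, purely illustrative numbers (100 turns of ~1K tokens each):

```python
def total_tokens_processed(turns, tokens_per_turn, cached=True):
    """Input tokens the provider must process over a conversation.
    With a warm prompt cache you pay only the per-turn delta; cold,
    turn k re-processes all k*tokens_per_turn tokens of context,
    so total cost grows quadratically in the number of turns."""
    if cached:
        return turns * tokens_per_turn
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

warm = total_tokens_processed(100, 1_000, cached=True)   # 100,000 tokens
cold = total_tokens_processed(100, 1_000, cached=False)  # 5,050,000 tokens
```

A ~50x difference at 100 turns, which is why letting the cache expire over lunch is expensive, and why clearing stale tool outputs before re-ingesting everything makes sense.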


that frustration regex is missing "idiot", which is the most common frustration word I use with code-agents

Another option I often use is to ask the code-agent to make a diagram in TikZ (as a .tex file), which can then be converted to PDF/PNG.

But in general, AI diagramming is still unsolved; it takes several iterations to get rid of wonky/wrong arrows, misplaced boxes, misplaced text, etc.


I've always liked umlet and umletino (web version) for a nice mix of drag and drop and edit by text editor. In the absence of good enough layout algorithms, the ability to manually drag things to the right place is kind of essential. The resulting diagrams are not so pretty of course.

I have tried a lot of tools in this space. If a diagram comes out looking alright, that's usually because it was so simple that it didn't actually need a diagram. Anything with a bit of non-trivial structure seems to quickly escalate, with essentially no good options other than esoteric styling hacks to make it look any good.

This seems to be a space where, of pretty automated layouts, complex diagrams, and correct diagrams, you can only have two out of three.

Which means that almost 100% of my use cases for these tools never really work for me unless I sit down and grab some old school drawing tool (or just give up on the whole notion, which is much more likely). If it was trivial, I wouldn't bother making a diagram. These tools seem only usable for stuff where diagrams were overkill to begin with. I saw no examples on the linked article (and the rest of the site; I browsed the top few recent articles) to really counter this.


Agree. For what it’s worth, in interviews Cherny (Claude Code creator) and Steinberger (OpenClaw creator) say they keep things simple and use none of the workflow frameworks. The latter said he doesn’t even use plan mode, but I find that very useful: exiting plan mode starts clean with compressed context.

They backed out the “clear context and execute plan” thing recently. It’s a bummer, I thought it was great.

Maybe they figured it wasn't needed with 1M context?

Anecdotal evidence says that the 1M context one still gets stupid around 200-300k tokens.

Context still matters and I'll never stop implementing things in small slices instead of trying to one-shot.


Thanks - I use worktrees and direnv but never thought about this .envrc trick to auto-share .env across worktrees.


But how did you use your main worktree's .env before? Symlink it?


I see lots of discussion about humans no longer writing code, but the elephant in the room is the rapid extinction of human review of AI-made code. I expect this will lead to a massive hangover. In the meantime, we try to mitigate it by ensuring the structure of the code remains AI-friendly. I also expect new types of tools to emerge that will help with this “cognitive debt”.


My impression is that people who think that LLMs will completely replace reviewing or writing code have never really worked on anything safety-critical. I'm not looking forward to the next wave of pacemaker glitches.


You act like we live in a world where companies are held sufficiently liable.


I'm wrong for that. But whenever I suggest solutions for that, police officers visit me!


That is why I do not use the multi-agent team technique. My code-generation skills have atrophied, but my code-review skills have only gotten stronger, for both human and AI code. If I handed over both, it would hurt my employability and would definitely lead to that hangover.


For every interesting new open model, I try to test PP (prompt processing) and TG (token gen) speeds via llama.cpp's server in Claude Code (which has at least 15-30K tokens of context due to the system prompt, tools, etc.), on my good old M1 Max 64GB MacBook.

With the latest llama.cpp built from source and the latest unsloth quants, the TG speed of Qwen3.5-35B-A3B is around half that of Qwen3-30B-A3B (with 33K tokens of initial Claude Code context), so the older Qwen3 is much more usable.

Qwen3-30B-A3B (Q4_K_M):

  - PP: 272 tok/s | TG: 25 tok/s @ 33k depth

  - KV cache: f16

  - Cache reuse: follow-up delta processed in 0.4s

Qwen3.5-35B-A3B (Q4_K_M):

  - PP: 395 tok/s | TG: 12 tok/s @ 33k depth

  - KV cache: q8_0

  - Cache reuse: follow-up delta processed in 2.7s (requires --swa-full)

Qwen3.5's sliding window attention uses significantly less RAM and delivers better response quality, but at 33k context depth it generates at half the tok/s of the standard-attention Qwen3-30B.

Full llama-server and Claude-Code setup details here for these and other open LLMs:

https://pchalasani.github.io/claude-code-tools/integrations/...


I definitely get the impression there's something not quite right with qwen3.5 in llama.cpp. It's impressive but just a bit off. A patch landed yesterday which helped though.


Which patch are you referring to?


I had exactly this problem and didn’t see anything good out there (claude --resume only searches session names and auto-created titles), so I got a tool built that uses a Rust/Tantivy full-text search index. It’s part of the aichat command suite, called “aichat search”:

https://pchalasani.github.io/claude-code-tools/tools/aichat/...

It brings up a nice TUI for filtering and further actions. There’s also a --json flag so agents can use it as a CLI search tool to find context about any past work. A companion plugin provides a session-searcher agent that knows to use this tool to search sessions.

I have hundreds/thousands of past sessions and this has been a life saver; I can just ask the main agent, “use the session searcher agent to get the details of how we built the tmux-cli tool so we can add some features”.
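For a sense of what such a search does under the hood, here is a bare-bones, index-free stand-in (not the actual Tantivy-backed tool) that scans session .jsonl files for a query. The session directory path would be wherever Claude Code stores its logs, which is an assumption here:

```python
import json
from pathlib import Path

def search_sessions(root, query):
    """Naive case-insensitive full-text scan over session .jsonl files.
    Returns (file, line_number) hits. A real tool (like the Tantivy-based
    one above) would build an index instead of re-scanning every file."""
    q = query.lower()
    hits = []
    for f in Path(root).rglob("*.jsonl"):
        for n, line in enumerate(f.read_text().splitlines(), 1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines rather than crash
            if q in json.dumps(rec).lower():
                hits.append((str(f), n))
    return hits
```

With hundreds or thousands of sessions, the linear scan above gets slow, which is exactly the motivation for a proper full-text index.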

