Looks like solid incremental improvements. The UI one-shot demos are a big improvement over 4.6. Open models continue to lag roughly a year behind on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active - so it's a tough one to self-host. It's a good candidate for a Cerebras endpoint in my mind - getting Sonnet 4.x (x<5) quality with ultra-low latency seems appealing.
I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have per-minute rate limits, and cached tokens count against them, so you get rate-limited in the first few seconds of every minute and then have to wait out the rest of it. So they're "fast" at 1000 tok/sec - but not really for practical usage. With the rate limits and the penalty for cached tokens, you effectively get <50 tok/sec.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task - it would've cost <$0.50 with GPT-5.2-Codex or any other model that supports caching (besides Opus and maybe Sonnet), and it would've been much faster.
The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count in full, so with 100,000 tokens of context you can burn through a minute's budget in just a few requests.
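To put rough numbers on that (a back-of-envelope sketch; the context size and per-turn token counts below are made-up assumptions, not Cerebras figures):

    // How many agent turns fit in a 1M tokens/minute budget when the cached
    // context is counted in full on every request.
    const budgetPerMinute  = 1_000_000; // tokens/min on the $50/mo plan (per above)
    const contextTokens    = 100_000;   // hypothetical conversation context resent each turn
    const newTokensPerTurn = 5_000;     // hypothetical fresh prompt + completion per turn

    const tokensPerTurn  = contextTokens + newTokensPerTurn;
    const turnsPerMinute = Math.floor(budgetPerMinute / tokensPerTurn);
    console.log(`~${turnsPerMinute} turns/minute before hitting the limit`);
    // ~9 turns/minute: the 1000 tok/s headline speed stops mattering once
    // you're waiting for the window to reset.

So even on the bigger plan, a long-context agent session chews through the per-minute allowance almost immediately.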
Not really worth it, in general; it does reduce latency a little. In practice you have a continuing context anyway, though, so you end up using it whether you care or not.
In general, per-minute rate limiting limits load spikes, and load spikes are what you pay for: they force you to ramp up your capacity, and you are usually then slow to ramp down to avoid paying the ramp-up cost too many times. A VM might boot relatively fast, but loading a large model into GPU memory takes time.
I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.
It's even cheaper to just use it through z.ai themselves I think.
I know this might not be the most effective approach, but I ended up using the "try AI" feature on Cerebras, which opens a chat window in the browser.
Yes, it has some restrictions too, but it still works for free. I have a private repository where I set up a Puppeteer instance so I can type something into a CLI and get the output back in the CLI as well.
With current agents, I don't see why I couldn't just extend that with a cheap model (I think MiniMax 2.1 is pretty good for agents) and have the agent write the files and do everything in a loop.
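The bridge itself is roughly this (a minimal sketch; the URL and CSS selectors are placeholders, not the actual Cerebras page structure):

    // Sketch of a Puppeteer "CLI <-> browser chat" bridge like the one described above.
    // CHAT_URL, INPUT_SEL and REPLY_SEL are placeholders, NOT the real page structure.
    import puppeteer from 'puppeteer';
    import readline from 'node:readline/promises';

    const CHAT_URL  = 'https://example.com/try-ai';   // placeholder
    const INPUT_SEL = 'textarea';                     // placeholder selector
    const REPLY_SEL = '.assistant-message';           // placeholder selector

    async function main() {
      const browser = await puppeteer.launch({ headless: false });
      const page = await browser.newPage();
      await page.goto(CHAT_URL);

      const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
      while (true) {
        const prompt = await rl.question('> ');
        if (!prompt) break;

        const before = (await page.$$(REPLY_SEL)).length;
        await page.type(INPUT_SEL, prompt);
        await page.keyboard.press('Enter');

        // Wait until a new assistant message appears, then print the latest one.
        await page.waitForFunction(
          (sel, n) => document.querySelectorAll(sel).length > n,
          { timeout: 120_000 }, REPLY_SEL, before
        );
        const replies = await page.$$eval(REPLY_SEL, els => els.map(e => e.textContent ?? ''));
        console.log(replies[replies.length - 1]);
      }
      await browser.close();
      rl.close();
    }

    main();

From there the agent loop is just: feed the CLI output to the agent, let it decide the next input, repeat.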
I think the repository might have gotten deleted when I reset my old system, but I can look for it if this interests you.
Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for over a year or two now. I hope they don't get enshittified by the recent OpenAI deal, and that they improve their developer experience, because people want to pay them - but instead I had to resort to this free workaround. (Also, I was mostly curious how Puppeteer works and whether the idea was even possible; I didn't really use it much after building it.)
I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.
Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I'd previously written a spec for with Claude (with some changes to the architecture this time - adjacent, not the same).
My gut feel? I prefer MiniMax M2.1 with opencode to Claude. Easiest boycott ever.
(I even picked the $10 plan; it was fine for now.)
FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.
They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
DeepSeek 3.2 scores gold at the IMO and other competitions. Google had to use parallel reasoning to do that with Gemini, and the public version still only achieves silver.
I wasn't judging, I was asking how it works. Why would OpenAI/Anthropic/Google let a competitor scrape their outputs in sufficient quantities to train their own model?
I think the point is that they can't really stop it. Let's say I purchase API credits and then resell access to DeepSeek.
That's going to be pretty hard for OpenAI to figure out, and even if they do figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)
They would need to be very restrictive about who is and isn't allowed to use the API, and that would kill their growth, because customers would just go to Google or another provider that is less restrictive.
Speculation, I think, because for one, those supposed proxy providers would have to offer some kind of pricing advantage over the original provider. Maybe I missed them, but where are the X0%-cheaper SOTA model proxies?
Number two, I'm not sure that random samples collected across even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples in very specific areas to achieve good results.
Thanks. In that case my conclusion is that all the people saying these models are "distilling SOTA models" are, by extension, also speculating. How can you distill what you don't have?
The only way I can think of is paying to synthesize training data using SOTA models yourself. But yeah, I'm not aware of anyone publicly sharing that they did, so it's also speculation.
The economics probably work out, though: collecting, cleaning, and preparing original datasets is very cumbersome.
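If someone did do it, the mechanics would be mundane: loop over a prompt set, call the teacher model's API, and save prompt/response pairs as fine-tuning examples. A minimal sketch, assuming an OpenAI-compatible endpoint (the URL, model name, and file paths are placeholders):

    // Sketch of "distillation by paying for synthetic data": query a teacher model's
    // API over a prompt set and store prompt/response pairs as training examples.
    // API_URL, MODEL and the file paths are placeholders.
    import { readFileSync, appendFileSync } from 'node:fs';

    const API_URL = 'https://api.example.com/v1/chat/completions'; // placeholder
    const API_KEY = process.env.TEACHER_API_KEY ?? '';
    const MODEL   = 'teacher-model';                                // placeholder

    async function distill() {
      const prompts = readFileSync('prompts.txt', 'utf8').split('\n').filter(Boolean);
      for (const prompt of prompts) {
        const res = await fetch(API_URL, {
          method: 'POST',
          headers: { 'Authorization': `Bearer ${API_KEY}`, 'Content-Type': 'application/json' },
          body: JSON.stringify({ model: MODEL, messages: [{ role: 'user', content: prompt }] }),
        });
        const data = await res.json();
        const answer = data.choices[0].message.content;
        // One JSONL line per example, in the usual SFT "messages" format.
        appendFileSync('distill.jsonl', JSON.stringify({
          messages: [{ role: 'user', content: prompt }, { role: 'assistant', content: answer }],
        }) + '\n');
      }
    }

    distill();

The hard (and expensive) part is curating the prompt set, not the collection loop.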
What we do know for sure is that the SOTA providers distill their own models; I remember reading about this at least for Gemini (Flash is distilled) and Meta.
Note that this is the Flash variant, which is only 31B parameters in total.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive for something you can realistically run at home, if it translates to real-world usage.