Looks like solid incremental improvements. The UI one-shot demos are a big improvement over 4.6. Open models continue to lag roughly a year behind on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active - so it's a tough one to self-host. It's a good candidate for a Cerebras endpoint in my mind - getting Sonnet 4.x (x<5) quality with ultra-low latency seems appealing.
I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have per-minute rate limits, and cached tokens count against them, so you get rate-limited in the first few seconds of every minute and then have to wait out the rest of it. So they're "fast" at 1000 tok/sec - but not really for practical usage. With the rate limits and the penalty for cached tokens, you effectively get <50 tok/sec.
They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task - it would've cost <$0.50 with GPT-5.2-Codex or any other model that supports caching (besides Opus and maybe Sonnet), and it would've been much faster.
The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count in full, so with 100,000 tokens of context you can burn through a minute's budget in just a few requests.
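To put rough numbers on that (a back-of-envelope sketch; the context size and per-turn token counts below are made-up assumptions, not Cerebras figures):

    // How many agent turns fit in a 1M tokens/minute budget when the cached
    // context is counted in full on every request.
    const budgetPerMinute  = 1_000_000; // tokens/min on the $50/mo plan (per above)
    const contextTokens    = 100_000;   // hypothetical conversation context resent each turn
    const newTokensPerTurn = 5_000;     // hypothetical fresh prompt + completion per turn

    const tokensPerTurn  = contextTokens + newTokensPerTurn;
    const turnsPerMinute = Math.floor(budgetPerMinute / tokensPerTurn);
    console.log(`~${turnsPerMinute} turns/minute before hitting the limit`);
    // ~9 turns/minute: the 1000 tok/s headline speed stops mattering once
    // you're waiting for the window to reset.

So even on the bigger plan, a long-context agent session chews through the per-minute allowance almost immediately.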
Not really worth it, in general; it does reduce latency a little. In practice you have a continuing context anyway, though, so you end up using it whether you care or not.
In general, per-minute rate limiting limits load spikes, and load spikes are what you pay for: they force you to ramp up your capacity, and you are usually then slow to ramp down to avoid paying the ramp-up cost too many times. A VM might boot relatively fast, but loading a large model into GPU memory takes time.
I use GLM 4.7 with DeepInfra.com and it's extremely reasonable, though maybe a bit on the slower side. But faster than DeepSeek 3.2 and about the same quality.
It's even cheaper to just use it through z.ai themselves I think.
I know this might not be the most effective approach, but I ended up using the "try AI" feature on Cerebras, which opens a chat window in the browser.
Yes, it has some restrictions too, but it still works for free. I have a private repository where I set up a Puppeteer instance so I can type something into a CLI and get the output back in the CLI as well.
With current agents, I don't see why I couldn't just extend that with a cheap model (I think MiniMax 2.1 is pretty good for agents) and have the agent write the files and do everything in a loop.
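The bridge itself is roughly this (a minimal sketch; the URL and CSS selectors are placeholders, not the actual Cerebras page structure):

    // Sketch of a Puppeteer "CLI <-> browser chat" bridge like the one described above.
    // CHAT_URL, INPUT_SEL and REPLY_SEL are placeholders, NOT the real page structure.
    import puppeteer from 'puppeteer';
    import readline from 'node:readline/promises';

    const CHAT_URL  = 'https://example.com/try-ai';   // placeholder
    const INPUT_SEL = 'textarea';                     // placeholder selector
    const REPLY_SEL = '.assistant-message';           // placeholder selector

    async function main() {
      const browser = await puppeteer.launch({ headless: false });
      const page = await browser.newPage();
      await page.goto(CHAT_URL);

      const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
      while (true) {
        const prompt = await rl.question('> ');
        if (!prompt) break;

        const before = (await page.$$(REPLY_SEL)).length;
        await page.type(INPUT_SEL, prompt);
        await page.keyboard.press('Enter');

        // Wait until a new assistant message appears, then print the latest one.
        await page.waitForFunction(
          (sel, n) => document.querySelectorAll(sel).length > n,
          { timeout: 120_000 }, REPLY_SEL, before
        );
        const replies = await page.$$eval(REPLY_SEL, els => els.map(e => e.textContent ?? ''));
        console.log(replies[replies.length - 1]);
      }
      await browser.close();
      rl.close();
    }

    main();

From there the agent loop is just: feed the CLI output to the agent, let it decide the next input, repeat.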
I think the repository might have gotten deleted when I reset my old system, but I can look for it if this interests you.
Cerebras is such a good company. I talked to their CEO on Discord once and have been following them for over a year or two now. I hope they don't get enshittified by the recent OpenAI deal, and that they improve their developer experience, because people want to pay them - but instead I had to resort to this free workaround. (Also, I was mostly curious how Puppeteer works and whether the idea was even possible; I didn't really use it much after building it.)
I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.
Due to US foreign policy, I quit Claude yesterday and picked up MiniMax M2.1. We wrote a whole design spec for a project I'd previously written a spec for with Claude (with some changes to the architecture this time - adjacent, not the same).
My gut feel? I prefer MiniMax M2.1 with opencode to Claude. Easiest boycott ever.
(I even picked the $10 plan; it was fine for now.)
FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.
They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
DeepSeek 3.2 scores gold at the IMO and other competitions. Google had to use parallel reasoning to do that with Gemini, and the public version still only achieves silver.
I wasn't judging, I was asking how it works. Why would OpenAI/Anthropic/Google let a competitor scrape their outputs in sufficient quantities to train their own model?
I think the point is that they can't really stop it. Let's say I purchase API credits and then resell access to DeepSeek.
That's going to be pretty hard for OpenAI to figure out, and even if they do figure it out and stop me, there will be thousands of other companies willing to do that arbitrage. (Just for the record, I'm not doing this, but I'm sure people are.)
They would need to be very restrictive about who is and isn't allowed to use the API, and that would kill their growth, because customers would just go to Google or another provider that is less restrictive.
Speculation, I think, because for one, those supposed proxy providers would have to offer some kind of pricing advantage over the original provider. Maybe I missed them, but where are the X0%-cheaper SOTA model proxies?
Number two, I'm not sure that random samples collected across even a moderately large number of users make a great base of training examples for distillation. I would expect they need more focused samples in very specific areas to achieve good results.
Thanks. In that case my conclusion is that all the people saying these models are "distilling SOTA models" are, by extension, also speculating. How can you distill what you don't have?
The only way I can think of is paying to synthesize training data using SOTA models yourself. But yeah, I'm not aware of anyone publicly sharing that they did, so it's also speculation.
The economics probably work out, though: collecting, cleaning, and preparing original datasets is very cumbersome.
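If someone did do it, the mechanics would be mundane: loop over a prompt set, call the teacher model's API, and save prompt/response pairs as fine-tuning examples. A minimal sketch, assuming an OpenAI-compatible endpoint (the URL, model name, and file paths are placeholders):

    // Sketch of "distillation by paying for synthetic data": query a teacher model's
    // API over a prompt set and store prompt/response pairs as training examples.
    // API_URL, MODEL and the file paths are placeholders.
    import { readFileSync, appendFileSync } from 'node:fs';

    const API_URL = 'https://api.example.com/v1/chat/completions'; // placeholder
    const API_KEY = process.env.TEACHER_API_KEY ?? '';
    const MODEL   = 'teacher-model';                                // placeholder

    async function distill() {
      const prompts = readFileSync('prompts.txt', 'utf8').split('\n').filter(Boolean);
      for (const prompt of prompts) {
        const res = await fetch(API_URL, {
          method: 'POST',
          headers: { 'Authorization': `Bearer ${API_KEY}`, 'Content-Type': 'application/json' },
          body: JSON.stringify({ model: MODEL, messages: [{ role: 'user', content: prompt }] }),
        });
        const data = await res.json();
        const answer = data.choices[0].message.content;
        // One JSONL line per example, in the usual SFT "messages" format.
        appendFileSync('distill.jsonl', JSON.stringify({
          messages: [{ role: 'user', content: prompt }, { role: 'assistant', content: answer }],
        }) + '\n');
      }
    }

    distill();

The hard (and expensive) part is curating the prompt set, not the collection loop.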
What we do know for sure is that the SOTA providers distill their own models; I remember reading about this at least for Gemini (Flash is distilled) and Meta.
Note that this is the Flash variant, which is only 31B parameters in total.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive for something you can realistically run at home, if it translates to real-world usage.