Oh funny, I forgot about that. But at the time it didn't seem unreasonable to withhold a model that could so easily write fake news articles. I'm not so sure it wasn't.
Either 3 or 3.5 also had "it tricked our researchers and tried to escape, so we had to dumb it down before release" marketing. What actually happened was that someone asked it something like "how would you escape detection?" and it replied telling the person exactly that, basically just regurgitating a standard sci-fi plot.
If a different architecture from LLMs is invented (one that could actually "think", and could potentially reach AGI), then perhaps it would be more efficient than LLMs. Perhaps LLMs can make themselves more efficient. They can't even remember "properly". Hallucinations cripple them for serious, professional uses: if they may hallucinate 5% of the time and you are asking mission-critical queries, that's a problem.
Perhaps all of these data centers won't be needed. At least not by some of the current AI companies that won't keep up. If that happens to OpenAI, that would be quite a shock to the financial system (and GDP).
Microsoft's changes to Windows have alienated some of their userbase. Copilot is poor compared to its rivals. There's a reason they are down 20%. Linux adoption is accelerating (still too low, though!).
And don't forget on-device AI. When it becomes "good enough" for most tasks, data centre use will fall.
With the talk of Nvidia backtracking and saying they won't invest $100 billion in OpenAI, and Oracle in a poor financial position, with the loans for its upcoming data centres becoming more expensive and dubious (it could fail to repay them), the picture isn't as positive as you make it out to be. Which makes me think that you have an ulterior motive.
I mean, I just said in my post the investment strategy that makes sense to me. But I'm here for knowledge exchange, not pumping.
Here's the thing: we could list technical challenges and problems all day. And still, I use far more inference than a year ago. I'd use even more, a lot more, if it were faster (in latency terms). I want to buy it, and the providers want to sell it to me. So your statement that "hallucinations cripple them for .. professional uses" is just incorrect. The correct statement is "despite hallucinations, professional use is skyrocketing." Openclaw has gained something like 150,000 GitHub stars in the last month. People are using inference at all levels of society.
I propose to you that if in fact we get some sort of AGI that is 10,000x more compute-efficient than transformer architectures, then datacenter investment losses will no longer matter in a material way to almost anyone in the world. So you might be right, but you've already got that 'trade' or 'return' banked: cheap, ubiquitous AGI, if it happens as you propose, will provide broad benefits. In those terms, you're sort of doubling up on your short by not getting some upside exposure to the long.
Re: MSFT, yep, it's a contrarian position. That said, I'm interested in an informed short perspective on MSFT: do you think that the loss of Windows licensing revenue would offset the benefit of being the world's "safe" local AI datacenter provider? And are you sure that there is even a reduction in Windows licensing? Satya's comments in a recent interview made it sound like they see agentic usage multiplying Windows licenses: basically, when you spin up a web agent, it will lease a Windows license to run the browser, and parallel agents mean multiple simultaneous leases, so they are seeing more and more revenue shift to Azure, away from direct licenses for desktops. To me, it feels like this could be an incredible new era of platform lock-in for them: the Azure stack is the only way to safely run GPT-5 in a nationally protected datacenter, and, by the way, once you're signed up, one contract gets you a full MS software license.
The physical data centers, including power, cooling, and fiber connectivity, will be needed. Demand for compute capacity in some form is effectively infinite. But the current generation of CPUs / GPUs / TPUs inside those data center racks might turn out to be worthless if another disruptive innovation comes along.
I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.
When I last did it, 5.X Thinking (can't remember which version it was) had this terrible habit of code-switching between English and Portuguese that made it sound like a robot (an agent meant to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.
I can't explain it any way other than: "5.X Thinking interprets this body of work in a way that is plausible, but that I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if the work is being looked at only very superficially, or by a high-schooler".
Gemini 3, at the time, was the worst of them, with some hallucinations, date mix-ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost, making very outlandish interpretations of the work. To be honest, it sort of feels like Gemini hasn't been able to progress on this task since 2.5 Pro (it has definitely improved on other things; I've recently switched to Gemini 3 on a product that was using 2.5 before).
Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 Pro, but not exceedingly so. It's all very subjective, but the best I can say is that it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus, when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).
Will bring back results soon.
Edit:
I (re-)tested:
- Gemini 3 (Pro)
- Gemini 3 (Flash)
- GPT 5.2
- Sonnet 4.5
Having now seen Opus 4.5, these all seem very similar to one another, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but compared with Opus 4.5 they're all in the same ballpark.
These models produce rather superficial analyses (when compared with Opus 4.5), missing several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors who serve as inspiration (Opus 4.5 gets them right; the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and their meaning; the other models identify most of the symbols, but sometimes fail to grasp the meaning).
Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.
As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4, which are technically correct but miss key transitions in form and content. Looking back, I personally agree with the 7 (maybe 6), but definitely not 4.
These models also clearly get some facts mixed up that Opus 4.5 did not (such as inferred timelines for some personal events). Since posting my comment to HN, I've been engaging with Opus 4.5 and have managed to get it to also slip up on some dates, but not nearly as much as the other models.
The other models also seem to produce shorter analyses, with a tendency to hyperfocus on a few specific aspects of my poetry while missing a bunch of others.
--
To be fair, all of these models produce very good analyses that would take a person a lot of patience and probably weeks or months of work (which of course will never happen; it's a thought experiment).
It is entirely possible that the extremely simple prompt I used just works better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models, and they've never really given me this level of... fidelity... about how I view my own work.