
One of the biggest problems with CPUs is legacy. Tie yourself to any legacy, and now you're spending millions of transistors to make sure some way that made sense ages ago still works.

Just as a thought experiment, consider the fact that the i80486 has 1.2 million transistors. An eight core Ryzen 9700X has around 12 billion. The difference in clock speed is roughly 80 times, and the difference in number of transistors is 1,250 times.

These are wild generalizations, but let's ask ourselves: If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times faster (even taking hyperthreading into account) than an i80486 at the same clock? 500 times? 100 times?

It doesn't, because massive amounts of those transistors go to keeping things in sync, dealing with changes in execution, folding instructions, decoding a horrible instruction set, et cetera.

So what might we be able to do if we didn't need to worry about figuring out how long our instructions are? Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order execution work in ways where much more could be in flight, and the compilers / assemblers knew how to avoid stalls based on dependencies, or how to schedule around them? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?

Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
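The ratios in this thought experiment can be sanity-checked with quick arithmetic. A sketch, assuming a 66 MHz 486DX2, a ~5.5 GHz 9700X boost clock, and the 12-billion figure spread over 8 cores (all approximate public numbers, not measurements):

```python
# Rough sanity check of the ratios in the comment above.
# All figures are approximate, not measured.
i486_transistors = 1.2e6
i486_clock_hz = 66e6          # a common 486DX2 clock

ryzen_transistors = 12e9      # whole 8-core package, approximate
ryzen_cores = 8
ryzen_clock_hz = 5.5e9        # 9700X boost clock, approximate

transistors_per_core_ratio = (ryzen_transistors / ryzen_cores) / i486_transistors
clock_ratio = ryzen_clock_hz / i486_clock_hz

print(round(transistors_per_core_ratio))  # 1250
print(round(clock_ratio, 1))              # 83.3
```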



Modern CPUs don't actually execute the legacy instructions, they execute core-native instructions and have a piece of silicon dedicated to translating the legacy instructions into them. That piece of silicon isn't that big. Modern CPUs use more transistors because transistors are a lot cheaper now, e.g. the i486 had 8KiB of cache, the Ryzen 9700X has >40MiB. The extra transistors don't make it linearly faster but they make it faster enough to be worth it when transistors are cheap.

Modern CPUs also have a lot of things integrated into the "CPU" that used to be separate chips. The i486 didn't have on-die memory or PCI controllers etc., and those things were themselves less complicated then (e.g. a single memory channel and a shared peripheral bus for all devices). The i486SX didn't even have a floating point unit. The Ryzen 9000 series die contains an entire GPU.


> If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times faster (even taking hyperthreading into account) than an i80486 at the same clock? 500 times? 100 times?

Would be interesting to see a benchmark on this.

If we restricted it to 486 instructions only, I'd expect the Ryzen to be 10-15x faster. The modern CPU performs out-of-order execution, running some instructions in parallel even in single-core, single-threaded code, not to mention superior branch prediction and far more cache.

If you allowed modern instructions like AVX-512, then the speedup could easily be 30x or more.

> Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.

I doubt you'd get significantly more performance, though you'd likely gain power efficiency.

Half of what you described in your hypothetical instruction set is already implemented in ARM.


A Ryzen is muuuuch more than 10-15x faster than a 486, and AVX et al do diddly squat for a lot of general-purpose code.

Clock speed is about 50x and IPC, let's say, 5-20x. So it's roughly 500x faster.
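As a back-of-envelope check of that "roughly 500x" (the 5-20x IPC range is this comment's guess, not a measurement), 500 is the geometric middle of the resulting range:

```python
import math

clock_ratio = 50.0
ipc_low, ipc_high = 5.0, 20.0

low = clock_ratio * ipc_low      # 250.0: pessimistic IPC estimate
high = clock_ratio * ipc_high    # 1000.0: optimistic IPC estimate
mid = math.sqrt(low * high)      # geometric mean of the range: 500.0

print(low, high, mid)
```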


I meant a comparison on a clock-for-clock level. In other words, imagine either the 486 running at the clock speed of a Ryzen, or the Ryzen running at the clock speed of the 486. In other other words, compare ONLY IPC.

The line I was commenting on said:

> If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times faster (even taking hyperthreading into account) than an i80486 *at the same clock*?

Emphasis added by me.


In terms of FLOPS, Ryzen is ~1,000,000 times faster than a 486.

For serial, branchy code it isn't a million times faster, but that has almost nothing to do with legacy and everything to do with the nature of serial code: you can't linearly improve serial execution with architecture and transistor counts (you can improve it sublinearly); the big serial gains came from Dennard scaling instead.

It is worth noting, though, that purely via Dennard scaling, Ryzen is already >100x faster! And via architecture (those transistors) it is several multiples beyond that.

In general compute, if you clocked it down to 33 or 66 MHz, a Ryzen would still be much faster than a 486, due to using those transistors for ILP (instruction-level parallelism) and TLP (thread-level parallelism). But you won't see any TLP in the kind of single serial program a 486 would have been running, and you won't get any of the SIMD benefits either, so you won't get anywhere near that in practice on 486 code.

The key to contemporary high performance computing is having more independent work to do, and organizing the data/work to expose the independence to the software/hardware.
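That "independent work" point can be shown with a toy issue-width model (a hypothetical 4-wide machine with 1-cycle ops; my own sketch, not a real CPU simulator):

```python
def cycles(num_ops, issue_width, dependent):
    """Toy model: a chain of dependent ops serializes completely,
    while fully independent ops fill every issue slot each cycle."""
    if dependent:
        return num_ops                 # each op waits on the previous one
    return -(-num_ops // issue_width)  # ceil division: ops issue in parallel

# 1000 ops on a hypothetical 4-wide core:
print(cycles(1000, 4, dependent=True))   # 1000 cycles: width doesn't help a serial chain
print(cycles(1000, 4, dependent=False))  # 250 cycles: independent work gets the 4x
```

Real machines sit between these extremes, which is why exposing independence in the data layout matters so much.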


That's exactly why Intel proposed x86S:

that's basically x86 without real mode and the 16-bit legacy modes; the system side is 64-bit only (32-bit user mode sticks around for compatibility).

The CPU starts up already initialized in 64-bit mode, without all that legacy crap.

That's IMO a great idea. I think every few decades we need to stop, think again about what works best, and either take a fresh start or drop some unused legacy features.

RISC-V has only a mandatory base set of instructions, as small as possible while still being Turing complete, and everything else is an extension that can (theoretically) be removed in the future.

This could also be used to remove legacy parts without disrupting the architecture.
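RISC-V even spells this out in its ISA naming strings: a mandatory base like rv64i, then letters and Z-names for extensions. A toy parser of my own (the real naming rules are more involved, with versions and the 'g' shorthand):

```python
def parse_isa(isa):
    """Toy parser for a RISC-V ISA string like 'rv64imafdc_zicsr':
    the mandatory base (rv64i) plus optional extensions. Real naming
    rules also cover versions, 'g' shorthand, vendor X-extensions, etc."""
    main, *named = isa.lower().split("_")
    base, letters = main[:5], main[5:]   # 'rv64i' is the base integer ISA
    return base, list(letters) + named

base, exts = parse_isa("rv64imafdc_zicsr")
print(base, exts)  # rv64i ['m', 'a', 'f', 'd', 'c', 'zicsr']
```

Everything outside the base is, at least in principle, droppable without touching the architecture itself.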


Would be interesting to compare transistor count without L3 (and perhaps L2) cache.

16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.

It is weird that the best consumer GPU can only do about 4 TFLOPS FP64. Some years ago GPUs were an order of magnitude or more faster than CPUs. Today consumer GPUs are likely artificially limited in FP64.
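The 2 TFLOPS figure follows from peak-rate arithmetic. A sketch, assuming full 512-bit FP64 datapaths, 2 FMA pipes per core, and ~5 GHz all-core (the clock and pipe count are assumptions; sustained clocks are lower in practice):

```python
cores = 16
clock_hz = 5.0e9        # assumed all-core clock; optimistic
fma_pipes = 2           # assumed FMA units per core
fp64_lanes = 8          # 512-bit vector / 64-bit doubles
flops_per_fma = 2       # one multiply + one add

peak = cores * clock_hz * fma_pipes * fp64_lanes * flops_per_fma
print(peak / 1e12)      # 2.56 theoretical TFLOPS FP64
```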


E.g. AMD Radeon PRO VII with 13.23 billion transistors achieves 6.5 TFLOPS FP64 in 2020 [1].

[1] https://www.techpowerup.com/gpu-specs/radeon-pro-vii.c3575


> 16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.

These aren't realistic numbers in most cases because you're almost always limited by memory bandwidth, and even when bandwidth isn't an issue you have to worry about thermals. The theoretical CPU compute ceiling is almost never the real bottleneck. GPUs have a very different architecture, with much higher memory bandwidth and chips run slower and cooler (lower clock frequency), so they can reach much higher numbers in practical scenarios.
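A quick roofline-style estimate shows why. A sketch, assuming ~2.56 TFLOPS FP64 peak and ~100 GB/s of dual-channel DDR5 bandwidth (both round, assumed numbers):

```python
peak_flops = 2.56e12   # assumed theoretical FP64 peak
bandwidth = 100e9      # bytes/s, rough dual-channel DDR5 figure

# Arithmetic intensity (flops per byte) needed to be compute-bound
# rather than bandwidth-bound:
ai_needed = peak_flops / bandwidth
print(round(ai_needed, 1))   # 25.6 flops per byte streamed
print(round(ai_needed * 8))  # ~205 flops per FP64 value loaded from DRAM
```

Very little real code does hundreds of flops per loaded value, which is why the peak is rarely reached.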


Sure, not for BLAS Level 1 and 2 operations. But not even for Level 3?
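Level 3 is indeed the exception: for an N x N matrix multiply the flop count grows as N^3 but the data only as N^2, so arithmetic intensity rises with N. A toy calculation (ignoring caches and blocking):

```python
def gemm_intensity(n):
    """Ideal flops-per-byte of an n x n FP64 matrix multiply,
    assuming A, B and C are each streamed exactly once."""
    flops = 2 * n**3             # one multiply + one add per inner step
    bytes_moved = 3 * n * n * 8  # three FP64 matrices
    return flops / bytes_moved

print(gemm_intensity(48))    # 4.0 flops/byte: still bandwidth-bound
print(gemm_intensity(4800))  # 400.0 flops/byte: easily compute-bound
```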


Huh, consumer GPUs are doing Petaflops of floating point. FP64 isn't a useful comparison because FP64 is nerfed on consumer GPUs.


Even the recent nVidia 5090 does 104.75 TFLOPS FP32.

It's a useful comparison in terms of achievable performance per transistor.


But history showed exactly the opposite: if you don't have an already existing software ecosystem, you are dead. The transistors for implementing x86 peculiarities are very much worth it as long as the market wants x86.


GPUs scaled wide, with cores of roughly 486-like transistor counts and just lots more of them: thousands to tens of thousands of cores, averaging out to maybe 5 million transistors per core.

CPUs scaled tall, with specialized instructions to make a single thread go faster. And no, the amount of work done per transistor does not scale anywhere near linearly: many of the transistors are dark on any given cycle, compared to a much simpler core that would have much higher utilization.


> Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions in to the chip?

I'm pretty sure that these goals will conflict with one another at some point. For example, the way one solves Spectre/Meltdown issues in a principled way is by changing the hardware and system architecture to have some notion of "privacy-sensitive" data that shouldn't be speculated on. But this will unavoidably limit the scope of OOO and the amount of instructions that can be "in-flight" at any given time.

For that matter, with modern chips, semaphores/locks are already implemented with hardware builtin operations, so you can't do that much better. Transactional memory is an interesting possibility but requires changes on the software side to work properly.


If you have a very large CPU count, then I think you could dedicate a CPU to processing only a designated privacy/security-focused execution thread, perhaps via a specially designed syscall.

That kind of takes the Spectre/Meltdown thing out of the way to some degree, I would think, although privilege escalation can happen in the darndest places.

But maybe I'm being too optimistic


Isn't the problem the "labelling" of "privacy-sensitive" in the first place?


If you look at Zen 5 die shots, half of the space is taken by L3 cache.

And from each individual core:

- 25% per core L1/L2 cache

- 25% vector stuff (SSE, AVX, ...)

- of the remaining 50%, only about 20% is doing instruction decoding

https://www.techpowerup.com/img/AFnVIoGFWSCE6YXO.jpg
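Taking those rough percentages at face value, the decoder's share of a core works out to about a tenth (simple arithmetic on the figures quoted in this comment):

```python
l1_l2_cache = 0.25   # share of core area, per the die shot estimate
vector_units = 0.25  # SSE/AVX etc.
rest = 0.50          # everything else
decode = rest * 0.20 # decode is ~20% of that remaining half

print(decode)        # 0.1 -> decode is roughly 10% of a core's area
```

So even a "horrible" instruction set costs maybe a tenth of a core, and far less of the whole die once L3 is counted.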


The real issue with complex insn decoding is that it's hard to make the decode stage wider and at some point this will limit the usefulness of a bigger chip. For instance, AArch64 chips tend to have wider decode than their close x86_64 equivalents.


Aren't 99.99999% of these transistors used in cache?


Look up "CPU die diagram". You'll see the physical layout of the CPU with annotated blocks.

Zen 3 example: https://www.reddit.com/r/Amd/comments/jqjg8e/quick_zen3_die_...

So, more like 85%, or around 6 orders of magnitude difference from your guess. ;)


Gosh no. Often a majority of the transistors are used in cache, but not 99%.


CPUs can’t do that, but legacy is irrelevant. They just don’t have enough parallelism to leverage all these extra transistors. Let’s compare the 486 with a modern GPU.

Intel 80486 with 1.2M transistors delivered 0.128 flops / cycle.

nVidia 4070 Ti Super with 45.9B transistors delivers 16896 flops / cycle.

As you see, each transistor became 3.45 times more efficient at delivering these FLOPs per cycle.
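The 3.45x figure checks out, using the transistor and flops/cycle numbers quoted in this comment:

```python
# flops per cycle per transistor, from the figures above
i486_eff = 0.128 / 1.2e6       # Intel 80486
gpu_eff = 16896 / 45.9e9       # nVidia 4070 Ti Super

print(round(gpu_eff / i486_eff, 2))  # 3.45
```

So per-transistor efficiency moved surprisingly little; the gains came from having vastly more transistors and enough parallel work to keep them busy.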


> and the difference in number of transistors is 1,250 times

I should've written per core.


> and now you're spending millions of transistors

and spending millions on patent lawsuits ...


Correct me if I'm wrong, but isn't that what was tried with the Intel Itanium processor line? Only the smarter compilers and assemblers never quite got there.

What makes it more likely to work this time?


Optimizing compiler technology was still in the stone age (arguably it still is) when Itanium was released. LLVM had just been born, and GCC didn't start using SSA until 2005. E-graphs were unheard of in the context of compiler optimization.

That said, yesterday I saw gcc generate 5 KB of mov instructions because it couldn't gracefully handle a particular vector size so I wouldn't get my hopes up...



