I sense this question is pretty elementary, but maybe someone can point me in th...

dragontamer · on Aug 28, 2020

CPUs run Tomasulos algorithm ( https://en.m.wikipedia.org/wiki/Tomasulo_algorithm ). Then the outputs of Tomasulos algorithm allows for out of order execution across multiple execution units.

Note that Intel chips have many (at least 10) execution units, and AMD also has 10. You need to split instructions to the different pipelines for efficient execution. I forget exactly how much... but far more than one pipeline in any case.

If pipeline0 is busy doing a divide, run additions in Pipeline5. Tomasulos algorithm converts typical assembly language into this multi-pipeline form. It also keeps track of which pipelines are busy, and finally puts things back together in the right order when done. (Division can take 80 clock cycles, while addition takes 1. You gotta stall the results of the addition instruction until after division is complete. If division came first in the machine code)

Waterluvian · on Aug 28, 2020

Thanks for this. I had no idea what to search for but this is a good start.

I guess it's unsurprising to find that even at the single cpu core level, there's another layer of parallelism. And that will of course complicate things further.

dragontamer · on Aug 28, 2020

Note: Parallelism comes in many forms on CPUs. Pipelines, superscalar, out-of-order, and speculative. Of these concepts, Tomasulo's algorithm covers Pipelines, Superscalar, and out-of-order. So its the "most complex example", but that should get you up to speed the quickest. (Branch prediction / speculative is its own thing, but I've always found the concept to be simple to grasp: you're executing instructions in parallel, so you need to guess the results of a branch before the branch is even calculated).

When it comes to actual execution of the final microop itself, its surprisingly simple. You put the bits in the right location, then a circuit magically calculates the result.

XOR and Add are simple enough (https://en.wikipedia.org/wiki/Carry-save_adder). A multiplier is harder to think about, but yes, its just a bunch of wires hooked up to NAND gates: https://www.slideshare.net/suryakrishna3785/wallace-tree-mul...

Division and Modulo are traditionally executed with microops in a multi-step process. Addition becomes subtraction through the magic of 2s complement. AND, OR, XOR, are trivial logic gates. Memory lookup is complicated, but mostly due to caching. The core concept of memory lookup (put value onto address pins, wait for memory to respond with the stored value) is simple enough.

Memory is traditionally on a "bus". The idea of a "bus" (multiple different CPU components talking on the same set of wires) is hard to grasp. But when you learn about tri-state logic (https://en.wikipedia.org/wiki/Three-state_logic), the 5V state, 0V state, and "High Impedance" state, it becomes simple.

"High Impedance" means that someone connected to the wires simply won't read, or write, on those wires. The CURRENT (amps) going in, or out, is set to zero through the magic of transistors. If you have 5 different components (C0, C1, C2, C3, and C4) on a set of wires, if C1/C2/C3 are in "High Impedance", then C0 and C4 can talk without any issues (because C1 through C3 won't use up any electricity from the wires).

The key is coordinating who can talk and at what times. But that's where "link level protocols" start to come in. Tri-state logic (in particular: high impedance) is what makes it possible.

And there ya go. A fully working CPU. That's all it is.

kens · on Aug 28, 2020

Nice description! Even a simple ALU has a lot more complexity than you're giving it credit for, though. It's not "just" a bunch of gates, but is usually highly optimized with dynamic logic and pass-transistor logic. Addition in particular is not "simple", but uses various tricks to avoid carry propagation delay. And the various ALU operations are smashed together in clever ways to avoid separate circuits for XOR, add, AND, OR, etc. Even XOR is not a "trivial logic gate"; building XOR from e.g. NAND gates is surprisingly inefficient, so processors use a variety of special tricks for XOR. My point is that actual ALUs haven't matched the simple description since about 1972.

dragontamer · on Aug 28, 2020

> Addition in particular is not "simple"

Well... optimized addition isn't simple. But any beginner can implement naive carry-save adders on a breadboard (or Verilog/FPGAs) and get a functional Adder (maybe even ALU) in a weekend.

But you're right. To optimize addition (in particular: the propagation of carries), you need to use a Kogge-Stone Adder, which is pretty sophisticated (https://en.wikipedia.org/wiki/Kogge%E2%80%93Stone_adder). And that's from my Bachelor's degree: I'm sure there are newer algorithms that are more sophisticated than Kogge-Stone.

There's a lot of issues at play here at the electricity level. IE: Fanout. All components can only carry so much current (amps). For example, a transistor may only be able to feed 5 to 10 other transistors, so building "Buffers" to increase the current so that you can actually turn on all the transistors is a thing.

Each CMOS transistor has capacitance: so you need a certain amount of electrons before the transistor turns on. Given a level of current (amps), this means you need to wait for enough electrons to get onto the input before the transistor responds.

Doing all of this with the minimal energy usage and maximum speed (minimum latency) is surely complicated. But the "naive" version suitable for beginner play is pretty simple. Like raytracers: you can write a raytracer in just 100 lines of code (it won't be as fast as a professional raytracer, nor have as many features, but you'd get the gist in a single weekend project: https://github.com/matt77hias/smallpt)

---------

In any case, once you get to the final decoded instruction, its "just" a circuit. Sure, Kogge-Stone and Wallace Tree multipliers need a bit of study before you understand them. But they're simple black-boxes. Stick the bits into the correct wires and then a few hundred picoseconds later, you got the result coming out of another wire.

Waterluvian · on Aug 28, 2020

I had and lost a big response I typed up. And I don’t have the time to replicate it. But thank you both for such thoughtful discussion. This is giving me a really good broad sense of the topic. I’ve been implementing a game Boy emulator for fun and it’s raised so many questions.

sroussey · on Aug 28, 2020

I found that getting timing correct was the hardest part (at university, where everything was done by hand — later I worked in a fab where we could make that easier for people using tools).

CalChris · on Aug 28, 2020

The original problem that Tomasulo's algorithm (register renaming) solved was that the IBM 360 Model 91 floating point unit only had 4 FP registers. Register renaming is also important on x86 (starting with Pentium Pro) which only has 8 GPRs and x86_64 which only has 16 GPRs.

These (8 and 16) are still small numbers. I wonder how important renaming is on ARMv8 or RISC-V with 32 GPRs. I really wonder if compiler register allocators help/hurt/know about renaming which in fact would require microarchitectural knowledge.

The example that Agner found is a form of register renaming. So is this AMD trying to make its legacy code base go faster (AMD+Intel do a TON of that) or is it something a compiler can actually target? I think the former.

dragontamer · on Aug 28, 2020

Of the 16 GPRs on Icelake / Sunny Cove (Intel's next generation desktop core), there are 352 renaming registers available.

Of the 32 GPRs on ARM's Cortex A78 application processor, there are 160 renaming registers.

> I really wonder if compiler register allocators help/hurt/know about renaming which in fact would require microarchitectural knowledge.

The benefit of register renaming is that idioms like "xor eax, eax" automatically scale to whatever the reorder register size is.

"xor eax, eax" REALLY means " 'malloc' a register and call it EAX". Because the reorder buffer changes between architectures and even within an architecture (smaller chips may have smaller reorder buffers), its best to "cut all dependencies" at the compiler level, and then simply emit code where the CPU allocates registers in whatever optimal way.

The compiler doesn't care if you have 160-renaming registers (ARM A78), 224-renaming registers (Intel Skylake), or 300+ registers (Intel Icelake). The "xor eax, eax" idiom on every dependency cut emits the optimal code for all CPUs.

----------

You should have a large enough architectural register set to perform the calculations you need... in practice, 16 to 32 registers seem to be enough.

With the "dependency cutting" paradigm (aka: "xor eax, eax" really means malloc-register), your compiler's code will scale to all future processors, no matter the size of the reorder buffer of the particular CPU it ends up running on.

EDIT: It should be noted that on Intel Skylake / AMD Zen, "xor eax, eax" is so well optimized its not even a micro-op. Literally zero uop execution time for that instruction.

tpxl · on Aug 28, 2020

I will add a reference to one of the highest rated stack overflow posts about branch prediction (speculative execution):

https://stackoverflow.com/questions/11227809/why-is-processi...

jasonwatkinspdx · on Aug 28, 2020

This is THE textbook on the subject: https://www.amazon.com/Computer-Architecture-Quantitative-Jo...

There are older editions available free online that cover the same concepts, they just don't have the very latest info.

ch_123 · on Aug 28, 2020

On most x86 chips since the Pentium Pro (and the K6 for AMD, IIRC) the machine code instructions which are visible to the programmer are turned into "micro operations" by the CPU's instruction decoder. See here for a introduction: https://en.wikipedia.org/wiki/Micro-operation

Waterluvian · on Aug 28, 2020

Thanks for the link! So on older chips, like maybe an 8080, I could expect to see literal hardware implementations of instructions?

klelatti · on Aug 28, 2020

Ken Shirriff has an excellent series of blog posts on early Intel chips (running from 4004 to 8086 I think). Don't believe he's done one on the 8080 but the post on the 8008 is great [1] and I'd expect that the 8080 (which followed it and was designed by the same team) is very similar.

In short no microcode but there is a PLA (Programmable Logic Array) which helps to decode the instructions.

[1] http://www.righto.com/2016/12/die-photos-and-analysis-of_24....

kens · on Aug 28, 2020

Thanks for the nice comment! I haven't looked into the 8080 because other people have examined it in detail. See https://news.ycombinator.com/item?id=24101956 for an exact Verilog representation of the 8080.

Const-me · on Aug 28, 2020

Modern CPUs don’t have RISC cores, complex instructions are all the way down.

On current-gen CPUs, only rarely used instructions are decoded into multiple micro-ops, the majority of them decode into a single micro-op, even when they do many things at once like memory load + compute. More often the opposite happens, multiple AMD64 instructions are merged into a single micro-op, this is called “micro-ops fusion”.

dragontamer · on Aug 28, 2020

> Modern CPUs don’t have RISC cores, complex instructions are all the way down.

Proof: Intel's LEA instruction is executed as one uop. That's a super complicated instruction that you'd expect to be split into other uops, but... lo and behold, complex all the way down to the end silicon.

Even a "risc" machine: ARM has macro-op fusion, with AESE / AESMC instructions. (Macro-op fuse AESE / AESMC into a singular AES macro-op. A full AES cycle includes both steps, so its very rare to ever need to only do one of the steps).

As such, I've considered "RISC" to be a misnomer. Even RISC-V and ARM are now doing macro-op fusion (and micro-ops). So what exactly is RISC mean anyway?

saagarjha · on Aug 28, 2020

You can't do an operation on something in memory directly?

dragontamer · on Aug 28, 2020

That's just "Load/Store Architecture" though. RISC machines are usually load/store, but so are a bunch of other things, like GPUs and such. Hardly unique to RISC.

gpderetta · on Aug 28, 2020

> multiple AMD64 instructions are merged into a single micro-op, this is called “micro-ops fusion”.

very minor nit: it is actually called 'macro-op fusion', 'micro-op fusion' is when ops from a single instruction are put back together in part of the pipeline.

Otherwise, spot on.

PixelOfDeath · on Aug 28, 2020

The K5 was actually a RISC CPU with an x86 decoder on top. So we can count that one, too?!