
> GPUs are massively parallel devices, they need to keep the scheduler and ALU logic as simple and compact as possible

The simplest hardware implementation is not always the most compact or the most efficient. This is a misconception; see the example below.

> SIMT is just SIMD with some additional capabilities for control flow ..

In the Nvidia uarch, it is not. The key part of the Nvidia uarch is the "operand collector" and the emulation of a multi-ported register file using SRAM (single- or dual-port) banking. In a classical SIMD uarch, you just retrieve the full vector from the register file and execute each lane in parallel. In the Nvidia uarch, each ALU has an "operand collector" that tracks and collects the operands of multiple in-flight operations. This enables reading from the register file in an asynchronous fashion (by "asynchronous" here I mean not all in the same cycle) without introducing any stall.

When a warp is selected, the instruction is decoded, an entry is allocated in the operand collector of each used ALU, and the list of registers to read is sent to the register file. The register file dispatches the register reads to the proper SRAM banks (probably with some queuing when read collisions occur). All operand collectors then independently wait for their operands to arrive from the register file; when an operand-collector entry has received all the required operands, it is marked as ready and can be selected by the ALU for execution.
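
To make that flow a bit more concrete, here is a rough Python sketch of the banking/collection idea. Everything in it is a toy model I made up for illustration (bank count, register-to-bank mapping, arbitration policy, entry names); the real hardware is of course far more involved.

    from collections import deque

    # Toy model: single-ported register-file banks feeding operand-collector
    # entries. Bank count, register-to-bank mapping and arbitration are
    # invented for illustration only.
    NUM_BANKS = 2

    class CollectorEntry:
        def __init__(self, name, src_regs):
            self.name = name
            self.pending = set(src_regs)   # operands not yet collected

    bank_queues = [deque() for _ in range(NUM_BANKS)]

    def issue(entry):
        # Decode time: one read request per source register, routed to its bank.
        for reg in entry.pending:
            bank_queues[reg % NUM_BANKS].append((reg, entry))

    entries = [CollectorEntry("fma_entry", [0, 2, 4]),   # all map to bank 0 -> reads serialize
               CollectorEntry("add_entry", [1, 3])]      # both map to bank 1
    for e in entries:
        issue(e)

    dispatched = set()
    for cycle in range(1, 10):
        for q in bank_queues:              # each bank serves one read per cycle
            if q:
                reg, entry = q.popleft()
                entry.pending.discard(reg)
        for e in entries:
            if not e.pending and e.name not in dispatched:
                dispatched.add(e.name)
                print("cycle %d: %s has all operands, ready for dispatch" % (cycle, e.name))

In this toy run the add entry becomes ready after 2 cycles and the fma entry after 3, because its three operands collide on the same bank; neither one stalls the other.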

That is why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

Obviously you can emulate a SIMT uarch using a SIMD uarch, but I think that misses the whole point of the SIMT uarch.

Nvidia does all of this because it allows designing a more compact register file (memories with a high number of ports are costly) and probably because it helps make better use of the available compute resources with masked operations.


In an operand-collector architecture the threads are still executed in lockstep. I don't think this makes the basic architecture less "SIMD-y". Operand collectors are a smart way to avoid multi-ported register files, which enables a more compact implementation. Different vendors use different approaches to achieve a similar result: Nvidia uses operand collectors, Apple uses explicit cache control flags, etc.

> This enables reading from the register file in an asynchronous fashion (by "asynchronous" here I mean not all in the same cycle) without introducing any stall.

You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet. The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

> That is why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

It is my understanding that you need to synchronize threads when accessing shared memory. Not only can different threads execute on different SIMDs, but threads on the same SIMD can also access shared memory over multiple cycles on some architectures. I do not see how thread synchronization relates to operand collectors.


> In an operand-collector architecture the threads are still executed in lockstep.

> [...]

> It is my understanding that you need to synchronize threads when accessing shared memory.

Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lockstep"). But it's possible that Nvidia enforces that all threads from a warp must complete before issuing the next warp instruction (I would call that something like "instruction lockstep"). This can simplify data-dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

And yes, it's hard to expose de-synchronization without memory operations, so you only need to sync for memory operations. (The load/store unit also has operand collectors.)

> You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet

That's true, but you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".

> The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

Calling UE "SIMD unit" in an SIMT uarch add a lot of ambiguity, so I'm not sure a understand you point correctly. But, yes (warp) instruction is already scheduled, but (ALU) operation are re-scheduled by the operand-collector and it's dispatch. In the Nvidia patent they mention the possibility to dispatch operation in an order that prevent write collision for example.


> Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lockstep"). But it's possible that Nvidia enforces that all threads from a warp must complete before issuing the next warp instruction (I would call that something like "instruction lockstep"). This can simplify data-dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

Hm, the way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep". In my understanding the role of the operand collector was to make sure that all register arguments are available before the instruction starts executing. If the operand collector needs multiple cycles to fetch the arguments from the register file, the instruction execution would stall.

So you are saying that my understanding is incorrect and that the instruction can be executed in multiple passes with different masks depending on which arguments are available? What is the benefit as opposed to stalling and executing the instruction only when all arguments are available? To me it seems like the end result is the same, and stalling is simpler and probably more energy efficient (if EUs are power-gated).

> But yes, the (warp) instruction is already scheduled; however, (ALU) operations are re-scheduled by the operand collector and its dispatch. In the Nvidia patent they mention the possibility of dispatching operations in an order that prevents write collisions, for example.

Ah, that is interesting, so the operand collector provides a limited reordering capability to maximize hardware utilization, right? I must have missed that bit in the patent, that is a very smart idea.

> But it's possible that Nvidia enforces that all threads from a warp must complete before issuing the next warp instruction (I would call that something like "instruction lockstep"). This can simplify data-dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)? Many GPUs claim dual-issue capability, but that either refers to interleaved execution from different programs (Nvidia, Apple), SIMD-within-SIMT, or maybe even a form of long instruction word (AMD). If I remember correctly, Nvidia instructions contain some scheduling information that tells the scheduler when it is safe to issue the next instruction from the same wave after the previous one started execution. I don't know how others do it, probably via some static instruction timing information. Apple does have a very recent patent describing dependency detection in an in-order processor; no idea whether it is intended for the GPU or something else.

> you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles"

I think this is essentially what some architectures describe as the "register file cache". What is nice about Nvidia's approach is that it seems to be fully automatic and can really make the best use of a constrained register file.


> I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep".

OK, I see; that is definitely not what I understood from my study of the Nvidia SIMT uarch. And yes, I will claim that "the instruction can be executed in multiple passes with different masks depending on which arguments are available" (using your words).

> So the operand collector provides a limited reordering capability to maximize hardware utilization, right?

Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD

> What is the benefit as opposed to stalling and executing the instruction only when all arguments are available?

That's a good question. Note that I think the Apple GPU uarch does not work like the Nvidia one; my understanding is that the Apple uarch is much closer to a classical SIMD unit. So it's definitely not a killer to move away from Nvidia's original SIMT uarch.

That said, I think the SIMT uarch from Nvidia is way more flexible and better maximizes hardware utilization (executing instructions as soon as possible always helps utilization). And say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores. With a classical SIMD uarch it may be possible, but you need extra hardware to handle warp execution overlapping, and even more hardware to enable overlapping more than 2 threads. See the toy sketch below.
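
As a toy illustration of that complementary-mask case (the masks and the packing condition are invented for illustration, not how any real scheduler is specified):

    # Two 8-wide warps with complementary execution masks. If their active
    # lanes don't overlap and fit in one set of ALUs, a flexible scheduler
    # can pack them into a single cycle; a classical SIMD pipe would issue
    # them over two cycles.
    WARP_WIDTH = 8

    warp_a_mask = [1, 0, 1, 0, 1, 0, 1, 0]
    warp_b_mask = [0, 1, 0, 1, 0, 1, 0, 1]

    overlap = sum(a & b for a, b in zip(warp_a_mask, warp_b_mask))
    active  = sum(warp_a_mask) + sum(warp_b_mask)

    if overlap == 0 and active <= WARP_WIDTH:
        print("co-issue: the two warps together fill the ALUs in one cycle")
    else:
        print("issue the warps in separate cycles")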

Also, Nvidia's operand collector allows emulating a multi-ported register file, which probably helps with register sharing. There are actually multiple patents from Nvidia about non-trivial register allocation within the register-file banks, depending on how the register will be used, to minimize conflicts.

> Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)?

It's not obvious what "superscalar" would mean in a SIMT context. For me a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread. With SIMT most of the instruction parallelism is very explicit (as thread parallelism), so it's not really "extracted" (and not from the same thread). But anyway, if your question is whether multiple instructions from a single warp can be executed in parallel (across different threads), then I would say probably yes for Nvidia (not sure, there is very little information available...); at least 2 instructions from the same thread block (from the same program, but different warps) should be able to execute in parallel.

> I think this is essentially what some architectures describe as the "register file cache"

I'm not sure about that; there are actually some published papers (and probably some patents) from Nvidia about a register-file cache for the SIMT uarch, and that came after the operand-collector patent. But in the end it really depends on what concept you are referring to with "register-file cache".

In the Nvidia case a "register-file cache" is a cache placed between the register file and the operand collector. And it makes sense in their case since the register file has variable latency (depending on collisions) and because it saves SRAM read power.


> Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD

I understand, yes, it makes sense. Of course, other architectures can make other optimizations, like selecting warps that are more likely to have data ready etc., but Nvidia's implementation does sound like a very smart approach

> And say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it is natural to issue both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores

That is indeed a powerful technique

> It's not obvious what "superscalar" would mean in a SIMT context. For me a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread.

Yes, I meant executing multiple instructions from the same warp/thread concurrently, depending on the execution granularity of course. Executing instructions from different warps in the same block is slightly different, since warps don't need to be at the same execution state. Applying CPU terminology, a warp is more like a "CPU thread". It does seem like Nvidia has indeed moved quite far in the SIMT direction, and their threads/lanes can have independent program state. So I think I can see the validity of your argument that Nvidia can remap SIMD ALUs on the fly to suitable threads in order to achieve high hardware utilization.

> In the Nvidia case a "register-file cache" is a cache placed between the register file and the operand collector. And it makes sense in their case since the register file has variable latency (depending on collisions) and because it saves SRAM read power.

Got it, thanks!

P.S. By the way, wanted to thank you for this very interesting conversation. I learned a lot.


Modern chip designs have an enormous amount of logic and therefore standard cells. When you are dealing with a huge number of cells all together, it quickly becomes unmanageable: synthesis tool runtimes explode, quality of results declines, results become chaotic, etc.

So chip designs are split into partitions. Each partition is a part of your design that you synthesize separately. For example, you may set up a partition for the core, and then instantiate it multiple times into a core_cluster partition.

Note that the synthesizer's logic optimizer cannot work on logic across partitions, so you don't want too-small partitions (otherwise you will have more manual optimization to do), but you also don't want too-big partitions (otherwise runtime and development iteration time increase).

The question is: what is a good size for a partition?

* An ALU is ~10K cells (synthesis runtime ranges from a few seconds to ~5 min)

* A small (low-end) core is ~1M cells (synthesis runtime ranges from 1 to 8 hours)

In Intel terminology, the "sea of FUBs" approach prefers small partitions, while the "sea of cells" approach prefers big partitions.

About the predominance of latches or flops: it's mainly a consequence of the level of manual optimization (latches are smaller but harder to manage, and they give diminishing returns on new process nodes). Same for process-node-specific vs. process-node-agnostic design.

PS: Most modern designs are "sea of cells" according to the Intel terminology.


Also, note that:

- There is only one Zero (encoded as 0x00), no negative-zero

- There is only one NaN (encoded as 0x80)


-0, qNaNs, and sNaNs of the world unite to raise an exception to this travesty!


Are you aware that the x86 and ARM/POWER/RISC-V memory consistency models are really different? You can encounter very sneaky multithreading bugs when running a program on ARM/POWER/RISC-V that you have only tested on x86.

Apple has actually put a lot of effort into making the x86-to-ARM transition as smooth as possible regarding the memory consistency model; this is a strong indication that it's not as trivial as you seem to think.


You have to be doing pretty strange things with threads, without using mutexes or atomic instructions, to have a problem with this.

RISC-V has a standardised (optional) TSO memory model mode, plus a "fence.tso" instruction that works in normal mode, even on CPUs that predate it (it defaults to the stronger "fence rw,rw" in that case). Well, assuming the CPU core designer read the spec carefully and doesn't trap on unknown "fence" instructions (looking at you, THead...)


Not sure exactly what your thought is, since the optimizations you quote don't really take advantage of any exposed optimization feature of the language.

Are you asking why we could not expose optimization features (e.g. branch hints to replace branch prediction) in the programming language?

Are you asking if it's difficult for a compiler to optimize some types of high-level languages?

Since "higher-level" language is C in your question, what the "language that already exposes the optimizations in a friendly way" ? Are you thinking the CPU µop as a language ?


> unless the compiler devs are working for the same company that makes the CPU

Every CPU manufacturing company has a compiler team.

> This will never work

VLIW processors do work, and have for a while now. This type of architecture performs better on data-intensive workloads, so you don't see them in the general-purpose world.

But if you are talking about Mill, yes it will never work.


> But if you are talking about Mill, yes it will never work.

Maybe Mill Computing won't pull it off, but why do you think the approach is bad? It seems sufficiently different from Itanium to not have the exact same problems.


They don't have a "deliver at all costs" philosophy, and they are creating a new and different CPU architecture from scratch, a task at which even teams oriented toward delivering at all costs almost always fail.

The architecture itself looks very good. If it were built in a way similar to RISC-V, it would probably become very influential. (But then, I'm not sure you can create something that innovative by the same process by which RISC-V was created.)


The architecture looks like a dead end compared to other novel architectures like the Reduceron, which reached 33% of the performance of a single Core 2 Duo core while running on an FPGA at only 90 MHz, without using pipelining or out-of-order execution. The catch is that the execution model of the Reduceron is graph reduction, which basically means lambda calculus / lazy functional programming languages. The Reduceron's raw arithmetic performance isn't better, but that doesn't matter, because lazy evaluation is memory intensive: the more memory-intensive the algorithm, the better the Reduceron performs. This is very unlike classical C machines such as x86, ARM, or RISC-V, where memory intensity must be avoided at all costs for good performance. Where the Reduceron does a single graph reduction per clock cycle, a C machine has to do it with several sequential instructions, so a massive clock-rate advantage isn't worth much.


Well, first of all, because it has shown no results after 10+ years. There is definitely no indication that it will work someday.

And above all, because there are too many choices that are too specific, outdated, or too exotic (e.g. the split-stream encoding is way too exotic).

I work in a small company that makes processors, and I know from experience that developing a processor is a very complicated subject; you have to go step by step (Mill does not). When you come up with a new design/idea, you simulate it, test it, and implement it. You don't pile up new ideas without getting feedback on them.


VLIW works _really_ well if you have deterministic memory timings, e.g. DSP/RISC-style architectures, and you're up for writing a complicated compiler.


And at least the Intel compiler was/is pretty good at producing a single binary that uses whatever features your CPU has available by switching to different code paths depending on the cpuid flags. At least if your CPU is an Intel CPU, it's more conservative with that on AMD CPUs.


> Every CPU manufacturing company has a compiler team

Sure, but that doesn't help when all the binaries are compiled for AMD64 anyway.


You can compile your own binaries if max performance is the goal.

Assuming you have access to the source code of course.


Intel and AMD are so good at running AMD64 code that it's often slower to use additional instructions.


You could do unary encoding in a parallel register (and conversely binary encoding using a serial bit stream).

The essence of unary encoding is that the number is encoded by the number of bits set to 1 in the word, not by the position of the 1s in the word (as binary does). E.g. using unary you can encode the integer 3 as "00000111".

But in the paper, they encode real numbers between 0 and 1 using unary, not integers. Using unary, a real number is represented as the ratio of the number of bits set to the total number of bits in the word. E.g. the word "00000111" encodes 3/8.

Also, they use rate-coded unary numbers, meaning they do not require the 1s to be stacked at the beginning of the word; the 1s are randomly placed within the word.

This allows you to implement a stochastic multiplier using a bitwise AND. [1]

e.g. 4/8 could be encoded as "01011001"

2/8 could be encoded as "10010000"

and you can compute the product 4/8 * 2/8 using a bitwise AND of the two words:

"01011001" AND "10010000" = "00010000"

"00010000" encode 1/8 (and 4/8 * 2/8 = 1/8).

Using a serial bit stream rather than a parallel register allows using a single AND gate.
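
A minimal Python sketch of the worked example above (the helper name `value` is just something I picked for illustration):

    def value(word):
        # Rate-coded unary: value = (number of 1 bits) / (word length).
        return word.count("1") / len(word)

    a = "01011001"                      # encodes 4/8
    b = "10010000"                      # encodes 2/8
    product = format(int(a, 2) & int(b, 2), "0%db" % len(a))   # bitwise AND

    print(product, value(a), value(b), value(product))   # 00010000 0.5 0.25 0.125

Note that the product comes out as exactly 1/8 here, but with rate-coded (randomly placed) bits the AND only gives the right value in expectation; longer words reduce the variance.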

[1]: https://en.wikipedia.org/wiki/Stochastic_computing


Building an x86/ARM/RISC-V desktop- or server-class CPU core is more about the microarchitecture.

And RISC-V is an ISA, which is part of the architecture, not the microarchitecture.

"Ecosystem" here refers to: the compilers and tools, supported OS, the suggested ISA extensions, the research movements who work and experiment with it, and so on.


> Separate FP registers - This looks to have started when FPUs were optional and/or physically separate, but that's no longer the case.

Using a separate register set for FP is not just about making floats optional. It also allows better isolating the float and int units and building a more efficient microarchitecture.

For example, using a single physical register bank for both floats and integers would be expensive (as the size of a register bank grows roughly quadratically with the number of read/write ports), so using separate physical register banks for float and integer is more efficient.
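
As a back-of-the-envelope illustration of the port-count argument (the quadratic area model is only a common first-order approximation, and the register and port counts below are made up):

    def rf_area(num_regs, num_ports, unit=1.0):
        # First-order model: a register-file bit cell grows roughly with
        # (number of ports)^2 because of the wordline x bitline grid.
        return unit * num_regs * num_ports ** 2

    # Made-up example: int and FP pipes each want 4 ports.
    unified = rf_area(num_regs=64, num_ports=8)      # one file serving both pipes
    split   = rf_area(32, 4) + rf_area(32, 4)        # two smaller dedicated files

    print(unified / split)   # ~4x larger for the unified file in this toy model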


The difficulty is not decoding a single instruction; the difficulty is decoding multiple instructions in parallel (say, 5 to 8 instructions in parallel).

In a modern high-performance processor, instructions are decoded in batches. Decoding the first instruction is straightforward, but x86 instructions range from 1 to 15 bytes, so the second instruction can start at a byte offset from 1 up to 15, the 3rd instruction has a byte offset ranging from 2 to 30, and so on. Furthermore, figuring out an x86 instruction's length requires reading several bytes of the instruction.

In the end, the 8th instruction has 99 possible byte offsets, and assuming we put, as you suggest, a decoder for each position and length, we would need about 1590 decoders and many multiplexers to decode 8 full instructions per cycle.

Of course we don't do that; it would consume a lot of energy for nothing.

To handle that, modern x86 instruction decoding involves an instruction-length decode stage before the instruction decode proper. The instruction-length decode is responsible for identifying instruction positions and boundaries, and it is a challenging part of an x86 processor to design. We don't know exactly how Intel or AMD do instruction-length decoding, but we know that some published techniques include a length predictor.

That's why, for simplicity and energy efficiency, instruction boundaries must be easily identified and the number of instruction lengths must be kept low.
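
A toy sketch of why boundary finding is inherently serial for a variable-length encoding (the `length_at` table is invented; real x86 length decoding has to parse prefixes/ModRM/SIB to get each length):

    # Length of the instruction starting at each byte offset, as it would be
    # determined by partially decoding the bytes there. Values are made up.
    length_at = {0: 3, 3: 1, 4: 5, 9: 2, 11: 4, 15: 1}

    def find_boundaries(start, count):
        # The offset of instruction i+1 depends on the length of instruction i,
        # so without prediction this is a serial chain of dependent lookups.
        offsets = []
        pc = start
        for _ in range(count):
            offsets.append(pc)
            pc += length_at[pc]
        return offsets

    print(find_boundaries(0, 6))   # [0, 3, 4, 9, 11, 15]

With a fixed-length 4-byte encoding, the i-th offset is simply start + 4*i, so every decoder can be pointed at its bytes immediately and in parallel.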


> In the end, the 8th instruction has 99 possible byte offsets, and assuming we put, as you suggest, a decoder for each position and length, we would need about 1590 decoders and many multiplexers to decode 8 full instructions per cycle.

Um... wat? No CPU tries to decode 99 bytes of memory in a cycle. ADL is at 32 currently, I believe. And the instruction starting at byte 12 doesn't change depending on anything but its own data. It either exists (because the previous instruction ended on byte 11) or it doesn't. So you decode 32 instructions starting at each byte you've fetched (the last ones can be smaller subset engines because they don't need to decode longer instruction forms), and then mask them on or off based on earlier instruction state. Then feed your 1-32 decoded instructions through a mux tree to pack them and you're done.

Surely there's more complexity, since this is going to have to be pipelined in practice, and a depth of 32 is going to require something akin to a carry-lookahead adder instead of being chained.

But the combinatorics you're citing seem ridiculous, I don't understand that at all.


> Um... wat? No CPU tries to decode 99 bytes of memory in a cycle

Actually, no x86 processor decodes 8 instructions in parallel. This is an example to illustrate how the number of possible offsets scales with 15 instruction lengths.

> So you decode 32 instructions starting at each byte you've fetched

No, you don't do that; it's too power-consuming.

> But the combinatorics you're citing seem ridiculous, I don't understand that at all.

What I'm trying to explain is that decoding 8 x86 instructions in parallel is hardly possible, while decoding 8 instructions (or more) per cycle from a RISC architecture is never a problem.


> No, you don't do that; it's too power-consuming.

Uh... yes you do? How else do you think it works? I'm not saying there's no opportunity for optimization (e.g. you only do this for main memory fetches and not uOp execution, pipeline it such that the full decode only happens a stage after length decisions, etc...), I'm saying that it isn't remotely an intractable power problem. Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into a Apple vs. x86 flame war?


> Uh... yes you do? How else do you think it works?

No, I literally explained it in my first answer. The part about "1590 decoders" is irrelevant since I misunderstood your message (I thought you were talking about using 16 decoders to decode the 16 instruction lengths of a single instruction).

But the rest, on instruction-length decode, is how you actually do it.

> I'm saying that it isn't remotely an intractable power problem.

I mean, obviously, if you ignore all the power consumption issues of using 32 decoders in parallel and keeping only, say, 5 of the 32 results, then yes, there's no problem.

But in reality, yes it's a problem to decode many x86 instructions in parallel.

> Just draw it out: check the gates required for a 64->128 Dadda multiplier or 256 bit SIMD operation and compare with what you'd need here. It's noise.

Yes, the energy consumption of multipliers is high, but I don't see how this is an argument for building an inefficient decoder. Also, a multiplier's power consumption depends on transistor activity, and you can expect the MSBs of the operands not to change too much. For a decoder, the transistor activity will be high.

> And your citation of "8 instructions in parallel" seems suspicious. Did I just get trolled into a Apple vs. x86 flame war?

Neither a troll nor a flame war. I don't use Apple products, mainly because I don't agree with Apple's practices. But choosing a RISC ISA does allow them to decode a lot of instructions in parallel for little energy and complexity.

I chose 8 because it is the maximum that the mainstream will currently see. You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with, say, 4 CISC instructions it will still consume more energy.


> You might argue that 8 RISC instructions are not comparable with 8 CISC instructions, but even with, say, 4 CISC instructions it will still consume more energy

Alder Lake decodes six. And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.


> And again, your intuition about power costs here is just simply wrong. Instruction decode is Simply Not a major part of the power budget of a modern x86 CPU. It's not.

I never said that instruction decode was a major part of the power budget.

And precisely: it is not, because they don't decode 32 instructions in parallel. That's the purpose of the instruction-length decoder that runs prior to instruction decode.

