Icelake / Sunny Cove (Intel's next generation desktop core) exposes 16 architectural GPRs, but has 352 renaming slots (reorder buffer entries) behind them.
ARM's Cortex A78 application processor exposes 32 architectural GPRs, with 160 renaming slots behind them.
> I really wonder if compiler register allocators help/hurt/know about renaming which in fact would require microarchitectural knowledge.
The benefit of register renaming is that idioms like "xor eax, eax" automatically scale to whatever the reorder buffer size is.
"xor eax, eax" REALLY means " 'malloc' a register and call it EAX". Because the reorder buffer size changes between architectures and even within an architecture (smaller chips may have smaller reorder buffers), it's best to "cut all dependencies" at the compiler level, and then simply emit code and let the CPU allocate registers in whatever way is optimal for it.
The compiler doesn't care if you have 160 renaming registers (ARM A78), 224 renaming registers (Intel Skylake), or 300+ renaming registers (Intel Icelake). Emitting the "xor eax, eax" idiom at every dependency cut gives code that works well on all of them.
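To make the "malloc a register" picture concrete, here is a toy Python model of a rename table. It's only a sketch: the `Renamer` class, its method names and the `demo` helper are made up for illustration, it never frees physical registers, and real hardware is far more involved. The point is just that the zeroing idiom maps EAX to a fresh physical register with no inputs, and that the same instruction stream behaves the same whether the pool has 160, 224 or 352 entries.

```python
from collections import deque


class Renamer:
    """Toy rename table: architectural register name -> physical register id."""

    def __init__(self, num_physical):
        self.free = deque(range(num_physical))  # free list of physical registers
        self.table = {}                          # current name -> physical mapping

    def write(self, dest, sources=()):
        """Rename one instruction that writes `dest` and reads `sources`.

        Returns (physical register now holding dest, physical registers read).
        """
        deps = [self.table[s] for s in sources if s in self.table]
        phys = self.free.popleft()   # the "malloc": take any free physical register
        self.table[dest] = phys
        return phys, deps

    def zero(self, dest):
        """Zeroing idiom (xor dest, dest): writes dest and reads nothing."""
        return self.write(dest)


def demo(num_physical):
    r = Renamer(num_physical)
    r.write("ebx")                                  # something produces ebx
    old_eax, _ = r.write("eax", ["ebx"])            # mov eax, ebx
    _, add_deps = r.write("eax", ["eax", "ebx"])    # add eax, ebx (reads old eax)
    new_eax, zero_deps = r.zero("eax")              # xor eax, eax (reads nothing)
    return old_eax, add_deps, new_eax, zero_deps


# Same instruction stream, three different pool sizes (the figures quoted above).
for size in (160, 224, 352):
    print(size, demo(size))
```

In this model the `add` depends on two earlier physical registers, while the `xor` gets a brand new one and an empty dependency list, and nothing in the instruction stream had to know the pool size.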
----------
You should have a large enough architectural register set to perform the calculations you need... in practice, 16 to 32 registers seem to be enough.
With the "dependency cutting" paradigm (aka: "xor eax, eax" really means malloc-register), your compiler's code will scale to all future processors, no matter the size of the reorder buffer of the particular CPU it ends up running on.
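A rough way to see why cutting dependencies matters: the sketch below computes the length of the longest dependency chain for an instruction stream, with and without renaming. Everything here is an assumption for illustration (the `critical_path` function, one cycle per instruction, and a crude rule that without renaming a write must wait for the previous value held under the same name). It is not how any real scheduler is implemented.

```python
def critical_path(instrs, rename=True):
    """Length of the longest dependency chain, at 1 cycle per instruction.

    `instrs` is a list of (dest, sources) pairs using architectural names.
    """
    ready = {}    # register name -> cycle at which its latest value is ready
    longest = 0
    for dest, sources in instrs:
        # True (read-after-write) dependencies on the inputs.
        start = max((ready.get(s, 0) for s in sources), default=0)
        if not rename:
            # Without renaming, reusing a name also forces a wait on the
            # previous value stored under that name (crude WAW/WAR model).
            start = max(start, ready.get(dest, 0))
        ready[dest] = start + 1
        longest = max(longest, ready[dest])
    return longest


# One independent computation that starts by zeroing eax.
chain = [
    ("eax", []),               # xor eax, eax
    ("eax", ["eax", "ebx"]),   # add eax, ebx
    ("eax", ["eax", "ecx"]),   # add eax, ecx
]
program = chain * 3            # three such computations, all reusing the name eax

print("with renaming:   ", critical_path(program, rename=True))
print("without renaming:", critical_path(program, rename=False))
```

With renaming, each "xor eax, eax" starts a chain that doesn't wait on the previous one, so the three computations can overlap. Without it, reusing the name EAX serializes them even though they have nothing to do with each other.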
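EDIT: It should be noted that on Intel Skylake / AMD Zen, "xor eax, eax" is so well optimized that it's recognized as a zeroing idiom and handled at the rename stage. On Intel at least, it never occupies an execution port: literally zero execution time for that instruction (it still takes up a slot in the front end like any other instruction).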