I've heard from a fairly drunk Intel chip designer that CISC continues to make sense at the gate counts where full OoO cores make sense.
1) You're almost certainly decoding into u-ops even if you chose RISC, because you'll have uarch features like a separate pipeline for the AGU and the load/store queues, atomics that have to go all the way out to L2 and wait for fairly arbitrary lengths of time, etc. You can see this in cores as simple as BOOM, and it's the opinion of the RISC-V community that macro-op fusion of prescribed instruction sequences is the way to go (there's a toy fusion sketch after this list).
2) These decoders are a drop in the bucket compared to the OoO machinery's area and power budget.
3) The complex addressing modes and memory RMW operands are effectively a way to address physical registers while consuming no architectural registers and very few bits of I$ (also sketched after the list). Yes, x86 is ancient and isn't as optimal as it could be from a Huffman-encoding perspective (hlt is a single-byte opcode!), but it's pretty damn good overall. Better than AArch64 at code density and therefore I$ pressure. As an aside, I'm sorta curious what a CISC-V would look like, and whether it would set a new bar.
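To make (1) concrete, here's a toy sketch in Python of the kind of fusion I mean. Everything in it (the fusable pairs, the tuple format for instructions) is invented for illustration; it's not how BOOM or any real decoder is structured:

    # Toy sketch (not real decode logic): fuse a couple of the instruction
    # pairs the RISC-V community suggests recognizing in hardware, so two
    # architectural instructions become one internal u-op.

    # An "instruction" here is just a (mnemonic, dest, operands...) tuple.
    FUSABLE = {
        ("lui", "addi"),    # build a 32-bit constant -> one "load immediate" u-op
        ("auipc", "ld"),    # pc-relative load        -> one "load global" u-op
    }

    def decode(instrs):
        """Yield u-ops, fusing prescribed adjacent pairs when the dests match."""
        i = 0
        while i < len(instrs):
            cur = instrs[i]
            nxt = instrs[i + 1] if i + 1 < len(instrs) else None
            if nxt and (cur[0], nxt[0]) in FUSABLE and cur[1] == nxt[1]:
                yield ("fused_" + cur[0] + "_" + nxt[0],) + cur[1:] + nxt[2:]
                i += 2      # consumed both architectural instructions
            else:
                yield cur
                i += 1

    print(list(decode([
        ("lui",  "a0", 0x12345),
        ("addi", "a0", "a0", 0x678),   # fuses with the lui above
        ("add",  "a1", "a0", "a2"),    # passes through unchanged
    ])))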
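And for (3), a sketch of why a cracked memory RMW op is "free" register-wise: the intermediate only ever lives in a freshly renamed physical register. All the names here (crack_rmw, the fake free list) are made up:

    # Sketch of (3): a memory RMW instruction like x86
    #   add dword [rbx + rcx*4 + 16], eax
    # cracks into u-ops whose intermediate lives only in a freshly renamed
    # physical register -- no architectural register consumed, and the whole
    # thing cost only a handful of I$ bytes.

    free_physical = iter(range(100, 200))    # pretend free list of phys regs

    def crack_rmw(base, index, scale, disp, src):
        tmp = f"p{next(free_physical)}"      # renamer hands out a temporary
        return [
            ("load",  tmp, (base, index, scale, disp)),   # AGU + load pipe
            ("add",   tmp, tmp, src),                     # ALU, reads/writes tmp only
            ("store", (base, index, scale, disp), tmp),   # store queue entry
        ]

    for uop in crack_rmw("rbx", "rcx", 4, 16, "eax"):
        print(uop)
    # tmp ("p100") never appears in the architectural register file, yet it
    # stands in for what a load/op/store sequence would have needed a named
    # register (and extra instruction bytes) to express.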
> isn't as optimal as it could be from a huffman encoding perspective
I've been toying with the idea of literally treating decoding as decompression, where there's a special instruction to change the dictionary. I guess this'd be tantamount to implementing the decoder as an FPGA, but I'm hoping there's some reasonable version where a fairly non-dense "base encoding" becomes a pretty optimal bit stream.
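Just to pin down the shape of the idea, a toy sketch (the dictionaries and the reserved SWITCH code are completely made up):

    # Toy "decode as decompression": short codes index into the active
    # dictionary of full instructions, and one reserved code swaps the
    # dictionary.

    DICTS = {
        "int_heavy": {0x1: "add r1, r1, r2", 0x2: "ld  r3, 0(r1)", 0x3: "bne r3, r0, -8"},
        "fp_heavy":  {0x1: "fmadd f1, f2, f3", 0x2: "fld f2, 8(r1)", 0x3: "fsd f1, 0(r1)"},
    }
    SWITCH = 0x0    # reserved code: the next code names the new dictionary

    def decompress(codes, names=tuple(DICTS)):
        active = DICTS[names[0]]
        it = iter(codes)
        for c in it:
            if c == SWITCH:
                active = DICTS[names[next(it)]]   # the "change the dictionary" instruction
            else:
                yield active[c]

    print(list(decompress([0x1, 0x2, SWITCH, 1, 0x1, 0x3])))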
I've actually played with that idea as well in the past, as a mechanism for emulating other architectures. Who wouldn't love a RISC-V that you can turn into a passable x86 or M68K or whatever? In the past, too, there were user-programmable microcode machines intended to be a generic platform for fairly arbitrary ISAs, so it's not an entirely crazy idea. I eventually came to the conclusion that a programmable fabric in a critical path like the instruction decoder of a modern processor didn't make sense from a timing perspective, and that a classic RISC (or a VLIW like Transmeta/Denver) + JIT continued to make more sense. In hindsight I believe you can see this in x86 cores, where the microcode ROM is pretty much only exercised on already-slow paths, and the patch RAM is even more anemic. I'd imagine you'd hit the same issues.
That being said, my experiments were hardly conclusive and I'd absolutely love to be proven wrong.
This is essentially what RISC-V does with its "Compressed" instruction set, except without the dictionary switching. They pulled a bunch of statistics over real-world machine code, ran it through compression, then reverse-engineered that compression to make something a bit more sensible to a compiler writer. I think this will work out vastly better than the haphazard patching of e.g. Thumb on ARM.
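The design process, as I understand it, in toy form. The sample and the "top 4 get a short encoding" cutoff are invented; the real spec was driven by statistics over much larger corpora:

    # Rough shape of the process: count which instruction forms dominate a
    # sample of real code, then give only those a 16-bit encoding while
    # everything else stays 32 bits.
    from collections import Counter

    sample = ["addi", "lw", "sw", "addi", "add", "lw", "addi", "jal",
              "lw", "sw", "addi", "beq", "add", "lw", "addi", "mul"]

    freq = Counter(sample)
    hot = {op for op, _ in freq.most_common(4)}    # these get the short encodings

    size = sum(2 if op in hot else 4 for op in sample)
    print(f"hot ops: {sorted(hot)}")
    print(f"{size} bytes vs {4 * len(sample)} uncompressed "
          f"({100 * size // (4 * len(sample))}% of original)")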
More specifically, a core whose heart is sequenced by Tomasulo's algorithm, probably with a large bypass network linking the functional units together.
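For anyone who hasn't seen it, a very stripped-down sketch of that loop: a register status table, reservation stations, and a CDB-style broadcast standing in for the bypass network. No ROB, no memory, nothing cycle-accurate; it's just the renaming/wakeup skeleton:

    # Register status table + reservation stations + a broadcast that wakes
    # up waiting stations. Single-cycle "execution", no ROB, no memory.

    regs = {f"r{i}": i for i in range(8)}    # architectural regs, seeded with values
    producer = {}                            # reg -> tag of the RS that will write it
    stations = {}                            # tag -> reservation station entry

    def issue(tag, op, dst, src1, src2):
        def read(r):    # a value if ready, otherwise the tag we must wait for
            return ("wait", producer[r]) if r in producer else ("val", regs[r])
        stations[tag] = {"op": op, "dst": dst, "a": read(src1), "b": read(src2)}
        producer[dst] = tag                  # rename: dst is now owned by this RS

    def step():
        ready = next((t for t, rs in stations.items()
                      if rs["a"][0] == rs["b"][0] == "val"), None)
        if ready is None:
            return False
        rs = stations.pop(ready)
        a, b = rs["a"][1], rs["b"][1]
        result = a + b if rs["op"] == "add" else a * b
        if producer.get(rs["dst"]) == ready:     # still the newest writer of dst?
            regs[rs["dst"]] = result
            del producer[rs["dst"]]
        for other in stations.values():          # broadcast: wake up waiting stations
            for k in ("a", "b"):
                if other[k] == ("wait", ready):
                    other[k] = ("val", result)
        return True

    issue("RS1", "add", "r1", "r2", "r3")   # r1 = r2 + r3
    issue("RS2", "mul", "r4", "r1", "r1")   # waits on RS1's broadcast, not on r1's old value
    while step():
        pass
    print(regs["r1"], regs["r4"])            # 5 25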