When I look at the code that several compilers (gcc, clang, icc) generate for swapping two registers
mov eax, edi
mov edi, esi
mov esi, eax
I would intuitively use the `xchg` instruction that x86-32/x86-64 provides instead. Is there a specific reason why the compilers decide to generate the three-`mov` sequence above instead?
When used to swap with data in memory, the reason is that the `mov` sequence is faster: `xchg` with a memory operand is atomic, and to guarantee that it implicitly locks its target address, which makes it slower than a series of moves. From http://www.agner.org/optimize/instruction_tables.pdf:
"Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand."
I know that, but this is only relevant when exchanging a register with memory; it does not apply when exchanging two registers.
I accept that this is a good point if some variables are moved to the stack because of register spilling, or because the address of a variable is taken (which is not the case here).
So I still stand by my point: What is the reason why the compiler uses `mov` for exchanging two registers here instead of `xchg`?
For the register-register case, Agner Fog's microarchitecture manual (the source of Example 17.1 below) notes:
"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:
; Example 17.1. AMD instruction breakdown
xchg eax, ebx ; Vector path, 3 ops
nop ; Direct path, 1 op
xchg ecx, edx ; Vector path, 3 ops
nop ; Direct path, 1 op
This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."