Hacker News

Yes, in fact compilers these days are smart enough to convert people's xor swaps into mov swaps: https://godbolt.org/g/FYv7xQ


Looking at the code that multiple compilers (gcc, clang, icc) generate:

  mov     eax, edi
  mov     edi, esi
  mov     esi, eax

I would intuitively use the `xchg` instruction that x86-32/x86-64 provides instead. Is there a specific reason why the compilers decide to generate the sequence above instead?


When used to swap with data in memory, the reason is that the moves are faster: `xchg` with a memory operand is atomic. To guarantee that, it implicitly locks its target address, which makes it slower than the series of moves. From http://www.agner.org/optimize/instruction_tables.pdf:

"Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand."


I know that. But this is only relevant if you exchange a register with memory; it does not apply when you exchange two registers. I accept that it is a good point if some variables are moved to the stack because of register spilling, or because you take the address of a variable (which is not the case here).

So I still stand by my question: why does the compiler use `mov` for exchanging two registers here instead of `xchg`?


I think that's because (at least on some CPUs) a register-register `xchg` decodes into three macro-operations. See http://www.agner.org/optimize/microarchitecture.pdf (section 17.4, page 188):

"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:

    ; Example 17.1. AMD instruction breakdown
    xchg  eax, ebx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op
    xchg  ecx, edx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op
This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."


This is indeed a good explanation - I admit I was not aware of this detail of the K8/K10 processors.


Aren't some of those `nop`s placeholders for the debugger, for inserting hook instructions or whatnot?




