In case anyone is wondering why RCP[PS]S and RSQRT[PS]S are specified to give only a relatively small number of bits of accuracy (Intel documents a maximum relative error of 1.5 * 2^-12 for both, i.e. roughly 12 bits), it's because they use the fast inverse square root algorithm in hardware.
The IEEE754 floating point representation gives you an easy way to roughly convert a number to its log2. The exponent gives you exactly the integer part of the log2, and the mantissa gives you a fractional linear term that you can drop onto the integer part to get closer to the actual log2. This lets you do some fun things fairly easily:
x -> log2(x) -> -log2(x) -> 1/x
x -> log2(x) -> -0.5 log2(x) -> 1/sqrt(x)
x -> log2(x) -> 0.5 log2(x) -> sqrt(x)
The lossiness of the conversion limits your precision, but there are a lot of times you don't give a shit. So the instructions are still valuable even if they're only approximately correct. For the reciprocal case, it will do something like this:
https://godbolt.org/z/n6M4az
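In case the link dies: a minimal C sketch of the same trick (function names are my own; 0x7F000000 is the constant mentioned downthread, and 0x5F400000 is the untuned log-domain analogue of Quake's famous 0x5F3759DF):

    #include <stdint.h>
    #include <string.h>

    /* Reinterpret the float's bits as an integer. For x = m * 2^e with
       m in [1,2), the bits are ((e + 127) << 23) | mantissa, which is
       approximately 2^23 * (log2(x) + 127): a fixed-point log2. */
    static uint32_t as_bits(float x)  { uint32_t i; memcpy(&i, &x, sizeof i); return i; }
    static float as_float(uint32_t i) { float x; memcpy(&x, &i, sizeof x); return x; }

    /* 1/x: negation in the log domain.
       bits(1/x) ~= 2*(127 << 23) - bits(x) = 0x7F000000 - bits(x). */
    float approx_recip(float x) { return as_float(0x7F000000u - as_bits(x)); }

    /* 1/sqrt(x): multiply by -1/2 in the log domain.
       bits(1/sqrt(x)) ~= (3/2)*(127 << 23) - bits(x)/2. */
    float approx_rsqrt(float x) { return as_float(0x5F400000u - (as_bits(x) >> 1)); }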
I'm not sure where the nondeterminism between AMD and Intel comes in. As you can see, with the reciprocal there's no tuned magic constant like there is with inverse square root. Maybe they're fudging something, or have a different form of Newton's method, I dunno.
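For what it's worth, the textbook Newton-Raphson refinement steps (which is what I'd guess either vendor bolts onto a rough initial estimate) look like this in C, with each step roughly doubling the number of correct bits:

    /* Refine y ~= 1/x: Newton on f(y) = 1/y - x gives y' = y*(2 - x*y). */
    float refine_recip(float x, float y) { return y * (2.0f - x * y); }

    /* Refine y ~= 1/sqrt(x): y' = y*(1.5 - 0.5*x*y*y), the step from the
       famous Quake code. */
    float refine_rsqrt(float x, float y) { return y * (1.5f - 0.5f * x * y * y); }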
Approximations implemented in HW generally use different techniques than ones in SW. They typically start with a lookup table, possibly followed by some other steps. Someone reverse-engineered the old AMD 3DNow implementations here:
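For a flavor of what a table-driven estimate might look like (purely illustrative: the table size, indexing, and rounding here are my guesses, not the reverse-engineered 3DNow circuit):

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    #define LUT_BITS 7
    static float lut[1 << LUT_BITS];

    /* Tabulate 1/m for the midpoint of each mantissa bucket; call once. */
    void init_recip_lut(void) {
        for (int j = 0; j < (1 << LUT_BITS); j++)
            lut[j] = (float)(1.0 / (1.0 + (j + 0.5) / (1 << LUT_BITS)));
    }

    /* Estimate 1/x for x = m * 2^e, m in [1,2): look up ~1/m by the top
       mantissa bits, then negate the exponent. (Zero, infinity, NaN and
       denormals are ignored for brevity.) */
    float recip_est(float x) {
        uint32_t i; memcpy(&i, &x, sizeof i);
        int e = (int)((i >> 23) & 0xFF) - 127;
        uint32_t idx = (i >> (23 - LUT_BITS)) & ((1u << LUT_BITS) - 1);
        return ldexpf(lut[idx], -e);
    }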
It is utterly bizarre to me that a table lookup is preferable to integer subtraction. The f32 registers are already hooked up to i32 ALUs. A table lookup requires die space, while mine requires none, and the i32 subtraction is already heavily optimized.
An initial thought I had was that their table lookup is there to find a magic constant that might nudge the final value in the right direction (instead of the single magic constant 0x7F000000 that my code boils down to). But that doesn't seem to be what my Kaby Lake is doing.
> The f32 registers are already hooked up to i32 ALUs.
Technically they are. Practically, many CPUs, especially older ones, have a non-trivial latency cost for passing a value between the FP and integer ALUs, like a couple of cycles in each direction.
Ever wondered why there are 3 sets of bitwise instructions, e.g. pandn, andnps, andnpd, which do exactly the same thing with the bits in these registers? Now you know.
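Concretely, the three flavors map to three intrinsics operating on the same 128-bit registers; compilers pick the one matching the data type so the result stays in the right execution domain:

    #include <immintrin.h>

    /* All three compute (~a) & b on the same XMM register bits. */
    __m128i andnot_i(__m128i a, __m128i b) { return _mm_andnot_si128(a, b); } /* pandn  */
    __m128  andnot_f(__m128  a, __m128  b) { return _mm_andnot_ps(a, b);    } /* andnps */
    __m128d andnot_d(__m128d a, __m128d b) { return _mm_andnot_pd(a, b);    } /* andnpd */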
They're basically the same thing; the integer-aliasing approach is just a lookup table with one element in it, while an actual lookup table has more, giving better precision for roughly the same cost (or even better, since you don't need to load the constant into a register).