In case anyone is wondering why RCP[PS]S and RSQRT[PS]S are specified to give only a relatively small number of bits of accuracy (Intel documents a maximum relative error of 1.5 * 2^-12 for both, i.e. roughly 12 bits), it's because they use the fast inverse square root algorithm in hardware.
The IEEE754 floating point representation gives you an easy way to roughly convert a number to its log2. The exponent gives you exactly the integer part of the log2, and the mantissa gives you a fractional linear term that you can drop onto the integer part to get closer to the actual log2. This lets you do some fun things fairly easily:
x -> log2(x) -> -log2(x) -> 1/x
x -> log2(x) -> -0.5 log2(x) -> 1/sqrt(x)
x -> log2(x) -> 0.5 log2(x) -> sqrt(x)
The lossiness of the conversion limits your precision, but there are a lot of times you don't give a shit. So the instructions are still valuable even if they're only approximately correct. For the reciprocal case, it will do something like this:
https://godbolt.org/z/n6M4az
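In case the link dies: a minimal C sketch of the same trick (function names are my own; 0x7F000000 is the constant mentioned downthread, and 0x5F400000 is the untuned log-domain analogue of Quake's famous 0x5F3759DF):

    #include <stdint.h>
    #include <string.h>

    /* Reinterpret the float's bits as an integer. For x = m * 2^e with
       m in [1,2), the bits are ((e + 127) << 23) | mantissa, which is
       approximately 2^23 * (log2(x) + 127): a fixed-point log2. */
    static uint32_t as_bits(float x)  { uint32_t i; memcpy(&i, &x, sizeof i); return i; }
    static float as_float(uint32_t i) { float x; memcpy(&x, &i, sizeof x); return x; }

    /* 1/x: negation in the log domain.
       bits(1/x) ~= 2*(127 << 23) - bits(x) = 0x7F000000 - bits(x). */
    float approx_recip(float x) { return as_float(0x7F000000u - as_bits(x)); }

    /* 1/sqrt(x): multiply by -1/2 in the log domain.
       bits(1/sqrt(x)) ~= (3/2)*(127 << 23) - bits(x)/2. */
    float approx_rsqrt(float x) { return as_float(0x5F400000u - (as_bits(x) >> 1)); }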
I'm not sure where the nondeterminism between AMD and Intel comes in. As you can see, with the reciprocal there's no tuned magic constant like there is with inverse square root. Maybe they're fudging something, or have a different form of Newton's method, I dunno.
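For what it's worth, the textbook Newton-Raphson refinement steps (which is what I'd guess either vendor bolts onto a rough initial estimate) look like this in C, with each step roughly doubling the number of correct bits:

    /* Refine y ~= 1/x: Newton on f(y) = 1/y - x gives y' = y*(2 - x*y). */
    float refine_recip(float x, float y) { return y * (2.0f - x * y); }

    /* Refine y ~= 1/sqrt(x): y' = y*(1.5 - 0.5*x*y*y), the step from the
       famous Quake code. */
    float refine_rsqrt(float x, float y) { return y * (1.5f - 0.5f * x * y * y); }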
Approximations implemented in HW generally use different techniques than ones in SW. They typically start with a lookup table, possibly followed by some other steps. Someone reverse-engineered the old AMD 3DNow implementations here:
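For a flavor of what a table-driven estimate might look like (purely illustrative: the table size, indexing, and rounding here are my guesses, not the reverse-engineered 3DNow circuit):

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    #define LUT_BITS 7
    static float lut[1 << LUT_BITS];

    /* Tabulate 1/m for the midpoint of each mantissa bucket; call once. */
    void init_recip_lut(void) {
        for (int j = 0; j < (1 << LUT_BITS); j++)
            lut[j] = (float)(1.0 / (1.0 + (j + 0.5) / (1 << LUT_BITS)));
    }

    /* Estimate 1/x for x = m * 2^e, m in [1,2): look up ~1/m by the top
       mantissa bits, then negate the exponent. (Zero, infinity, NaN and
       denormals are ignored for brevity.) */
    float recip_est(float x) {
        uint32_t i; memcpy(&i, &x, sizeof i);
        int e = (int)((i >> 23) & 0xFF) - 127;
        uint32_t idx = (i >> (23 - LUT_BITS)) & ((1u << LUT_BITS) - 1);
        return ldexpf(lut[idx], -e);
    }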
It is utterly bizarre to me that a table lookup is preferable to integer subtraction. The f32 registers are already hooked up to i32 ALUs. A table lookup requires die space, while mine requires none, and the i32 subtraction is already heavily optimized.
An initial thought I had was that their table lookup is there to find a magic constant that might nudge the final value in the right direction (instead of the single magic constant 0x7F000000 that my code boils down to). But that doesn't seem to be what my Kaby Lake is doing.
> The f32 registers are already hooked up to i32 ALUs.
Technically they are. Practically, many CPUs, especially older ones, have a non-trivial latency cost for passing a value between the FP and integer ALUs, like a couple of cycles in each direction.
Ever wondered why there are 3 sets of bitwise instructions, e.g. pandn, andnps, andnpd, which do exactly the same thing with the bits in these registers? Now you know.
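Concretely, the three flavors map to three intrinsics operating on the same 128-bit registers; compilers pick the one matching the data type so the result stays in the right execution domain:

    #include <immintrin.h>

    /* All three compute (~a) & b on the same XMM register bits. */
    __m128i andnot_i(__m128i a, __m128i b) { return _mm_andnot_si128(a, b); } /* pandn  */
    __m128  andnot_f(__m128  a, __m128  b) { return _mm_andnot_ps(a, b);    } /* andnps */
    __m128d andnot_d(__m128d a, __m128d b) { return _mm_andnot_pd(a, b);    } /* andnpd */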
They're basically the same thing; the integer-aliasing approach is just a lookup table with one element in it, while an actual lookup table has more, giving better precision for roughly the same cost (or even better, since you don't need to load the constant into a register).