Another alternative that the author omitted is just casting everything to doubles and summing naively (or vectorized) in double precision, then casting the final result back to a float. Would be curious to see how this compares to the other methods and whether it’s on the Pareto frontier.
I omitted this because you only have this option for f32, not for f64. I only really chose f32 as the focus of my article because it makes the numbers a bit more readable.
That said, I should have included it, because it is on the Pareto frontier. On my machine it runs at ~28.9 GB/s with zero error (note that this doesn't mean it always produces a correctly rounded sum, just that it does on this benchmark's input).
I'll add it to the article tomorrow or the day after.
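For concreteness, the approach being discussed is just widening each f32 to f64 before accumulating, then rounding the total back down once at the end. A minimal sketch (function name is mine, not from the article):

```rust
/// Widen each f32 to f64, accumulate in double precision,
/// then round the total back to f32 once at the end.
fn sum_via_f64(xs: &[f32]) -> f32 {
    xs.iter().map(|&x| x as f64).sum::<f64>() as f32
}

fn main() {
    let xs = vec![1.0_f32; 20_000_000];
    // Naive f32 accumulation stalls once the running total reaches 2^24:
    // there the ulp is 2.0, so adding 1.0 rounds back down (ties-to-even).
    // The f64 accumulator has headroom to spare and stays exact here.
    let naive: f32 = xs.iter().sum();
    let widened = sum_via_f64(&xs);
    println!("naive = {naive}, via f64 = {widened}");
}
```

This only helps because f64 has enough extra mantissa to absorb the rounding error of summing f32 inputs; as noted above, there is no analogous wider hardware type to reach for when the inputs are already f64.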
No, it doesn't. Regardless, without hardware support, adding together intermediate f128s would be essentially equivalent to Kahan summation, except with even more work spent converting between representations.
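For reference, Kahan summation, the software baseline being compared against here, keeps a second f32 that carries the low-order bits each addition would otherwise discard, which is why it captures most of the benefit an emulated wider type would give. A standard sketch:

```rust
/// Kahan (compensated) summation: `c` accumulates the low-order
/// bits lost when a small term is added to a large running sum.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0_f32;
    let mut c = 0.0_f32; // running compensation
    for &x in xs {
        let y = x - c;     // re-inject previously lost bits
        let t = sum + y;   // big + small: low bits of y are dropped...
        c = (t - sum) - y; // ...and recovered here (algebraically zero)
        sum = t;
    }
    sum
}
```

Note that this relies on strict IEEE 754 semantics; a compiler running fast-math-style reassociation would simplify the compensation term away to zero.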