Another alternative that the author omitted is just casting everything to doubles and summing naively (or vectorized) in double precision, then casting the final result back to a float. Would be curious to see how this compares to the other methods and whether it’s on the Pareto frontier.
I omitted this because you only have this option for f32, not for f64. I only really chose f32 as the focus of my article because it makes the numbers a bit more readable.
That said, I should have included it, because it is on the Pareto frontier. On my machine it runs at ~28.9 GB/s with zero error (note that this doesn't mean it always produces a correctly rounded sum, just that it does on this benchmark's input).
I'll add it to the article tomorrow or the day after.
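For concreteness, the approach being discussed is just widening each f32 to f64 before accumulating, then rounding the total back down once at the end. A minimal sketch (function name is mine, not from the article):

```rust
/// Widen each f32 to f64, accumulate in double precision,
/// then round the total back to f32 once at the end.
fn sum_via_f64(xs: &[f32]) -> f32 {
    xs.iter().map(|&x| x as f64).sum::<f64>() as f32
}

fn main() {
    let xs = vec![1.0_f32; 20_000_000];
    // Naive f32 accumulation stalls once the running total reaches 2^24:
    // there the ulp is 2.0, so adding 1.0 rounds back down (ties-to-even).
    // The f64 accumulator has headroom to spare and stays exact here.
    let naive: f32 = xs.iter().sum();
    let widened = sum_via_f64(&xs);
    println!("naive = {naive}, via f64 = {widened}");
}
```

This only helps because f64 has enough extra mantissa to absorb the rounding error of summing f32 inputs; as noted above, there is no analogous wider hardware type to reach for when the inputs are already f64.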
No, it doesn't. Regardless, without hardware support, adding together intermediate f128s would be essentially equivalent to Kahan summation, except with even more work spent converting between representations.
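For reference, Kahan summation, the software baseline being compared against here, keeps a second f32 that carries the low-order bits each addition would otherwise discard, which is why it captures most of the benefit an emulated wider type would give. A standard sketch:

```rust
/// Kahan (compensated) summation: `c` accumulates the low-order
/// bits lost when a small term is added to a large running sum.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0_f32;
    let mut c = 0.0_f32; // running compensation
    for &x in xs {
        let y = x - c;     // re-inject previously lost bits
        let t = sum + y;   // big + small: low bits of y are dropped...
        c = (t - sum) - y; // ...and recovered here (algebraically zero)
        sum = t;
    }
    sum
}
```

Note that this relies on strict IEEE 754 semantics; a compiler running fast-math-style reassociation would simplify the compensation term away to zero.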