Have you tried reducing the register count in your FP32 FMA test by increasing the iteration count and reducing the number of values computed per loop iteration?
Instead of computing 8 independent values, compute one with 8x more iterations:
for (int i = 0; i < count * 8; i++) {
    v0 += acc * v0;
}
That, plus making the iteration count a compile-time constant so the compiler can unroll the loop, might help you get closer to SOL (speed of light, i.e. the theoretical peak throughput).
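Roughly what I have in mind, as a plain C sketch rather than your actual kernel. The names `ITERATIONS` and `fma_chain` are made up for illustration, and I'm assuming `float` for `v0` and `acc` since this is the FP32 test:

```c
/* Hypothetical compile-time trip count. Because it is a constant,
 * the compiler knows the loop bound and is free to unroll. */
#define ITERATIONS 1024

/* Single-accumulator version: only v0 and acc are live in the loop,
 * so register pressure is far lower than with 8 independent values. */
static inline float fma_chain(float v0, float acc)
{
    for (int i = 0; i < ITERATIONS * 8; i++) {
        v0 += acc * v0;  /* multiply-add; each step depends on the previous v0 */
    }
    return v0;
}
```

One caveat: this is a single dependent chain, so each thread can only issue the next FMA once the previous one has finished. Whether it actually gets closer to peak depends on having enough threads in flight to cover the FMA latency.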