
Have you tried reducing the register count in your FP32 FMA test by increasing the iteration count and reducing the number of values computed per loop?

Instead of computing 8 independent values, compute one with 8x more iterations:

    for (int i = 0; i < count * 8; i++) {
        v0 += acc * v0;
    }

That, plus hard-coding the iteration count as a compile-time constant so the compiler can unroll the loop, might help get closer to SOL (speed of light).
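A minimal sketch of that suggestion, written as plain C rather than the GPU kernel under discussion (which isn't shown in this thread). `BASE_COUNT`, `CHAINS`, and both function names are made up for illustration. The first function takes the trip count at run time; the second bakes it in as a constant, which is what gives a compiler the option to unroll. Whether a given compiler actually does so is not guaranteed.

```c
#include <assert.h>

/* Hypothetical sizes, chosen only for illustration. */
#define BASE_COUNT 1024   /* iterations per value in the original 8-value test */
#define CHAINS     8      /* number of independent values being folded into one */

/* Trip count known only at run time: the compiler has to keep the loop
 * (or emit a generic unrolled loop plus a remainder loop). */
float fma_chain_runtime(float v0, float acc, int iters) {
    assert(iters >= 0);
    for (int i = 0; i < iters; i++) {
        v0 += acc * v0;   /* one dependent multiply-add, single live accumulator */
    }
    return v0;
}

/* Trip count is a compile-time constant: same arithmetic, but the compiler
 * now knows the exact bound and is free to unroll. */
float fma_chain_fixed(float v0, float acc) {
    for (int i = 0; i < BASE_COUNT * CHAINS; i++) {
        v0 += acc * v0;
    }
    return v0;
}
```

Both versions do the same arithmetic; for example, `fma_chain_runtime(1.0f, 1.0f, 3)` doubles the value three times. Keep in mind that a single dependent chain only keeps the FMA units busy if enough threads are in flight to hide the instruction latency.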


The problem is loop overhead matters on AMD, because AMD's compiler doesn't unroll the loop. Nvidia's does, so it doesn't matter for them.


Could you force the unroll with #pragma unroll?
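For reference, a rough sketch of what an explicit unroll hint looks like, again in plain C and not the actual benchmark kernel. The `UNROLL_8` macro and the function name are hypothetical. Clang accepts `#pragma unroll N` and GCC accepts `#pragma GCC unroll N`; CUDA also documents `#pragma unroll` placed directly before a loop. Whether AMD's compiler honors such a hint for this kernel is not established in the thread.

```c
#include <assert.h>

/* Pick the unroll pragma spelling the host compiler understands.
 * Clang also defines __GNUC__, so it has to be checked first. */
#if defined(__clang__)
#define UNROLL_8 _Pragma("unroll 8")
#elif defined(__GNUC__)
#define UNROLL_8 _Pragma("GCC unroll 8")
#else
#define UNROLL_8   /* unknown compiler: no hint */
#endif

/* Single accumulator, with a request (not a guarantee) to unroll 8x.
 * The hint must sit immediately before the loop it applies to. */
float fma_chain_unrolled(float v0, float acc, int iters) {
    assert(iters >= 0);
    UNROLL_8
    for (int i = 0; i < iters; i++) {
        v0 += acc * v0;
    }
    return v0;
}
```

The pragma is only a hint, so it is worth checking the generated assembly to confirm the loop body was actually replicated.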



