aaa370's comments

aaa370 · on July 26, 2024

I gotta see it to believe it ;)

ap4 · on Aug 1, 2024

For all doubters, I open-sourced it: https://github.com/arekpaterek/Faster_SGEMM_CUDA

ap4 · on July 27, 2024

Believe it or not.

On a 4090 gpu, average of 20 runs of SGEMM_CUDA:

  size    tflops_cublas  tflops_my  diff
  4096²   50.8-50.9      61.8       +21%
  8192²   56.3-56.4      67.1       +19%
  16384²  53.6           66.7       +24%

I guess the right thing to do now would be to hire a B2B salesman and figure out, which company needs it.

aaa370 · on July 26, 2024

Another point to consider here is that this project of writing a cuBLAS level GEMM kernel becomes much more challenging if you are doing it with fp16, and are thus competing with the cuBLAS kernels that use tensor cores. The (theoretical) arithmetic throughput of tensor cores is ~8x higher as compared to fp32 math on the Turing arch, I dont know off the top of my head but I think this ratio is the same or greater for Ampere/Hopper tensor cores.

This makes the project proportionally harder in my opinion because you need to be that much more efficient with moving data through the memory hierarchy. With tensor cores, to get anywhere close to cuBLAS, you need to start with something like the most efficient kernel in simon's article, and then do stuff like shared memory swizzling, async global memory copies, double buffering, and writing a really efficient kernel epilogue to accumulate the C matrix into the product.

I came across this article a while ago and it inspired me to take a stab at this^, and as of now I have gotten to ~80% of the cuBLAS tensor core performance where the kernel is mostly compute bound, and I am close to giving up on the last ~20%, because I think I may need to write the inner loop in SASS to make sure the instruction mix between shared memory loads, mma instructions, and synchronizations is perfectly balanced so that none of the hardware pipelines get overloaded (see link below), and I have enough compassion for myself to not spend my free time doing stuff like that :). There are also certain things implemented in CUTLASS that seem important (look up serpentine traversal) but NVIDIA engineers wont talk about the hardware details required to understand why this helps.

Article on this is forthcoming

https://github.com/NervanaSystems/maxas/wiki/SGEMM

gregjm · on July 28, 2024

Re: serpentine traversal, this has to do with the .reuse suffix applied to register operands as mentioned in your link. We don’t really have control over it because it’s happening inside of ptxas during SASS generation, but when CUTLASS does serpentine traversal they’re suggesting an order of MMA instruction issues that would result in at least one operand being reused from one instruction to the next— clever stuff.

I’d be so happy if SASS were documented and ptxas were open source, sometimes I spend entire days going through whitepapers and various sources of online documentation to get more hardware details…