you might be surprised. I don't know much about how it generates GPU kernels, but on the CPU it uses LoopVectorization.jl, which can (among other things) derive optimal GEMM kernels automatically. For the GPU, I believe Tullio uses KernelAbstractions.jl to generate the kernels, but I'm less familiar with the details.
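
For what it's worth, here is a minimal sketch of the CPU path, assuming Tullio.jl and LoopVectorization.jl are both installed. The array names and sizes are made up for illustration. As far as I understand it, Tullio only picks up LoopVectorization when that package is loaded before the `@tullio` macro is expanded.

```julia
using LoopVectorization  # load first so Tullio can use it for the generated loops
using Tullio

# Hypothetical inputs, sizes chosen arbitrarily for the example
A = rand(64, 32)
B = rand(32, 48)

# Matrix multiplication written as an index expression:
# k appears only on the right-hand side, so it is summed over.
@tullio C[i, j] := A[i, k] * B[k, j]

# Sanity check against the built-in (BLAS-backed) multiply
@assert C ≈ A * B
```

The same `@tullio` line should also work on GPU arrays once KernelAbstractions.jl and a GPU array package such as CUDA.jl are loaded, though I haven't checked how the generated kernel compares to a vendor GEMM.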