
It's highly unlikely that the performance will be great on the GPU if it generates its own kernels. cuTENSOR has been hand-optimized for exactly these kinds of operations, in ways that an automatically emitted kernel can't match.


On the GPU, operations which look like matrix multiplication are indeed quite slow. As you say, there are many tricks, and Tullio doesn't (right now) know them. For operations which are more like broadcasting, it does much better.
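
For a sense of the distinction, here is a minimal sketch in Tullio notation (array sizes made up): a contraction, where a repeated index is summed and the GPU path is currently slow, versus a purely elementwise expression, which is broadcast-like and fares much better.

    using Tullio

    A, B = rand(100, 100), rand(100, 100)
    b = rand(100)

    # k appears only on the right, so it is summed over: matmul-like, the slow GPU case.
    @tullio C[i, j] := A[i, k] * B[k, j]

    # No summed index: elementwise / broadcast-like, which does much better.
    @tullio D[i, j] := A[i, j] * exp(-b[j])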

On the CPU, the situation is much better. With some help from LoopVectorization.jl (which optimises micro-kernels), it will often beat OpenBLAS at matrix multiplication. The best-case scenario is an operation which would otherwise be permutedims plus matrix multiplication; by fusing those steps, it will often be several times faster. A sketch of that case is below.
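
Sizes and the particular contraction here are invented for illustration: one Tullio call versus the decomposition into permutedims, reshape, and a BLAS matmul.

    using Tullio, LoopVectorization  # loading LoopVectorization lets Tullio use its micro-kernels

    A = rand(50, 60, 70)   # A[i, k, l]
    B = rand(60, 40)       # B[k, j]

    # One fused loop nest: contract over k, with the output axes in a permuted order.
    @tullio C[j, l, i] := A[i, k, l] * B[k, j]

    # The decomposed equivalent: permutedims + reshape + matmul + reshape + permutedims.
    Ap = permutedims(A, (2, 1, 3))                            # (k, i, l)
    M  = reshape(Ap, size(A, 2), :)                           # (k, i*l)
    T  = reshape(B' * M, size(B, 2), size(A, 1), size(A, 3))  # (j, i, l)
    C2 = permutedims(T, (1, 3, 2))                            # (j, l, i)

    C ≈ C2   # true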

The notation above is shared by some other packages. TensorOperations.jl always decomposes expressions into known kernels (including on the GPU), and OMEinsum.jl usually does so (with a fallback to loops); both are more like einsum. TensorCast.jl is more like einops, just notation for writing reshape/permute/slice operations.
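
A rough side-by-side sketch of the shared notation (each line assumes the corresponding package is installed; exact behaviour is as documented by each package):

    using Tullio, TensorOperations, OMEinsum, TensorCast

    A, B = rand(10, 20), rand(20, 30)

    @tullio C1[i, j] := A[i, k] * B[k, j]   # Tullio: writes and compiles its own loops
    @tensor C2[i, j] := A[i, k] * B[k, j]   # TensorOperations: decomposes to known kernels
    C3 = ein"ik,kj->ij"(A, B)               # OMEinsum: einsum-style string notation
    @cast D[j, i] := A[i, j] + 1            # TensorCast: reshape/permute/broadcast, no contraction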


You might be surprised. I don't know a lot about how it generates GPU kernels, but on the CPU it uses LoopVectorization.jl, which is able to (among other things) derive optimal gemm micro-kernels automatically. For the GPU, I believe Tullio uses KernelAbstractions.jl to generate the kernels, but I'm less familiar with the details.
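
On the CPU side, the usual demonstration is a plain triple loop which @turbo turns into a tiled, unrolled, SIMD gemm micro-kernel; roughly along these lines (sizes made up):

    using LoopVectorization

    # A naive triple loop; @turbo chooses the unrolling, tiling, and SIMD strategy,
    # which is how near-BLAS gemm performance can come out of generated code.
    function mygemm!(C, A, B)
        @turbo for m in axes(A, 1), n in axes(B, 2)
            acc = zero(eltype(C))
            for k in axes(A, 2)
                acc += A[m, k] * B[k, n]
            end
            C[m, n] = acc
        end
        return C
    end

    A, B = rand(200, 300), rand(300, 100)
    C = similar(A, 200, 100)
    mygemm!(C, A, B) ≈ A * B   # true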



