Most of our performance comes from leveraging torch.compile, which generates Triton kernels that run fast on CPU and CUDA GPUs but not on MPS, since Triton does not support generating Metal kernels. So on MPS you lose the nice story of writing low-bit code in pure PyTorch and still getting it to run fast.
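To make that concrete, here's a toy sketch (not the actual torchao code, just an illustrative low-bit-style op) of the kind of pure-PyTorch code torch.compile fuses into generated Triton kernels on CUDA (or C++ on CPU); on MPS the same code just runs eagerly, one small op at a time:

```python
import torch

# Illustrative only: unpack int4 weights, dequantize, and matmul in pure PyTorch.
# With torch.compile this whole chain gets fused on CPU/CUDA; since Triton has no
# Metal backend, MPS gets no such fused kernel and falls back to eager execution.
def toy_int4_dequant_matmul(x, packed_w, scales):
    lo = (packed_w & 0x0F).to(torch.int8) - 8   # low nibbles -> signed int4
    hi = (packed_w >> 4).to(torch.int8) - 8     # high nibbles -> signed int4
    w = torch.stack([lo, hi], dim=-1).reshape(packed_w.shape[0], -1).to(x.dtype)
    return x @ (w * scales).t()

compiled = torch.compile(toy_int4_dequant_matmul)

x = torch.randn(2, 8)
packed_w = torch.randint(0, 256, (4, 4), dtype=torch.uint8)  # 4 rows of 8 packed int4 values
scales = torch.rand(4, 1)
out = compiled(x, packed_w, scales)
```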
In these cases, the only path forward we have is writing custom Metal kernels and plugging them in. That work is still ongoing, and we'll hopefully have more to share soon.
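For reference, the plumbing for plugging in such a kernel looks roughly like the sketch below. This is not torchao's actual API; the op name, the packing scheme, and the `my_metal_ext` extension are all made up for illustration. The idea is to expose the hand-written Metal kernel as a custom op so it composes with eager MPS code and with torch.compile:

```python
import torch
from torch.library import custom_op

# Hedged sketch: a custom op with a slow pure-PyTorch CPU reference and an
# MPS kernel that would dispatch into a hand-written Metal kernel bound from
# a C++/Objective-C++ extension (hypothetical "my_metal_ext").
@custom_op("mylib::int4_mm", mutates_args=(), device_types="cpu")
def int4_mm(x: torch.Tensor, packed_w: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # reference implementation, used off-MPS and for correctness checks
    lo = (packed_w & 0x0F).to(torch.int8) - 8
    hi = (packed_w >> 4).to(torch.int8) - 8
    w = torch.stack([lo, hi], dim=-1).reshape(packed_w.shape[0], -1).to(x.dtype)
    return x @ (w * scales).t()

@int4_mm.register_kernel("mps")
def _(x, packed_w, scales):
    # this is where the custom Metal kernel would be called, e.g. from an
    # extension loaded with torch.ops.load_library(...) -- hypothetical name
    return torch.ops.my_metal_ext.int4_mm(x, packed_w, scales)

@int4_mm.register_fake
def _(x, packed_w, scales):
    # shape/dtype propagation so torch.compile can trace through the op
    return x.new_empty(x.shape[0], packed_w.shape[0])
```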
This might not be the right place for this question, but as someone who has made a couple of very modest MPS backend contributions, I'm curious: why not add Metal support to Triton (or a fork, if OpenAI won't allow it) rather than maintain a whole separate set of custom kernels?
Mostly it comes down to what's fastest to develop: it's faster to write a few custom kernels than it is to develop a new compiler backend.
Granted, once the upfront effort is paid, compilers are such a significant UX boost that you are indeed making me question why I don't spend more time working on this myself lol