Hacker News

You're right that parameter updates typically require full state, but during my PhD I explored some possibilities to address this limitation (unfortunately, my ideas didn't pan out in the time I had). That said, there is research that has explored this topic and made some progress, such as this paper:

https://arxiv.org/abs/2303.14177



Unfortunately it's hardly progress. The expert models are still large and have to be trained in the usual, linear way; the approach requires classifying the training set upfront; each expert model is completely independent, so each has to relearn shared concepts; the overall model is only as good as the dedicated expert it routes to; and the scale is in the low numbers, i.e. 8 experts, not thousands (otherwise you'd have to run inference on a beefed-up cluster, since experts still have to be loaded when used), etc.
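For what it's worth, the scheme being criticized (classify the training corpus upfront, train one fully independent expert per cluster, then route each query to the nearest expert) can be sketched in a few lines. Everything below is an illustrative toy, not the paper's actual code; the embedding, centroids, and expert stubs are all made up:

```python
# Toy sketch of cluster-routed independent experts (hypothetical names).
# Real systems embed with a trained encoder and route among full
# language models, not these stub functions.

def embed(text):
    # Crude bag-of-words feature: counts of a few marker words.
    markers = ["code", "law", "biology"]
    return [text.lower().count(m) for m in markers]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Centroids produced by the upfront clustering of the training set:
# one per expert, which is why the expert count is fixed at training time.
CENTROIDS = {
    "code_expert":    [1, 0, 0],
    "law_expert":     [0, 1, 0],
    "biology_expert": [0, 0, 1],
}

# Each expert is trained independently on its own data shard, so shared
# concepts (grammar, arithmetic, ...) are relearned inside every expert.
EXPERTS = {
    "code_expert":    lambda q: f"[code model answers: {q}]",
    "law_expert":     lambda q: f"[law model answers: {q}]",
    "biology_expert": lambda q: f"[biology model answers: {q}]",
}

def route(query):
    # Pick the expert whose training-cluster centroid is nearest the query;
    # that single expert's quality bounds the system's quality on this query.
    q = embed(query)
    name = min(CENTROIDS, key=lambda n: distance(q, CENTROIDS[n]))
    return name, EXPERTS[name](query)

name, answer = route("how do I refactor this code?")
print(name)  # code_expert
```

The toy makes the criticisms concrete: the expert set is frozen by the upfront clustering, and the router only ever consults one independent model per query.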



