I think MoE models are trained jointly, end to end, just like any other network, and that includes the dispatcher (router) layer, which has to learn which expert to route each token to. Perhaps you could instead train the experts separately, accepting that this is technically a worse architecture, and then train a more complex dispatcher on top that learns to use the individually trained experts as best it can?
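To make the second idea concrete, here is a rough toy sketch, not how any real MoE is built. Everything in it is made up for illustration: the two "experts" are fixed functions standing in for separately trained models, the target is `|x|`, and the dispatcher is a single sigmoid gate trained with hand-written gradients. Only the gate's parameters `w` and `b` are ever updated.

```python
import math

# Two hypothetical "experts", standing in for separately trained models.
# They are frozen: nothing below ever updates them.
def expert_pos(x):
    return x       # matches the target |x| when x >= 0

def expert_neg(x):
    return -x      # matches the target |x| when x < 0

def gate(x, w, b):
    """Dispatcher: soft weight given to expert_pos for input x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def predict(x, w, b):
    g = gate(x, w, b)
    return g * expert_pos(x) + (1.0 - g) * expert_neg(x)

def mean_loss(xs, w, b):
    return sum((predict(x, w, b) - abs(x)) ** 2 for x in xs) / len(xs)

def train_router(xs, lr=0.1, epochs=200):
    """Plain SGD on the gate parameters only; the experts stay fixed."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x in xs:
            g = gate(x, w, b)
            out = g * expert_pos(x) + (1.0 - g) * expert_neg(x)
            # Chain rule for squared error: loss -> output -> gate -> logit.
            d_out = 2.0 * (out - abs(x))
            d_gate = d_out * (expert_pos(x) - expert_neg(x))
            d_logit = d_gate * g * (1.0 - g)
            w -= lr * d_logit * x
            b -= lr * d_logit
    return w, b

xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
initial_loss = mean_loss(xs, 0.0, 0.0)
w, b = train_router(xs)
final_loss = mean_loss(xs, w, b)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
print(f"gate(+2) = {gate(2.0, w, b):.3f}, gate(-2) = {gate(-2.0, w, b):.3f}")
```

The gate should end up favoring `expert_pos` for positive inputs and `expert_neg` for negative ones, without either expert having been touched. A real MoE layer differs in ways this skips entirely: routing is per token, usually top-k over many experts, and the experts' internal representations were shaped alongside the router during joint training, which is probably the part a bolted-on dispatcher would struggle to make up for.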