Hacker News

You're right that parameter updates typically require full state, but during my PhD I explored some possibilities to address this limitation (unfortunately, my ideas didn't pan out in the time I had). That said, there is research that has explored this topic and made some progress, such as this paper:

https://arxiv.org/abs/2303.14177



Unfortunately it's hardly progress. The expert models are still large and have to be trained in the usual, linear way; the approach requires classifying the training set upfront; each expert model is completely independent, so each has to relearn shared concepts; the overall model is only as good as the dedicated expert it routes to; and the scale is in the low numbers, i.e. 8 experts, not thousands (otherwise you'd have to run inference on a beefed-up cluster, since experts still have to be loaded when used), etc.
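For what it's worth, the scheme being criticized (classify the training corpus upfront, train one fully independent expert per cluster, then route each query to the nearest expert) can be sketched in a few lines. Everything below is an illustrative toy, not the paper's actual code; the embedding, centroids, and expert stubs are all made up:

```python
# Toy sketch of cluster-routed independent experts (hypothetical names).
# Real systems embed with a trained encoder and route among full
# language models, not these stub functions.

def embed(text):
    # Crude bag-of-words feature: counts of a few marker words.
    markers = ["code", "law", "biology"]
    return [text.lower().count(m) for m in markers]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Centroids produced by the upfront clustering of the training set:
# one per expert, which is why the expert count is fixed at training time.
CENTROIDS = {
    "code_expert":    [1, 0, 0],
    "law_expert":     [0, 1, 0],
    "biology_expert": [0, 0, 1],
}

# Each expert is trained independently on its own data shard, so shared
# concepts (grammar, arithmetic, ...) are relearned inside every expert.
EXPERTS = {
    "code_expert":    lambda q: f"[code model answers: {q}]",
    "law_expert":     lambda q: f"[law model answers: {q}]",
    "biology_expert": lambda q: f"[biology model answers: {q}]",
}

def route(query):
    # Pick the expert whose training-cluster centroid is nearest the query;
    # that single expert's quality bounds the system's quality on this query.
    q = embed(query)
    name = min(CENTROIDS, key=lambda n: distance(q, CENTROIDS[n]))
    return name, EXPERTS[name](query)

name, answer = route("how do I refactor this code?")
print(name)  # code_expert
```

The toy makes the criticisms concrete: the expert set is frozen by the upfront clustering, and the router only ever consults one independent model per query.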



