Mixture of Experts (MoE) is a horizontal scaling technique applied to the basic Transformer architecture. The Feed Forward (FFN) layer of the Transformer is replaced with an MoE layer, which is a collection of N experts (each one a separate FFN) running in parallel. The MoE layer also includes a Router with learned gating logic that decides which expert(s) each token is routed to.
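As a rough illustration of that structure, here is a minimal, self-contained PyTorch sketch of an MoE layer with top-1 routing. The class name, dimensions, and routing details are illustrative assumptions, not the code discussed below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Drop-in replacement for a Transformer FFN block: N expert FFNs plus a learned router."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # gating logic, learned end-to-end

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        tokens = x.reshape(-1, x.size(-1))         # flatten to (num_tokens, d_model)
        gate_probs = F.softmax(self.router(tokens), dim=-1)
        top1_prob, top1_idx = gate_probs.max(dim=-1)   # top-1 (Switch-style) routing
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top1_idx == e                   # tokens routed to expert e
            if mask.any():
                # scale the expert output by the gate probability so the router gets gradients
                out[mask] = top1_prob[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```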
One of the early MoE-based Transformers was the Switch Transformer (https://arxiv.org/abs/2101.03961), which uses a single-expert (top-1) MoE routing layer. The Switch Transformer specifically includes logic to balance token load across the different experts, preventing hot-spots where only a few experts end up handling a majority of the tokens. Unbalanced routing also leads to a second issue: the other experts remain largely untrained throughout training, rendering them useless for inference.
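The balancing mechanism in the paper is an auxiliary loss of the form alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. A minimal PyTorch sketch of that loss follows; the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def switch_load_balancing_loss(router_logits, expert_index, num_experts, alpha=0.01):
    """Auxiliary loss from the Switch Transformer paper: alpha * N * sum_i(f_i * P_i).

    router_logits: (num_tokens, num_experts) raw gate logits
    expert_index:  (num_tokens,) top-1 expert chosen for each token
    """
    probs = F.softmax(router_logits, dim=-1)
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    P = probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * P)
```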
There are several state-of-the-art MoE implementations available across the different ML platforms. The Keras.io examples include a Switch Transformer for text classification. The code in text_classification_switch_transformer_pytorch.py is a PyTorch port of that example, with a couple of changes to make the code modular and to resolve issues with the super().__init__() call and position_in_expert.
Further, a much simpler, combined SwitchRouter implementation is provided in SwitchTransformerUtil.SimpleSwitchRoute(). The code flow (sketched in the example after this list) is:
- Compute gateLogits, with the option to add noise during training for load balancing
- Compute the weights & selectedExperts indices of the topK experts
- Compute auxLoss, which is minimized to balance load across the experts
- Finally, for every expert, fetch its weights and invoke the expert to get the outputs
- Also drop tokens beyond the expert capacity threshold
Fairly straightforward!
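Below is an illustrative PyTorch sketch of that flow, assuming flattened token embeddings, a linear router, and a list of expert FFNs. It is not the actual SwitchTransformerUtil.SimpleSwitchRoute() code; names such as simple_switch_route and capacity_factor, and the fixed noise scale, are assumptions:

```python
import torch
import torch.nn.functional as F

def simple_switch_route(x, router, experts, top_k=1, capacity_factor=1.25, training=True):
    """Sketch of the routing flow listed above (not the repo's exact implementation).

    x:       (num_tokens, d_model) flattened token embeddings
    router:  nn.Linear(d_model, num_experts)
    experts: list / nn.ModuleList of FFN modules
    """
    num_experts = len(experts)

    # 1. Gate logits, optionally noised during training to spread the load
    gate_logits = router(x)
    if training:
        gate_logits = gate_logits + torch.randn_like(gate_logits) * 1e-2

    # 2. Weights and indices of the top-K experts per token
    probs = F.softmax(gate_logits, dim=-1)
    weights, selected = probs.topk(top_k, dim=-1)          # each (num_tokens, top_k)

    # 3. Auxiliary loss: fraction routed to each expert * mean router probability
    f = F.one_hot(selected[:, 0], num_experts).float().mean(dim=0)
    P = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(f * P)

    # 4/5. Dispatch tokens to each expert, dropping tokens beyond its capacity
    capacity = int(capacity_factor * x.size(0) * top_k / num_experts)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for k in range(top_k):
            token_ids = torch.nonzero(selected[:, k] == e, as_tuple=False).squeeze(-1)
            token_ids = token_ids[:capacity]                # drop tokens beyond capacity
            if token_ids.numel() > 0:
                out[token_ids] += weights[token_ids, k].unsqueeze(-1) * expert(x[token_ids])
    return out, aux_loss
```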
References
- https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe
- https://medium.com/@pilliudayaditya1207/understanding-mixture-of-experts-switch-transformers-load-balancing-vs-mixtral-s-natural-balance-25ed528cadfe
- https://huggingface.co/blog/NormalUhr/moe-balance
- https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts