Tuesday, December 2, 2025

Mixture of Experts and Switch Transformer

Mixture of Experts (MoE) is an innovative horizontal scaling technique applied to the basic Transformer architecture. The feed-forward network (FFN) layer of the Transformer is replaced with an MoE layer: a collection of N experts (each one a separate FFN) in parallel. The MoE layer also includes a router with learned gating logic that decides which expert(s) each token is routed to.
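
As a rough sketch of that structure, here is a minimal top-1 MoE layer in PyTorch (the class name MoELayer and its parameters are illustrative, not taken from any particular library):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Drop-in replacement for a Transformer FFN block: N parallel expert
    FFNs plus a learned router that sends each token to one expert (top-1)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # learned gating logic

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- flatten batch/sequence dims beforehand
        gate_probs = F.softmax(self.router(x), dim=-1)    # (tokens, experts)
        gate_weight, expert_idx = gate_probs.max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                        # tokens sent to expert i
            if mask.any():
                out[mask] = gate_weight[mask, None] * expert(x[mask])
        return out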

One of the early MoE-based Transformers was the Switch Transformer (https://arxiv.org/abs/2101.03961), whose MoE routing layer sends each token to a single expert. The Switch Transformer specifically includes logic to balance token load across the different experts, preventing hot-spots where only a few experts end up handling the majority of the tokens. Hot-spots also cause a second problem: the remaining experts stay undertrained throughout training, thereby rendering them useless for inference.
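
Concretely, the paper's balancing mechanism is an auxiliary loss, alpha * N * sum_i(f_i * P_i), where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability assigned to expert i; the loss is smallest when both are uniform across experts. A minimal PyTorch version (the function name and signature here are assumptions, not from the paper's code):

import torch
import torch.nn.functional as F

def switch_load_balancing_loss(gate_probs, expert_idx, num_experts, alpha=0.01):
    """Switch Transformer auxiliary loss: alpha * N * sum(f_i * P_i).
    gate_probs: (tokens, experts) softmax router outputs.
    expert_idx: (tokens,) index of the expert each token was routed to."""
    # f_i: fraction of tokens actually dispatched to each expert
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # P_i: mean router probability mass assigned to each expert
    p = gate_probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * p)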

There are several state-of-the-art MoE implementations available on the different ML platforms. The Keras.io examples include a Switch Transformer. The code in text_classification_switch_transformer_pytorch.py is a PyTorch port of that example, with a couple of changes to make the code modular and to resolve issues with the super().__init__() call and position_in_expert.

Further, a much simpler combined SwitchRouter implementation is provided in SwitchTransformerUtil.SimpleSwitchRoute(). The code flow is as follows (see the sketch after the list):

  • Compute gateLogits, with an option to add noise for load balancing during training
  • Compute the weights and selectedExperts indices of the topK experts
  • Compute auxLoss, to be minimized for balancing load across experts
  • Finally, for every expert, fetch its weights and invoke the expert to get the outputs
  • Also drop tokens beyond the expert-capacity threshold
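
Here is a minimal PyTorch sketch of that flow under stated assumptions (the function name simple_switch_route, the noise scale, and the capacity_factor default are illustrative and may differ from the actual SwitchTransformerUtil.SimpleSwitchRoute()):

import torch
import torch.nn.functional as F

def simple_switch_route(x, router, experts, top_k=1,
                        capacity_factor=1.25, training=True):
    """Top-k switch routing: noisy gating, aux load-balancing loss,
    per-expert dispatch, and dropping of tokens over capacity."""
    num_tokens, num_experts = x.shape[0], len(experts)

    # 1. Gate logits, with optional noise for load balancing during training
    gate_logits = router(x)
    if training:
        gate_logits = gate_logits + torch.randn_like(gate_logits) * 1e-2

    # 2. Weights and indices of the top-k experts for every token
    gate_probs = F.softmax(gate_logits, dim=-1)
    weights, selected = gate_probs.topk(top_k, dim=-1)  # each (tokens, top_k)

    # 3. Aux loss ~ N * sum(f_i * P_i): scale by alpha (~0.01) and add it to
    #    the task loss; f uses the top-1 choice, as in the Switch Transformer
    f = F.one_hot(selected[:, 0], num_experts).float().mean(dim=0)
    aux_loss = num_experts * torch.sum(f * gate_probs.mean(dim=0))

    # 4./5. Dispatch tokens to experts, dropping any beyond expert capacity
    capacity = int(capacity_factor * num_tokens * top_k / num_experts)
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        for k in range(top_k):
            token_ids = torch.nonzero(selected[:, k] == i).squeeze(-1)
            token_ids = token_ids[:capacity]  # tokens over capacity are dropped
            if token_ids.numel() > 0:
                out[token_ids] += weights[token_ids, k, None] * expert(x[token_ids])
    return out, aux_loss

Tokens dropped for capacity simply pass through as zeros here; in the full Transformer block the residual connection carries them forward unchanged, which is how the Switch Transformer paper handles overflow as well.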

Fairly straightforward!

References

  • https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe
  • https://medium.com/@pilliudayaditya1207/understanding-mixture-of-experts-switch-transformers-load-balancing-vs-mixtral-s-natural-balance-25ed528cadfe
  • https://huggingface.co/blog/NormalUhr/moe-balance
  • https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts
