Thursday, May 30, 2024

Mixture of Experts (MoE) Architecture

An enhancement to LLMs that aligns them with the expert-models paradigm.

  • Each expert is implemented as a separate Feed Forward Network (FFN), though any other trainable model that can be optimized with backprop should work.
  • The expert FFNs are introduced in parallel at the position of the existing FFN layer, after the Attention Layer.
  • The decision of which experts each token is routed to is made by a router. 
  • The router is implemented as a linear layer followed by a Softmax, giving a probability for each expert; the top few (top-k) experts are picked per token (see the sketch after this list).
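
A minimal PyTorch sketch of the layer described above, to make the routing concrete. The names (Expert, MoELayer) and hyperparameters (d_model, d_ff, n_experts, top_k) are illustrative assumptions, not any particular library's API; this shows top-k routing with probability-weighted expert outputs, without load balancing or other production details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Expert(nn.Module):
        """One expert: a standard position-wise feed-forward network."""
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )

        def forward(self, x):
            return self.net(x)

    class MoELayer(nn.Module):
        """n_experts parallel FFNs at the position of the dense FFN after
        attention, plus a router that picks the top_k experts per token."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
            super().__init__()
            self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts))
            self.router = nn.Linear(d_model, n_experts)  # linear layer producing expert logits
            self.top_k = top_k

        def forward(self, x):
            # x: (batch, seq_len, d_model) -> flatten to a list of tokens
            batch, seq_len, d_model = x.shape
            tokens = x.reshape(-1, d_model)

            # Router: linear layer + Softmax -> probability per expert, keep the top_k
            probs = F.softmax(self.router(tokens), dim=-1)       # (n_tokens, n_experts)
            top_p, top_idx = probs.topk(self.top_k, dim=-1)      # (n_tokens, top_k)
            top_p = top_p / top_p.sum(dim=-1, keepdim=True)      # renormalize over chosen experts

            out = torch.zeros_like(tokens)
            for e, expert in enumerate(self.experts):
                # Find tokens whose top_k choices include expert e
                token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
                if token_ids.numel() == 0:
                    continue
                # Weight this expert's output by its routing probability
                out[token_ids] += top_p[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])

            return out.reshape(batch, seq_len, d_model)

    # Usage: route a batch of token embeddings through the MoE layer
    layer = MoELayer(d_model=64, d_ff=256, n_experts=8, top_k=2)
    y = layer(torch.randn(2, 10, 64))   # -> shape (2, 10, 64)

Only the top_k experts run per token, so compute per token stays roughly constant while total parameter count grows with n_experts.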