An enhancement to LLMs that aligns them with the expert-models paradigm.
- Each expert is implemented as a separate Feed Forward Network (FFN), though any other ML model trainable with backpropagation should also work.
- The expert FFNs are introduced in parallel to the existing FFN layer after the Attention Layer.
- A router decides which experts each token is routed to.
- The router is implemented as a linear layer followed by a Softmax, which gives a probability for each expert; the few experts with the highest probabilities are picked.
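
The routing described above can be sketched in plain Python for a single token vector. This is a minimal illustration, not a reference implementation. The names `route`, `ffn`, `expert_layer` and `random_matrix` are hypothetical, as are the sizes and the ReLU activation. The notes do not say how the outputs of the chosen experts are combined, so the sketch assumes a weighted sum using the router probabilities renormalized over the chosen experts. The existing FFN that runs in parallel is left out.

```python
import math
import random

def linear(x, weights, bias):
    """Linear layer for one token: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, router_w, router_b, top_k):
    """Router: linear layer, then Softmax, then pick the top_k experts."""
    probs = softmax(linear(x, router_w, router_b))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return chosen, probs

def ffn(x, w1, b1, w2, b2):
    """One expert: a two-layer FFN (ReLU is an assumption of this sketch)."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

def expert_layer(x, router_w, router_b, experts, top_k=2):
    """Send one token to its top_k experts and combine their outputs.

    Assumption: outputs are summed, weighted by the router probabilities
    renormalized over the chosen experts.
    """
    chosen, probs = route(x, router_w, router_b, top_k)
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        gate = probs[i] / norm
        y = ffn(x, *experts[i])
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Demo with made-up sizes: 4 experts, model width 4, hidden width 8.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
router_w = random_matrix(n_experts, d_model, rng)
router_b = [0.0] * n_experts
experts = [
    (random_matrix(d_hidden, d_model, rng), [0.0] * d_hidden,
     random_matrix(d_model, d_hidden, rng), [0.0] * d_model)
    for _ in range(n_experts)
]
token = [0.1, -0.2, 0.3, 0.4]
output, chosen = expert_layer(token, router_w, router_b, experts, top_k=2)
print("chosen experts:", chosen)
print("output:", output)
```

Only the chosen experts are evaluated for a given token, so the work per token depends on `top_k` rather than on the total number of experts.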