Thursday, May 30, 2024

Mixture of Experts (MoE) Architecture

An enhancement to LLMs that aligns them with the expert-models paradigm.

  • Each expert is implemented as a separate Feed Forward Network (FFN), though any other trainable model that can be optimized with backprop should work.
  • The expert FFNs are introduced in parallel at the position of the existing FFN layer, after the Attention Layer.
  • The decision of which experts each token is routed to is made by a router. 
  • The router is implemented as a linear layer followed by a Softmax, giving a probability for each expert; the top few (top-k) experts are picked per token (see the sketch after this list).
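
A minimal PyTorch sketch of the layer described above, to make the routing concrete. The names (Expert, MoELayer) and hyperparameters (d_model, d_ff, n_experts, top_k) are illustrative assumptions, not any particular library's API; this shows top-k routing with probability-weighted expert outputs, without load balancing or other production details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Expert(nn.Module):
        """One expert: a standard position-wise feed-forward network."""
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )

        def forward(self, x):
            return self.net(x)

    class MoELayer(nn.Module):
        """n_experts parallel FFNs at the position of the dense FFN after
        attention, plus a router that picks the top_k experts per token."""
        def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 2):
            super().__init__()
            self.experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(n_experts))
            self.router = nn.Linear(d_model, n_experts)  # linear layer producing expert logits
            self.top_k = top_k

        def forward(self, x):
            # x: (batch, seq_len, d_model) -> flatten to a list of tokens
            batch, seq_len, d_model = x.shape
            tokens = x.reshape(-1, d_model)

            # Router: linear layer + Softmax -> probability per expert, keep the top_k
            probs = F.softmax(self.router(tokens), dim=-1)       # (n_tokens, n_experts)
            top_p, top_idx = probs.topk(self.top_k, dim=-1)      # (n_tokens, top_k)
            top_p = top_p / top_p.sum(dim=-1, keepdim=True)      # renormalize over chosen experts

            out = torch.zeros_like(tokens)
            for e, expert in enumerate(self.experts):
                # Find tokens whose top_k choices include expert e
                token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
                if token_ids.numel() == 0:
                    continue
                # Weight this expert's output by its routing probability
                out[token_ids] += top_p[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])

            return out.reshape(batch, seq_len, d_model)

    # Usage: route a batch of token embeddings through the MoE layer
    layer = MoELayer(d_model=64, d_ff=256, n_experts=8, top_k=2)
    y = layer(torch.randn(2, 10, 64))   # -> shape (2, 10, 64)

Only the top_k experts run per token, so compute per token stays roughly constant while total parameter count grows with n_experts.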