Algorithms, Design, Code and more: Large Language Model

Thursday, May 30, 2024

Each expert is implemented as a separate Feed Forward Network (FFN), though any other ML model trainable with backpropagation should also work.
The expert FFNs are introduced in parallel to the existing FFN layer after the Attention Layer.
A router decides which experts each token is routed to.
The router is implemented as a linear layer followed by a Softmax, which gives a probability for each expert; the top few experts are then picked.
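
A minimal sketch of this routing in plain Python may help make it concrete. It is illustrative only: the names (`route`, `make_expert`, `moe_layer`), the tiny dimensions, the ReLU activation, the choice of `top_k=2`, and renormalizing the selected probabilities before mixing the expert outputs are all assumptions, not details from this post. How the mixed output is combined with the existing FFN layer is left out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Linear layer: one weight row (and one bias) per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def route(token, router_w, router_b, top_k=2):
    """Router: linear layer -> softmax -> keep the top_k experts.

    Returns (expert_index, weight) pairs. Renormalizing the kept
    probabilities so they sum to 1 is an assumption of this sketch.
    """
    probs = softmax(linear(router_w, router_b, token))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def make_expert(d_model, d_hidden, rng):
    """One expert: a small two-layer FFN with ReLU (assumed activation)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    b1, b2 = [0.0] * d_hidden, [0.0] * d_model

    def expert(x):
        hidden = [max(0.0, h) for h in linear(w1, b1, x)]
        return linear(w2, b2, hidden)

    return expert


def moe_layer(token, experts, router_w, router_b, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    out = [0.0] * len(token)
    for idx, weight in route(token, router_w, router_b, top_k):
        y = experts[idx](token)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
router_w = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
router_b = [0.0] * n_experts

token = [0.5, -1.0, 0.25, 2.0]  # hidden state of one token after attention
selected = route(token, router_w, router_b)
output = moe_layer(token, experts, router_w, router_b)
```

Only the experts in `selected` are evaluated for a given token, which is why the compute per token stays roughly constant as more experts are added.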