- Ensemble of a weak (draft) model and a strong (target) model
- The weak model takes a quick first pass at inference, cheaply proposing a short run of candidate tokens
- The strong but slow model then catches up: it scores the weak model's proposals in a single pass, accepting or rejecting each one to produce the final output
- Overall this makes LLM inference faster and cheaper, without changing the output distribution
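The accept/reject loop above can be sketched with toy distributions in place of real models (the `draft_dist`/`target_dist` functions and the tiny vocabulary are hypothetical stand-ins, purely for illustration):

```python
import random

random.seed(0)

VOCAB = [0, 1, 2, 3]

# Toy stand-ins for the two models: each maps a context to a
# next-token distribution over VOCAB. (Hypothetical, for illustration.)
def draft_dist(context):
    # Cheap "weak" model: a slightly skewed distribution.
    return [0.4, 0.3, 0.2, 0.1]

def target_dist(context):
    # Expensive "strong" model: the distribution we want to match.
    return [0.25, 0.35, 0.25, 0.15]

def sample(dist):
    return random.choices(VOCAB, weights=dist, k=1)[0]

def speculative_step(context, k=4):
    """One round: the draft proposes k tokens, the target accepts/rejects."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = sample(draft_dist(ctx))
        proposed.append(t)
        ctx.append(t)

    # 2. Target model scores those same positions (in parallel in practice).
    out, ctx = list(context), list(context)
    for t in proposed:
        p, q = target_dist(ctx), draft_dist(ctx)
        if random.random() < min(1.0, p[t] / q[t]):
            out.append(t)   # accept the draft token
            ctx.append(t)
        else:
            # Reject: resample from the residual max(0, p - q), normalized.
            # This correction keeps the output distributed exactly as the target.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            t_new = sample([r / total for r in residual]) if total > 0 else sample(p)
            out.append(t_new)
            break           # everything after a rejection is discarded
    return out

print(speculative_step([0], k=4))
```

Each round thus costs one (batched) pass of the strong model but can emit up to k tokens, which is where the speedup comes from.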
More to follow..
- https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/
- https://www.baseten.co/blog/a-quick-introduction-to-speculative-decoding/
- https://research.google/blog/looking-back-at-speculative-decoding/
- https://medium.com/ai-science/speculative-decoding-make-llm-inference-faster-c004501af120