- Speed vs. accuracy trade-off.
- Reduce storage, compute, and operational costs.
- Speed up output generation, inference, etc.
- Work with lower-precision data.
- Cast/map data from 32-bit or higher-precision types (Int32, Float32, etc.) to lower-precision types such as 16-bit Brain Float (BFloat16), 4-bit NormalFloat (NF4), int4, or int8.
- Easy mapping: Float32 (1-bit sign, 8-bit exponent, 23-bit mantissa) => BFloat16 (1-bit sign, 8-bit exponent, 7-bit mantissa). Just discard the lower 16 bits of the mantissa; the exponent range is unchanged, so no overflow! (See the truncation sketch after this list.)
- Straightforward mapping: work out max, min, data distribution, mean, variance, etc., then sub-divide the range into equally sized buckets based on the bit width of the lower-precision type. E.g. int4 (4-bit) => 2^4 = 16 buckets. (A scalar-quantization sketch follows this list.)
- Handle outliers and data skew, which can distort the mapping, yet discarding them carelessly loses useful information. (See the percentile-clipping sketch below.)
- Work out bounds w.r.t. loss of accuracy (for linear scalar quantization, the worst-case round-off error is half a bucket width, i.e. scale/2).
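
A minimal sketch of the Float32 => BFloat16 truncation described above, using NumPy bit views (function names are illustrative; production code would round-to-nearest-even instead of truncating):

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate Float32 to BFloat16 by dropping the low 16 mantissa bits.

    Sign and exponent are preserved, so no overflow is possible;
    only mantissa precision is lost.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)  # keep sign + exponent + top 7 mantissa bits

def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    """Widen back to Float32 by zero-filling the discarded low mantissa bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -1e30, 2.5e-8], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))  # close to x, at reduced precision
```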
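
A sketch of the min/max bucket mapping from the "straightforward mapping" bullet, again in NumPy; the names and the int8 default are assumptions for illustration:

```python
import numpy as np

def scalar_quantize(x: np.ndarray, bits: int = 8):
    """Map floats linearly into 2^bits equally sized buckets.

    E.g. bits=4 -> 2**4 = 16 buckets, as in the note above.
    Codes fit in uint8 for bits <= 8. Returns the codes plus the
    (scale, min) pair needed to dequantize.
    """
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1                       # highest code value
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def scalar_dequantize(codes, scale, lo):
    """Approximate reconstruction: each code maps back to its level."""
    return codes.astype(np.float32) * scale + lo

x = np.random.randn(1000).astype(np.float32)
codes, scale, lo = scalar_quantize(x, bits=8)
x_hat = scalar_dequantize(codes, scale, lo)
print("max abs error:", np.abs(x - x_hat).max())  # bounded by scale / 2
```

The printed bound is the accuracy-loss bound from the last bullet: rounding to the nearest bucket can be off by at most half a bucket width.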
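
One common way to handle the outlier/skew problem from the list above is to pick the quantization bounds from percentiles instead of raw min/max; a sketch (the 0.1/99.9 cut-offs are an assumption, not a recommendation):

```python
import numpy as np

def clipped_range(x: np.ndarray, low_pct: float = 0.1, high_pct: float = 99.9):
    """Pick quantization bounds from percentiles rather than raw min/max.

    A few extreme outliers would otherwise stretch the buckets so much
    that almost all values collapse into one or two codes. Clipping
    sacrifices the outliers' exact values to keep resolution for the bulk.
    """
    lo, hi = np.percentile(x, [low_pct, high_pct])
    return float(lo), float(hi)

x = np.random.randn(10_000).astype(np.float32)
x[0] = 1e6                      # a single extreme outlier
print(clipped_range(x))         # bounds track the bulk of the data
print(x.min(), x.max())         # raw min/max is dominated by the 1e6 value
```

Values would then be clipped to [lo, hi] (e.g. via np.clip) before running the bucket mapping above.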
LLMs, AI/ML side:
- https://newsletter.theaiedge.io/p/reduce-ai-model-operational-costs
Lucene, Search side:
- https://www.elastic.co/search-labs/blog/scalar-quantization-101
- https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene