- Speed vs. accuracy trade-off.
- Reduce storage, compute, and operational costs.
- Speed up output generation, inference, etc.
- Work with lower-precision data.
- Cast/map data from 32-bit or higher-precision types (Int32, Float32, etc.) to lower-precision types such as 16-bit Brain Float (BFloat16), 4-bit NormalFloat (NF4), int4, or int8.
- Easy mapping: Float32 (1-bit sign, 8-bit exponent, 23-bit mantissa) => BFloat16 (1-bit sign, 8-bit exponent, 7-bit mantissa). Just discard the lower 16 bits of the mantissa; the exponent range is unchanged, so no overflow! (See the truncation sketch after this list.)
- Straightforward mapping: work out max, min, data distribution, mean, variance, etc., then sub-divide the range into equally sized buckets based on the bit width of the lower-precision type. E.g. int4 (4-bit) => 2^4 = 16 buckets. (A scalar-quantization sketch follows this list.)
- Handle outliers and data skew, which can distort the mapping, yet discarding them carelessly loses useful information. (See the percentile-clipping sketch below.)
- Work out bounds w.r.t. loss of accuracy (for linear scalar quantization, the worst-case round-off error is half a bucket width, i.e. scale/2).
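
A minimal sketch of the Float32 => BFloat16 truncation described above, using NumPy bit views (function names are illustrative; production code would round-to-nearest-even instead of truncating):

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Truncate Float32 to BFloat16 by dropping the low 16 mantissa bits.

    Sign and exponent are preserved, so no overflow is possible;
    only mantissa precision is lost.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)  # keep sign + exponent + top 7 mantissa bits

def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    """Widen back to Float32 by zero-filling the discarded low mantissa bits."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -1e30, 2.5e-8], dtype=np.float32)
print(bf16_bits_to_f32(f32_to_bf16_bits(x)))  # close to x, at reduced precision
```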
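
A sketch of the min/max bucket mapping from the "straightforward mapping" bullet, again in NumPy; the names and the int8 default are assumptions for illustration:

```python
import numpy as np

def scalar_quantize(x: np.ndarray, bits: int = 8):
    """Map floats linearly into 2^bits equally sized buckets.

    E.g. bits=4 -> 2**4 = 16 buckets, as in the note above.
    Codes fit in uint8 for bits <= 8. Returns the codes plus the
    (scale, min) pair needed to dequantize.
    """
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1                       # highest code value
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def scalar_dequantize(codes, scale, lo):
    """Approximate reconstruction: each code maps back to its level."""
    return codes.astype(np.float32) * scale + lo

x = np.random.randn(1000).astype(np.float32)
codes, scale, lo = scalar_quantize(x, bits=8)
x_hat = scalar_dequantize(codes, scale, lo)
print("max abs error:", np.abs(x - x_hat).max())  # bounded by scale / 2
```

The printed bound is the accuracy-loss bound from the last bullet: rounding to the nearest bucket can be off by at most half a bucket width.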
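
One common way to handle the outlier/skew problem from the list above is to pick the quantization bounds from percentiles instead of raw min/max; a sketch (the 0.1/99.9 cut-offs are an assumption, not a recommendation):

```python
import numpy as np

def clipped_range(x: np.ndarray, low_pct: float = 0.1, high_pct: float = 99.9):
    """Pick quantization bounds from percentiles rather than raw min/max.

    A few extreme outliers would otherwise stretch the buckets so much
    that almost all values collapse into one or two codes. Clipping
    sacrifices the outliers' exact values to keep resolution for the bulk.
    """
    lo, hi = np.percentile(x, [low_pct, high_pct])
    return float(lo), float(hi)

x = np.random.randn(10_000).astype(np.float32)
x[0] = 1e6                      # a single extreme outlier
print(clipped_range(x))         # bounds track the bulk of the data
print(x.min(), x.max())         # raw min/max is dominated by the 1e6 value
```

Values would then be clipped to [lo, hi] (e.g. via np.clip) before running the bucket mapping above.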
LLMs, AI/ML side:
- https://newsletter.theaiedge.io/p/reduce-ai-model-operational-costs
Lucene, Search side:
- https://www.elastic.co/search-labs/blog/scalar-quantization-101
- https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene