Quantization is widely applied to ML models these days to reduce the numerical precision of model parameters such as weights. For context:
- A typical LLM weight is a floating point number in FP32 precision, which uses 32 bits.
- Quantizing to a lower precision such as Int4, which uses 4 bits, gives an 8x saving per weight.
With models having several billion to trillions of such parameters, quantization yields a much smaller storage footprint for the trained model. More importantly, at inference time the lower precision parameters are loaded into memory, registers and the GPU much faster than the corresponding higher precision parameters, increasing inference speed and significantly lowering cost, energy use, etc. So the benefits compound with every run.
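As a rough back-of-the-envelope illustration (the 7 billion parameter count below is just an assumed example):

params = 7_000_000_000  # assumed parameter count for illustration

fp32_bytes = params * 32 / 8   # 4 bytes per FP32 weight
int4_bytes = params * 4 / 8    # 0.5 bytes per Int4 weight

print(f"FP32: {fp32_bytes / 1e9:.1f} GB")          # ~28.0 GB
print(f"Int4: {int4_bytes / 1e9:.1f} GB")          # ~3.5 GB
print(f"Saving: {fp32_bytes / int4_bytes:.0f}x")   # 8x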
But then again, there are no free lunches. The quality of the results is lower with lower precision quantized models, leading to a speed/size/cost vs quality tradeoff. There are several use cases (chat, image generation, embedded use in a mobile app, etc.) where the slightly lower quality output may be acceptable, so the quantized model wins. For deep research, thinking and planning type use cases, the full/high precision model is preferred.
The Keras library makes it very easy to quantize trained models. Training is done in full/high precision, while quantization is applied after the model is fully trained. To explain this we return to the trained Keras Text Classifier Model. In the TestTextClassificationTorch.py -> testQuantizeAndSaveModel() test, the trained model is loaded, quantized with an "int4" QUANTIZATION_MODE, and saved:
# Load the trained full precision model and quantize its weights in place
model = keras.models.load_model(SAVE_TO_DIR + 'TextClassificationTorchModel.keras')
model.quantize(QUANTIZATION_MODE)  # QUANTIZATION_MODE = "int4"
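The quantized model can then be saved like any other Keras model (the int4 file name below is an assumption for illustration):

# Persist the quantized model under a separate, assumed file name
model.save(SAVE_TO_DIR + 'TextClassificationTorchModelInt4.keras')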
The saved quantized model can then be used for running inferences instead of the full precision model. For inference, the quantized model reuses the same saved vocabulary as the full precision model, which has to be loaded as shown in TextClassificationTorchInference.py.
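A minimal sketch of such an inference run is shown below. The file names, the pickled vocabulary format and the TextVectorization configuration are assumptions for illustration; the actual steps live in TextClassificationTorchInference.py.

import pickle

import numpy as np
import keras

SAVE_TO_DIR = './models/'  # assumed location of the saved artifacts

# Load the quantized model instead of the full precision one
model = keras.models.load_model(SAVE_TO_DIR + 'TextClassificationTorchModelInt4.keras')

# Restore the vocabulary that was saved when the full precision model was trained
with open(SAVE_TO_DIR + 'vocabulary.pkl', 'rb') as f:
    vocabulary = pickle.load(f)

# Rebuild the text vectorizer with the saved vocabulary so tokens map to the
# same ids the model was trained with; the layer config must mirror training
vectorizer = keras.layers.TextVectorization(output_mode='int')
vectorizer.set_vocabulary(vocabulary)

# Vectorize a sample input and run it through the quantized model
sample = np.array(["an example piece of text to classify"])
prediction = model.predict(np.asarray(vectorizer(sample)))
print(prediction)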