
Thursday, December 4, 2025

Drift Detection across Distinct Reviews Datasets

 Model drift leads to invalid results from AI/ML inference models in production. Drift can have various causes: concept drift, structural changes or ingestion pipeline issues in upstream data sources, domain change, prompt injections and other model exploits, and so on. The net effect is that a model trained on one kind of data ends up running inferences on completely different, drifted data, which produces wrong results. So drift detection (periodic, near real-time, etc.) is crucial for any productionized model.

As mentioned previously, Evidently is a handy library for drift detection. Evidently has features like Metrics, Descriptors, Evals, etc. that can be plugged in to detect drift in the current data vis-a-vis a reference baseline dataset (~training data).

In DriftTextReviews.py, drift detection is done for an existing PyTorch text classification model originally trained on the IMDB movie reviews dataset. For the Reference data, a sample of the same IMDB data is used. For the Current data, data from a completely different domain, code reviews, is used. As expected, significant drift was detected between these two datasets from completely different domains. The Evidently reports below make the drift evidently clear!
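
A minimal sketch of this kind of setup (assuming Evidently's Report/preset style API; the file names, column names and descriptor choice here are placeholders, not the exact DriftTextReviews.py code):

    import pandas as pd
    from evidently import ColumnMapping
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset, TextEvals
    from evidently.descriptors import TextLength

    # Reference = sample of the IMDB training data, Current = code reviews (placeholder files)
    reference = pd.read_csv("imdb_reviews_sample.csv")   # columns: review, label
    current = pd.read_csv("code_reviews_sample.csv")

    mapping = ColumnMapping(target="label", text_features=["review"])

    report = Report(metrics=[
        DataDriftPreset(),
        TextEvals(column_name="review", descriptors=[TextLength()]),
    ])
    report.run(reference_data=reference, current_data=current, column_mapping=mapping)
    report.save_html("reviews_drift_report.html")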

  • The characteristic words have changed across the two domains. While the movie domain includes words like frame, character, minutes, etc, the coding domain has words like readable, test, method, etc. 
  • In terms of review text length, the IMDB reviews are much longer and contain many more words than the code reviews. These text length and word count features, hooked in as Descriptors, are duly detected and shown in the reports.
  • Interestingly, the label, either Positive (1) or Negative (0), shows no drift. Both datasets contain an equal number of the two classes, Positive and Negative.

Fig 1: Drift Review Length & Word Count

Fig 2: No Drift in Label

Fig 3: Characteristic Words - Current

Fig 4: Characteristic Words - Reference

Tuesday, December 2, 2025

Mixture of Experts and Switch Transformer

Mixture of Experts (MoE) is an innovative horizontal scaling technique applied to the basic Transformer architecture. The feed-forward (FFN) layer of the Transformer is replaced with an MoE layer, which is a collection of N experts (each one a separate FFN) in parallel. The MoE layer also includes a Router with (learnt) gating logic that decides which expert(s) to route each token to.

One of the early MoE-based Transformers was the Switch Transformer (https://arxiv.org/abs/2101.03961), with an MoE routing layer. The Switch Transformer specifically includes logic to balance token load across the different experts, in order to prevent hot-spots where only a few experts end up handling a majority of the tokens. Hot-spots also cause a second issue: the remaining experts stay untrained throughout training, rendering them useless for inference.

There are several SOTA MoE implementations available on the different ML platforms. The Keras-io examples include a Switch Transformer. The code text_classification_switch_transformer_pytorch.py is a PyTorch port of the same code, with a couple of changes to make the code modular and to resolve issues with the super() init call and position_in_expert.

Further, a much simpler combined SwitchRouter implementation is provided in SwitchTransformerUtil.SimpleSwitchRoute() (a rough PyTorch sketch follows below the list). The code flow is:

  • Compute gateLogits, with option to add Noise to load balance during training
  • Compute weights & selectedExperts indexes of the topK experts 
  • Compute auxLoss to be minimized for balancing load across experts
  • Finally, for every expert, fetch weights, invoke expert to get the outputs
  • Also drop tokens beyond expert capacity threshold

Fairly straightforward!
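
For illustration, here is a minimal top-1 switch-routing layer in PyTorch following the same flow. This is a rough sketch only; the class, method and parameter names are made up and it is not the SwitchTransformerUtil.SimpleSwitchRoute code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleSwitchMoE(nn.Module):
        # Illustrative top-1 switch routing over N expert FFNs
        def __init__(self, dModel, dFF, numExperts, capacityFactor=1.25):
            super().__init__()
            self.numExperts = numExperts
            self.capacityFactor = capacityFactor
            self.router = nn.Linear(dModel, numExperts)   # learnt gating logits
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dModel, dFF), nn.ReLU(), nn.Linear(dFF, dModel))
                for _ in range(numExperts))

        def forward(self, x):                             # x: (numTokens, dModel)
            gateLogits = self.router(x)
            if self.training:                             # noise to spread load during training
                gateLogits = gateLogits + torch.randn_like(gateLogits) * 0.01
            probs = F.softmax(gateLogits, dim=-1)
            weights, selectedExperts = probs.max(dim=-1)  # top-1 expert per token

            # Auxiliary load-balancing loss: numExperts * sum(tokenFraction_i * meanProb_i)
            oneHot = F.one_hot(selectedExperts, self.numExperts).float()
            auxLoss = self.numExperts * (oneHot.mean(dim=0) * probs.mean(dim=0)).sum()

            capacity = int(self.capacityFactor * x.size(0) / self.numExperts)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                tokenIds = torch.nonzero(selectedExperts == e).squeeze(-1)[:capacity]  # drop overflow tokens
                if tokenIds.numel() > 0:
                    out[tokenIds] = weights[tokenIds].unsqueeze(-1) * expert(x[tokenIds])
            return out, auxLoss

In a Transformer block, such a layer replaces the usual FFN, and auxLoss is added (suitably scaled) to the training loss.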

References

  • https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe?utm_source=publication-search
  • https://medium.com/@pilliudayaditya1207/understanding-mixture-of-experts-switch-transformers-load-balancing-vs-mixtral-s-natural-balance-25ed528cadfe
  • https://huggingface.co/blog/NormalUhr/moe-balance
  • https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts

Monday, December 1, 2025

Evidently - Model Drift

Evidently is a Python library to evaluate and monitor AI/ML projects. Evidently can be used to detect drift in models over time.

Reports from running the Evidently Metrics Cookbook give a good feel of its capabilities and features. More to follow...
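
A tiny sketch of the kind of report the cookbook builds (assuming the Report/metrics style API; the dataset and column names are placeholders):

    import pandas as pd
    from evidently.report import Report
    from evidently.metrics import DatasetDriftMetric, ColumnDriftMetric, ColumnSummaryMetric

    reference = pd.read_csv("reference.csv")   # placeholder datasets
    current = pd.read_csv("current.csv")

    report = Report(metrics=[
        DatasetDriftMetric(),
        ColumnDriftMetric(column_name="feature_1"),    # placeholder column
        ColumnSummaryMetric(column_name="feature_1"),
    ])
    report.run(reference_data=reference, current_data=current)
    report.save_html("metrics_report.html")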

Fig 1: Drift Report

Fig 2: Generator Drift Report

References

  • https://www.nannyml.com/blog/monitoring-computer-vision
  • https://www.labellerr.com/blog/computer-vision-data-drift/
  • https://blog.roboflow.com/monitor-data-drift-computer-vision/
  • https://nexla.com/ai-infrastructure/data-drift/
  • https://cobusgreyling.medium.com/llm-drift-prompt-drift-chaining-cascading-fa8fbf67c0fd
  • https://www.splunk.com/en_us/blog/learn/model-drift.html
  • https://en.wikipedia.org/wiki/Concept_drift
  • https://arize.com/model-drift/ 

Monday, November 24, 2025

On Quantization

Quantization is widely employed these days on ML models to reduce the numerical precision of the model parameters, such as weights. For context:

  • A typical LLM weight is a floating-point number in FP32 precision, which uses 32 bits.
  • With quantization to a lower precision like Int4, which uses 4 bits, there is an 8x saving per weight.

With models having several billions to trillions of such parameters, quantization results in much lower space utilization and storage requirements for the trained model. More importantly, at inference time the lower-precision parameters are loaded into memory, registers and the GPU much quicker than the corresponding higher-precision parameters, thereby increasing inference speed and significantly lowering costs, energy utilization, etc. So the benefits compound with every run.
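
A quick back-of-the-envelope check of the saving (a hypothetical 7B-parameter model, counting dense weights only):

    params = 7_000_000_000
    fp32GB = params * 4 / 1e9      # 4 bytes per FP32 weight  -> 28.0 GB
    int4GB = params * 0.5 / 1e9    # 0.5 bytes per Int4 weight -> 3.5 GB
    print(fp32GB, int4GB, fp32GB / int4GB)   # 28.0 3.5 8.0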

But then again, there are no free lunches. The quality of the results is lower with lower-precision quantized models, leading to a speed, size and cost vs quality tradeoff. There are several use cases (chat, image generation, embedded use in mobile apps, etc.) where slightly lower quality output may be acceptable, so the quantized model wins. For deep research, thinking and planning type use cases, the full/high-precision model is preferred.

The Keras library makes it very easy to quantize trained models. Training is done in full/high precision, while quantization is done after the model is fully trained. To illustrate this we return to the trained Keras text classifier model. In the TestTextClassificationTorch.py -> testQuantizeAndSaveModel() test, the trained model is loaded, quantized to an "int4" QUANTIZATION_MODE and saved:

    # Load the full-precision trained model, then quantize its weights in place
    model = keras.models.load_model(SAVE_TO_DIR + 'TextClassificationTorchModel.keras')
    model.quantize(QUANTIZATION_MODE)   # QUANTIZATION_MODE = "int4"


The quantized model can be saved and used for running inferences instead of the full-precision model. For inference, the quantized model uses the same saved vocabulary as the full-precision model, which has to be loaded as shown in TextClassificationTorchInference.py.
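
A rough sketch of that inference path (the file names, vocabulary loading and sequence length here are assumptions for illustration, not the actual TextClassificationTorchInference.py code):

    import json
    import keras
    from keras import layers

    # Hypothetical file names for the quantized model and the saved vocabulary
    quantized = keras.models.load_model('TextClassificationTorchModelInt4.keras')
    with open('vocabulary.json') as f:
        vocab = json.load(f)

    # Rebuild the vectorizer with the vocabulary saved from the full-precision run
    vectorizer = layers.TextVectorization(output_mode="int", output_sequence_length=250)
    vectorizer.set_vocabulary(vocab)

    tokens = vectorizer(["the plot dragged but the lead performance saved it"])
    print(quantized.predict(tokens))   # positive/negative score, same path as the FP32 model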