Friday, December 19, 2025

Reinforcement Learning

Reinforcement Learning (RL) is an important ML training paradigm. An RL model (agent) updates its parameters (weights) based on reward values it receives for the actions it takes, typically accumulated over a training run/episode. This is different from Supervised Learning, where the model learns from labelled data/examples. It is also different from Unsupervised Learning, where the model explores inherent features of unlabeled data during the learning phase to identify clusters, etc.

The keras-io examples include some RL implementations such as actor_critic, ppo, etc. All of them work solely with the TensorFlow (tf) backend. In keras_io_examples_rl these have been ported to the Torch/PyTorch backend. The typical changes include:

  • Torch imports
  • Use the torch-specific optimizer torch.optim.Adam
    • deep_q_network_breakout_pytorch() requires gradient clipping, which in torch is done before optimizer.step()
  • Gradient computations in torch
    • Replace tf GradientTape with torch autograd
    • Disable gradients globally with torch.set_grad_enabled(False)
    • Enable autograd within specific flows/methods where needed
    • Call loss.backward() and optimizer.step() for backpropagation
  • A few torch-specific tensor & function changes/wrappers
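
The changes listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the ported files: the tiny network, the train_step helper, the dummy batch and the max_norm value are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Disable autograd globally; it is re-enabled only inside the training step.
torch.set_grad_enabled(False)

# Hypothetical tiny Q-network: 4 state features -> 2 action values.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.SmoothL1Loss()  # Huber-style loss (an assumption for this sketch)


def train_step(states, actions, targets, max_norm=1.0):
    """One update: forward pass, backward pass, gradient clipping, optimizer step."""
    with torch.enable_grad():  # plays the role of the tf.GradientTape block
        q_values = model(states)
        # Q-value of the action actually taken in each state
        q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss_fn(q_taken, targets)
        optimizer.zero_grad()
        loss.backward()
    # Clip after backward() and before optimizer.step()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()


# Dummy batch standing in for samples drawn from a replay buffer
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
targets = torch.randn(8)
loss_value = train_step(states, actions, targets)
```

Outside train_step autograd stays off, so action selection and target computation build no graph and need no extra torch.no_grad() wrappers.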

The ported PyTorch-compatible files are:


References

  • http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf
  • https://hal.inria.fr/hal-00840470/document
  • https://link.springer.com/content/pdf/10.1007/BF00992698.pdf
  • https://www.semanticscholar.org/paper/Human-level-control-through-deep-reinforcement-Mnih-Kavukcuoglu/340f48901f72278f6bf78a04ee5b01df208cc508
  • Continuous control with deep reinforcement learning: https://arxiv.org/abs/1509.02971
  • Deep Deterministic Policy Gradient (DDPG) 
  • https://gymnasium.farama.org/
  • Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Thursday, December 4, 2025

Drift Detection across Distinct Reviews Datasets

Model Drift leads to invalid results from AI/ML inference models in production. Drift can have various causes, such as concept drift, structural changes and ingestion pipeline issues with upstream data sources, domain change, prompt injections and other model exploits, etc. As a result, an AI/ML model that was trained on certain kind(s) of data has to run inference on completely different, drifted data, which leads to wrong results. So Drift detection (periodic, near real-time, etc.) is crucial for any productionized model.

As mentioned previously, Evidently is a handy library for drift detection. Evidently has features like Metrics, Descriptors, Eval, etc. that can be plugged in to detect drift in the current data vis-a-vis a reference baseline data (~training data).

In DriftTextReviews.py, drift detection is done for an existing Text Classification model in PyTorch originally trained on an IMDB movie review dataset. For Reference data, a sample of the same IMDB movie data is used. For Current data, reviews from a completely different domain, Code Reviews, are used. As expected, significant drift was detected between these two datasets, which belong to two completely different domains. The Evidently reports below make the drift evidently clear!

  • The characteristic words have changed across the two domains. While the movie domain includes words like frame, character, minutes, etc., the coding domain has words like readable, test, method, etc.
  • In terms of length of review text, IMDB reviews are much longer and include many more words than the Code reviews. These text length and word count features, hooked in as Descriptors, are duly detected and shown in the reports.
  • Interestingly, the Label, either Positive (1) or Negative (0), shows no drift. Both datasets contain an equal number of Positive & Negative examples.
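
To make the idea behind the length and label findings concrete, here is a plain-Python sketch. It is not Evidently's API and not the code in DriftTextReviews.py: the descriptor functions, the hand-rolled Kolmogorov-Smirnov statistic, the 0.5 threshold and the toy review texts are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def text_length(text):
    """Descriptor: number of characters in the review."""
    return len(text)


def word_count(text):
    """Descriptor: number of whitespace-separated words in the review."""
    return len(text.split())


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for value in set(ref) | set(cur):
        ref_cdf = bisect_right(ref, value) / len(ref)
        cur_cdf = bisect_right(cur, value) / len(cur)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


def drift_report(reference_texts, current_texts, threshold=0.5):
    """Flag drift per descriptor when the KS statistic exceeds the (arbitrary) threshold."""
    report = {}
    for name, descriptor in (("text_length", text_length), ("word_count", word_count)):
        stat = ks_statistic([descriptor(t) for t in reference_texts],
                            [descriptor(t) for t in current_texts])
        report[name] = {"ks": stat, "drift": stat > threshold}
    return report


# Made-up toy samples: long movie reviews (reference) vs short code reviews (current)
reference_texts = [
    "The characters were well drawn and every frame of the film kept me hooked for the full ninety minutes.",
    "A slow first act, but the final thirty minutes make up for it with a truly memorable lead character.",
    "Poor pacing and a flat script; I checked my watch several times before the credits finally rolled.",
]
current_texts = [
    "Readable method, add a test.",
    "Please rename this variable.",
    "Looks good, tests pass.",
]
report = drift_report(reference_texts, current_texts)

# Balanced labels in both datasets give identical distributions, hence no drift
label_gap = ks_statistic([1, 0, 1, 0], [0, 1, 0, 1])
```

In this toy data every reference review is longer than every current review, so both descriptors are flagged as drifted, while the equally balanced labels give a gap of zero.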

Fig 1: Drift Review Length & Word Count

Fig 2: No Drift in Label

Fig 3: Characteristic Words - Current

Fig 4: Characteristic Words - Reference

Tuesday, December 2, 2025

Mixture of Experts and Switch Transformer

Mixture of Experts (MoE) is an innovative horizontal scaling technique applied to the basic Transformer architecture. The Feed Forward (FFN) Layer of the Transformer is replaced with a MoE layer, which is a collection of N Experts (each one a separate FFN) in parallel. The MoE also includes a Router layer with a learnt gating logic to decide which expert(s) to route each token to.

One of the early MoE-based Transformers was the Switch Transformer (https://arxiv.org/abs/2101.03961) with a MoE routing layer. The Switch Transformer specifically includes logic to balance token loads across the different Experts in order to prevent hot-spots where only a few experts end up handling a majority of the tokens. Such hot-spots also cause a second issue: the other experts remain under-trained through training, thereby rendering them useless for inference.

There are several state-of-the-art (SOTA) MoE implementations available on the different ML platforms. The Keras-io examples include one Switch Transformer. The code text_classification_switch_transformer_pytorch.py is a PyTorch port of the same code, with a couple of changes made to make the code modular and to resolve issues with the super().__init__() call and position_in_expert.

Further, a much simpler combined switch-router implementation is provided in SwitchTransformerUtil.SimpleSwitchRoute(). The code flow is:

  • Compute gateLogits, with option to add Noise to load balance during training
  • Compute weights & selectedExperts indexes of the topK experts 
  • Compute auxLoss to be minimized for balancing load across experts
  • Finally, for every expert, fetch weights, invoke expert to get the outputs
  • Also drop tokens beyond expert capacity threshold

Fairly straightforward!
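
The flow above could look roughly like the sketch below. It is not the actual SimpleSwitchRoute() code: the class name, layer sizes, noise_std and capacity_factor are hypothetical, routing is top-1 as in the Switch Transformer, and the scaling coefficient normally applied to the auxiliary loss is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleSwitchLayer(nn.Module):
    """Top-1 (switch) routing over a set of expert FFNs. All names are hypothetical."""

    def __init__(self, d_model=16, d_hidden=32, num_experts=4,
                 capacity_factor=1.25, noise_std=0.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.noise_std = noise_std
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, tokens):  # tokens: (num_tokens, d_model)
        num_tokens = tokens.shape[0]
        gate_logits = self.gate(tokens)
        if self.training and self.noise_std > 0:
            # Optional noise to encourage exploration of experts during training
            gate_logits = gate_logits + torch.randn_like(gate_logits) * self.noise_std
        probs = F.softmax(gate_logits, dim=-1)
        weights, selected = probs.max(dim=-1)  # top-1 weight and expert index per token

        # Auxiliary loss: N * sum_i (fraction of tokens sent to i) * (mean router prob of i)
        one_hot = F.one_hot(selected, self.num_experts).float()
        aux_loss = self.num_experts * (one_hot.mean(dim=0) * probs.mean(dim=0)).sum()

        capacity = int(self.capacity_factor * num_tokens / self.num_experts)
        output = torch.zeros_like(tokens)  # dropped tokens keep a zero output
        for idx, expert in enumerate(self.experts):
            # Tokens routed to this expert; anything beyond capacity is dropped
            token_ids = torch.nonzero(selected == idx).squeeze(1)[:capacity]
            if token_ids.numel() > 0:
                output[token_ids] = weights[token_ids].unsqueeze(1) * expert(tokens[token_ids])
        return output, aux_loss


torch.manual_seed(0)
layer = SimpleSwitchLayer(capacity_factor=0.5)
tokens = torch.randn(16, 16)  # 16 tokens, d_model = 16
output, aux_loss = layer(tokens)
```

With 16 tokens, 4 experts and a capacity factor of 0.5, each expert processes at most 2 tokens, so at least half of the tokens are dropped in this run. In a full Transformer block the residual connection would carry those tokens forward unchanged.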

References

  • https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe?utm_source=publication-search
  • https://medium.com/@pilliudayaditya1207/understanding-mixture-of-experts-switch-transformers-load-balancing-vs-mixtral-s-natural-balance-25ed528cadfe
  • https://huggingface.co/blog/NormalUhr/moe-balance
  • https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts

Monday, December 1, 2025

Evidently - Model Drift

Evidently is a Python library to evaluate and monitor AI/ML projects. Evidently can be used to detect Drift seen in models over time.

Reports from running the Evidently Metrics Cookbook give a good feel of its capabilities and features. More to follow...

Fig 1: Drift Report

Fig 2: Generator Drift Report

References

  • https://www.nannyml.com/blog/monitoring-computer-vision
  • https://www.labellerr.com/blog/computer-vision-data-drift/
  • https://blog.roboflow.com/monitor-data-drift-computer-vision/
  • https://nexla.com/ai-infrastructure/data-drift/
  • https://cobusgreyling.medium.com/llm-drift-prompt-drift-chaining-cascading-fa8fbf67c0fd
  • https://www.splunk.com/en_us/blog/learn/model-drift.html
  • https://en.wikipedia.org/wiki/Concept_drift
  • https://arize.com/model-drift/