
Thursday, December 4, 2025

Drift Detection across Distinct Reviews Datasets

Model Drift leads to invalid results from AI/ML inference models in production. Drift can have various causes: concept drift, structural changes or ingestion pipeline issues in upstream data sources, domain change, prompt injections and other model exploits, and so on. The net effect is that a model trained on one kind of data ends up running inferences on completely different, drifted data, which produces incorrect results. Drift detection (periodic, near real-time, etc.) is therefore crucial for any productionized model.

As mentioned previously, Evidently is a handy library for drift detection. Evidently has features like Metrics, Descriptors, Evals, etc. that can be plugged in to detect drift in the current data vis-a-vis a reference baseline dataset (~training data).

In DriftTextReviews.py, drift detection is done for an existing PyTorch Text Classification model originally trained on the IMDb movie reviews dataset. For Reference data, a sample of the same IMDb data is used. For Current data, data from a completely different domain, Code Reviews, is used. As expected, significant drift was detected between these two datasets from two completely different domains; a sketch of the report set-up follows the findings below. The Evidently reports make the drift evidently clear!

  • The characteristic words have changed across the two domains. While the movie domain includes words like frame, character, minutes, etc., the coding domain has words like readable, test, method, etc.
  • In terms of review text length, IMDb reviews are much longer and contain many more words than the Code reviews. These text length and word count features, hooked in as Descriptors, are duly detected and shown in the reports.
  • Interestingly, the Label, either Positive (1) or Negative (0), shows no drift. Both datasets contain an equal number of the two classes.
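For illustration, here is a minimal, hedged sketch of how such a text drift report can be put together with Evidently's legacy Report API. The column names, the data loading and the preset choice are assumptions for the sketch, not the exact code in DriftTextReviews.py:

    import pandas as pd
    from evidently import ColumnMapping
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    # Reference = a sample of IMDb reviews, Current = code-review texts (assumed column names)
    reference = pd.DataFrame({"text": imdb_texts, "label": imdb_labels})
    current = pd.DataFrame({"text": code_review_texts, "label": code_review_labels})

    # Mark the text column so Evidently applies its text drift methods to it
    column_mapping = ColumnMapping(text_features=["text"], target="label")

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
    report.save_html("drift_report.html")   # drift flagged for the text, none expected for the label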

Fig 1: Drift Review Length & Word Count

Fig 2: No Drift in Label

Fig 3: Characteristic Words - Current

Fig 4: Characteristic Words - Reference

Tuesday, December 2, 2025

Mixture of Experts and Switch Transformer

Mixture of Experts (MoE) is an innovative horizontal scaling technique applied to the basic Transformer architecture. The Feed Forward (FFN) layer of the Transformer is replaced with an MoE layer, which is a collection of N experts (each one a separate FFN) in parallel. The MoE layer also includes a Router with (learnt) gating logic to decide which expert(s) to route each token to.

One of the early MoE-based Transformers was the Switch Transformer (https://arxiv.org/abs/2101.03961) with an MoE routing layer. The Switch Transformer specifically includes logic to balance token load across the different experts, in order to prevent hot-spots where only a few experts end up handling the majority of tokens. Hot-spots also lead to a second issue: the other experts remain untrained through training, thereby rendering them useless for inference.

There are several SOTA MoE implementations available on the different ML platforms. The keras-io examples include a Switch Transformer. The code text_classification_switch_transformer_pytorch.py is a PyTorch port of the same code, with a couple of changes to make the code modular and to resolve issues with the super().__init__() call and position_in_expert.

Further, a much simpler combined SwitchRouter implementation is provided in SwitchTransformerUtil.SimpleSwitchRoute(). The code flow is:

  • Compute gateLogits, with option to add Noise to load balance during training
  • Compute weights & selectedExperts indexes of the topK experts 
  • Compute auxLoss to be minimized for balancing load across experts
  • Finally, for every expert, fetch weights, invoke expert to get the outputs
  • Also drop tokens beyond expert capacity threshold

Fairly straightforward!
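For illustration, here is a rough PyTorch sketch of those routing steps. It is a simplified, hedged version under assumed shapes and hyperparameters, not the actual SwitchTransformerUtil.SimpleSwitchRoute() code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleSwitchRouter(nn.Module):
        """Illustrative top-k router: noisy gating, aux load-balancing loss, capacity-based token drop."""
        def __init__(self, d_model, num_experts, top_k=1, capacity_factor=1.0):
            super().__init__()
            self.gate = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                               nn.Linear(4 * d_model, d_model)) for _ in range(num_experts)])
            self.num_experts, self.top_k, self.capacity_factor = num_experts, top_k, capacity_factor

        def forward(self, x):                                    # x: (num_tokens, d_model)
            gate_logits = self.gate(x)
            if self.training:                                    # noise helps balance load while training
                gate_logits = gate_logits + 0.01 * torch.randn_like(gate_logits)
            probs = F.softmax(gate_logits, dim=-1)
            weights, selected_experts = probs.topk(self.top_k, dim=-1)

            # Auxiliary loss: fraction of tokens routed to each expert * mean router prob per expert
            tokens_per_expert = F.one_hot(selected_experts[:, 0], self.num_experts).float().mean(0)
            prob_per_expert = probs.mean(0)
            aux_loss = self.num_experts * (tokens_per_expert * prob_per_expert).sum()

            capacity = int(self.capacity_factor * x.size(0) / self.num_experts)
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                # tokens routed to expert e, truncated to its capacity (overflow tokens are dropped)
                idx = (selected_experts[:, 0] == e).nonzero(as_tuple=True)[0][:capacity]
                if idx.numel() > 0:
                    # for brevity only the top-1 expert contribution is applied here
                    out[idx] = weights[idx, 0].unsqueeze(-1) * expert(x[idx])
            return out, aux_loss

During training, aux_loss would be added (suitably scaled) to the task loss so the router learns to spread tokens across the experts.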

References

  • https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe?utm_source=publication-search
  • https://medium.com/@pilliudayaditya1207/understanding-mixture-of-experts-switch-transformers-load-balancing-vs-mixtral-s-natural-balance-25ed528cadfe
  • https://huggingface.co/blog/NormalUhr/moe-balance
  • https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts

Monday, December 1, 2025

Evidently - Model Drift

Evidently is a Python library to evaluate and monitor AI/ML projects. Evidently can be used to detect drift in models over time.

Reports from running the Evidently Metrics Cookbook give a good feel for its capabilities and features. More to follow...

Fig 1: Drift Report

Fig 2: Generator Drift Report

References

  • https://www.nannyml.com/blog/monitoring-computer-vision
  • https://www.labellerr.com/blog/computer-vision-data-drift/
  • https://blog.roboflow.com/monitor-data-drift-computer-vision/
  • https://nexla.com/ai-infrastructure/data-drift/
  • https://cobusgreyling.medium.com/llm-drift-prompt-drift-chaining-cascading-fa8fbf67c0fd
  • https://www.splunk.com/en_us/blog/learn/model-drift.html
  • https://en.wikipedia.org/wiki/Concept_drift
  • https://arize.com/model-drift/ 

Saturday, November 29, 2025

Fine Tuning Text Classification Model

Fine tuning is a technique applied to a fully trained base (foundation) model to retrain/repurpose it for some different objective(s). The key aspect of fine tuning is that it is not a complete/full retraining of the base model. It is done on a much smaller training budget, keeping the weights of the original model intact and bringing in a much smaller additional set of trainable weights known as adapters.

These adapter weights are typically low-rank matrices, hence the name LoRA (Low-Rank Adaptation). In every round of training only these LoRA weights get updated, while the weights of the base model stay frozen. Since the final weights are additive, the corresponding fine-tuned LoRA model equation is:

    output = f(W_base*x + b_base + B*A*x), for any given input x

    W_base, b_base: Base model weights & bias, which remain fixed
    B, A: Low Rank Adapter weights of a small rank (r), which are trained during fine tuning
    f: Activation Function

The example TextClassificationFineTuningLora.py demonstrates the working of the LoRA adapter for fine tuning a Text Classification model.

Fine Tuning Objective

Fine Tuning Details

  • The base model had 2.67 Mn total parameters, of which 8.86 Lakh parameters were trainable. For fine tuning, these 8.86 Lakh parameters are all frozen.
  • LoRA is applied to every trainable layer of the base model: each such layer is set to enable_lora(rank=4). This results in a total of just ~30.6K trainable parameters (see the sketch after this list).
  • After fine tuning, the model is able to identify Exaggerations with an accuracy in the high 90s.
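As a rough, hedged sketch of how this can be wired up with Keras 3 (the model file name and the fine-tuning dataset are assumptions; the actual code is in TextClassificationFineTuningLora.py):

    import keras

    # Load the fully trained base Text Classification model (file name assumed)
    model = keras.models.load_model("TextClassificationTorchModel.keras")

    # enable_lora() freezes the layer's original kernel and adds small trainable
    # rank-4 A/B matrices; Dense/Embedding/EinsumDense layers expose it in Keras 3
    for layer in model.layers:
        if hasattr(layer, "enable_lora"):
            layer.enable_lora(rank=4)

    model.summary()   # trainable parameter count drops to just the LoRA weights

    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(exaggeration_ds, epochs=3)   # fine-tune on the new (Exaggeration) objective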

Friday, November 28, 2025

Knowledge Distillation

Knowledge distillation from a large trained Teacher model to a smaller Student model is a very popular technique in ML. Distillation helps train a Student model which, despite being much smaller and compressed, shows performance comparable to the larger Teacher model.

The other advantage of Distillation is that the Student model requires a much smaller set of labelled training data (<10%), since it is essentially trying to match the output of the Teacher during training. The Distillation loss is a function of the difference between the predictions of the Student (y_pred) & the Teacher (teacher_pred) for every training input (x). The Kullback-Leibler divergence (KLDivergence) loss between y_pred & teacher_pred is a common pick for the Distillation loss.
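For illustration, a generic, hedged sketch of such a distillation loss in PyTorch (the temperature/alpha values and the blend with a hard-label loss are common choices, not necessarily what TextClassificationDistillation.py does):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.1):
        # Hard loss against the (small) labelled training set
        hard_loss = F.cross_entropy(student_logits, labels)
        # Soft loss: KL divergence between the softened Student and Teacher distributions
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
        # Weighted blend; alpha controls how much the hard labels matter
        return alpha * hard_loss + (1 - alpha) * kd_loss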

For a working example of Distillation refer to TextClassificationDistillation.py, which distills a Keras Text Classification model with the Torch backend. The original Text Classification Teacher model had several Convolution layers, which have been replaced by a Dense layer in the Student. Also, the input Embedding layer's output dimension has been reduced from 128 to 32.

The original Text Classification (Teacher) model had ~2.67 Mn parameters (8.9 Lakh trainable) and was trained with 25K data samples. The distilled Student model has only ~1.6 Lakh parameters (~18% of the Teacher's trainable parameters) and was trained using 2.5K samples (~10%). In terms of saved model size, the Teacher model is 10.2 MB vs 0.6 MB for the Student. Only a marginal 4% drop in accuracy was seen with the Student model on the held-out test data.

Fig 1: Keras Text Classification - Teacher Model

Fig 2: Keras Text Classification - Student Model

Wednesday, November 26, 2025

Explainable AI

With the widespread adoption of large Machine Learning (ML) models all over, there's a real need to understand the workings of these models. Otherwise the model just appears to be a black box doing its thing, without the end user really knowing the whys behind the model's responses, choices, decisions, etc. Looking inside the model - the white-box approach - while possible, is simply not practical for 99.99..9% of users.

Local Interpretable Model-Agnostic Explanations (LIME) & Shapley Additive Explanations (SHAP) are two black-box techniques that help explain the workings of such models. The key idea behind both:

  • To generate some (synthetic) input data from actual data, with some of the features (such as income, age, etc.) altered at random.
  • Then to run the generated input data through the model and use the output to understand the effects of the altered features (one or more/combinations) on the output, and thereby understand the importance/relevance of the features to the outputs of the model.
  • For e.g. in a loan approval/rejection scenario, by altering the two features income level & gender in the input and testing, one might discover that income level has an effect on the decision but gender does not.

With that background, let's look at SHAP for language models that take texts as input. Here features are the words (tokens) that comprise the input string. 

For an input like: "Glad to see you"


The features are: "Glad", "to", "see", "you" 

Shap would explain the impact of each word (token) on the output of the model by passing in various altered data with words MASKED:
       "* to see you",  "Glad to * you", ... 

TextClassificationTorchShap.py shows how SHAP works with the Text Classification model trained on the IMDb dataset. The code requires shap to be installed:

        pip3 install shap

In terms of its working, it loads the pre-trained Text Classification model and vocabulary, then plugs into the library using a custom shap tokenizer to generate token_ids & offsets for the given input data.

    masker = maskers.Text(custom_tokenizer, mask_token=SPECIAL_TOKEN_UNK)
    explainer = shap.Explainer(predict, masker=masker)

Finally, shap is called with some sample input text, which has words masked at random. Shap collects the outputs, which can be used to generate a visual report of the impact of the different words, as seen below.
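Putting it together, a hedged sketch of the call (predict and custom_tokenizer are the helpers from TextClassificationTorchShap.py referred to above; the sample strings are taken from the figures; shap.plots.text is the standard visualisation assumed here):

    import shap
    from shap import maskers

    masker = maskers.Text(custom_tokenizer, mask_token=SPECIAL_TOKEN_UNK)
    explainer = shap.Explainer(predict, masker=masker)

    sample_texts = ["This is a great one to watch.",
                    "What a long drawn boring affair to the end credits."]
    shap_values = explainer(sample_texts)    # shap masks words at random & collects the model outputs

    shap.plots.text(shap_values)             # renders the per-word impact report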

The model classifies any given input text as either POSITIVE (score near 1) or NEGATIVE (score near 0). The figure shows the output for two inputs: "This is a great one to watch." & "What a long drawn boring affair to the end credits."

Let's look first at "This is a great one to watch.":

  • There is a base value = 0.539161 which is the model's output for a completely MASKED out input, i.e. "* * * * * * *"
  • The words "to w..", "This is" move the score up to 0.7
  • In addition, the words "a great" move the score up to 0.996787, the actual output of the model for the complete input text "This is a great one to watch."
  • The model rightly classifies this as POSITIVE with a score of 0.996787 (close to 1) 

Similarly for the text "What a long drawn boring affair to the end credits.":

  • Completely masked base value = 0.539161.
  • The key words in this case are "boring affair to the".
  • The text is rightly classified as NEGATIVE with a score of 0.0280297 (close to 0).

Monday, November 24, 2025

On Quantization

Quantization is widely employed these days on ML models to reduce the numerical precision of the model parameters, such as weights. For context:

  • A typical LLM weight is a floating point number in FP32 precision, which uses 32 bits.
  • With quantization to a lower precision such as Int4, which uses 4 bits, there's an 8x saving per weight.

With models having several billions to trillions of such parameters, quantization results in much lower space utilization and storage requirements for the trained model. More importantly, at inference time the lower precision parameters are loaded into memory, registers and the GPU much quicker than the corresponding higher precision parameters, thereby increasing inference speed and significantly lowering costs, energy utilization, etc. So the benefits compound with every run.

But then again, there are no free lunches. The quality of the results is lower with lower precision quantized models, leading to a speed, size & cost vs quality tradeoff. There are several use cases (chat, image generation, embedded use in mobile apps, etc.) where slightly lower quality outputs may be acceptable, so the quantized model wins. For deep research, thinking and planning type use cases the full/high precision model is preferred.

The Keras library makes it very easy to quantize trained models. Training is done in full/high precision, while quantization is done after the model is fully trained. To explain this we return to the trained Keras Text Classifier model. In the TestTextClassificationTorch.py -> testQuantizeAndSaveModel() test the trained model is loaded, quantized with an "int4" QUANTIZATION_MODE and saved:

    model = keras.models.load_model(SAVE_TO_DIR + 'TextClassificationTorchModel.keras')
    model.quantize(QUANTIZATION_MODE)


The quantized model can be saved and then used for running inferences in place of the full precision model. For inference, the same saved vocabulary of the full precision model is used by the quantized model and will have to be loaded, as shown in TextClassificationTorchInference.py.
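A short, hedged sketch of that last step (the quantized model's file name and the vectorize helper that maps text to token ids using the saved vocabulary are assumptions; the actual flow is in TextClassificationTorchInference.py):

    # Save the quantized model alongside the full precision one (file name assumed)
    model.save(SAVE_TO_DIR + 'TextClassificationTorchModelInt4.keras')

    # Inference: reload the quantized model and reuse the full precision model's vocabulary
    quantized = keras.models.load_model(SAVE_TO_DIR + 'TextClassificationTorchModelInt4.keras')
    vectorized = vectorize(["This is a great one to watch."])   # assumed helper built from the saved vocabulary
    print(quantized.predict(vectorized))                        # score near 1 => POSITIVE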

Saturday, November 22, 2025

Text Classification from Scratch using PyTorch

The AI/ML development framework Keras 3.x has in recent times got support for Torch & JAX backends, in addition to Tensorflow. However, given Keras's Tensorflow legacy, large sections of the code are deeply integrated with Tensorflow.

One such piece of code is text_classification_from_scratch.py from the keras-io/ examples project. Without tensorflow this piece of code simply won't run!

Here's text_classification_torch.py, a pure Torch/PyTorch port of the same code. The bits that needed modification:

  • Removing all tensorflow related imports
  • Loading the Imdb text files in "grain" format in place of "tf" format, by passing the appropriate param: 

    keras.utils->text_dataset_from_directory(format="grain") 

Also grain needs to be installed:

    pip3 install grain 

  • For building Vocab, Tokenizer, Vectorizing use torchtext:

    pip3 install torchtext

  • A few other changes, such as ensuring the max_features constraint is honoured and that the text is standardized and padded, and so on (a rough sketch of the vocab/vectorize steps follows below)
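As a rough, hedged sketch of the torchtext-based vocab/tokenize/vectorize steps (train_texts, the size limits and the padding scheme are illustrative assumptions, not the exact code in text_classification_torch.py):

    from torchtext.data.utils import get_tokenizer
    from torchtext.vocab import build_vocab_from_iterator

    max_features = 20000      # cap on vocabulary size (assumed value)
    sequence_length = 500     # pad/truncate every review to this many tokens (assumed value)

    tokenizer = get_tokenizer("basic_english")

    def yield_tokens(texts):
        for text in texts:
            yield tokenizer(text.lower())

    # Vocabulary capped at max_features, honouring the max_features constraint mentioned above
    vocab = build_vocab_from_iterator(yield_tokens(train_texts),
                                      max_tokens=max_features,
                                      specials=["<unk>", "<pad>"])
    vocab.set_default_index(vocab["<unk>"])

    def vectorize(text):
        ids = vocab(tokenizer(text.lower()))[:sequence_length]
        return ids + [vocab["<pad>"]] * (sequence_length - len(ids))   # right-pad to a fixed length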

Saturday, November 15, 2025

Guardrails & Guard-Llm's

With the wide scale adoption of LLMs & Agentic models in production, there's also a pressing need to verify both the inputs & outputs of GenAI use cases. This should ideally be done in real-time, just before serving the response to the end user, to ensure that no invalid, harmful, hateful, confidential, etc. content goes through in either direction. Guardrails are the answer to that very problem.

The simple idea with Guardrails is to apply intelligent input/output filters that can sanitize requests/responses and stop bad ones from getting through. There are many ways of implementing Guardrails, such as pattern-based filters, rule engines, etc. Though these have worked so far, in an ever changing Agentic world it's now up to self-learning guard LLMs to judge & flag!

Guard LLMs are specifically trained to flag harmful content. One such implementation is llama-guard, which flags violations of any of the MLCommons AI Safety taxonomy categories.

An implementation of the guard-LLM pattern can be found in the ApiCaller project, more specifically in ApiCaller->invokeWithGuardrails():

  • First calls a local Ollama model with the sanitized input to get a response
  • Then calls the isSafe() method with the received response
  • isSafe() internally makes a call to a different Ollama model, llama-guard, which flags the content as safe/unsafe

Check the TestApiCaller.py test case for better clarity.
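For illustration, a rough, hedged sketch of the same pattern using the Ollama Python client (function names, model tags and the verdict parsing here are illustrative, not the actual ApiCaller code):

    import ollama   # assumes the ollama Python client and the local models are installed

    CHAT_MODEL = "llama3.2"
    GUARD_MODEL = "llama-guard3"

    def is_safe(text: str) -> bool:
        # llama-guard replies with "safe", or "unsafe" plus the violated category
        verdict = ollama.generate(model=GUARD_MODEL, prompt=text)["response"]
        return verdict.strip().lower().startswith("safe")

    def invoke_with_guardrails(user_prompt: str) -> str:
        response = ollama.generate(model=CHAT_MODEL, prompt=user_prompt)["response"]
        return response if is_safe(response) else "Response blocked by guardrail."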

References

  • https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/
  • https://www.ibm.com/think/tutorials/llm-guardrails
  • https://ollama.com/library/llama-guard3

Friday, November 14, 2025

LangWatch Scenario with Ollama

LangWatch Scenario is a pytest-based framework for Agent testing. Scenario runs against OpenAI-compatible APIs. Here we show how to get LangWatch running using local LLMs with Ollama.

The code test_ollama_client.py is along the same lines as the test_azure_api_gateway.py from the scenario python examples folder. 

Changes specific to Ollama being:

1. Set-up

    pip3 install langwatch-scenario 

Environment variables

    export OPENAI_API_BASE_URL=http://localhost:11434/api/
    export OPENAI_API_KEY=NOTHING

2. Create Ollama client

    def ollama_client() -> OpenAI:
        return OpenAI(base_url=OLLAMA_BASE_URL, api_key="NOTHING")

3. Configuring the Ollama model (gemma, etc) & custom_llm_provider ("ollama") in the Agents (UserSimulatorAgent & JudgeAgent)           

    scenario.UserSimulatorAgent(model=OLLAMA_MODEL, client=custom_client, custom_llm_provider=CUSTOM_LLM_PROVIDER)...

For better clarity see test_ollama_client.py.

4. Offline LangWatch Scenario Reporter

For every run, LangWatch uploads run results to the app.langwatch.ai endpoint. For a truly offline run, point LANGWATCH_ENDPOINT elsewhere:

    export LANGWATCH_ENDPOINT=<https://YOUR_REPORTING_ENDPOINT>

There's no option to disable scenario reporting for now. The only workaround is to set LANGWATCH_ENDPOINT to an invalid value (e.g. "http://localhost2333/invalid").

 

Sunday, November 2, 2025

DeepEval

DeepEval helps to test and verify the correctness of LLMs. DeepEval is a framework with a suite of Metrics and Synthetic Data generation, with integrations across all leading AI/ML libraries.

DeepEval can be used to set up one LLM to judge the output of another LLM. This Judge-LLM set-up can be used at both the training and the live inference stage in MLOps scenarios.

Getting started with DeepEval is simple with Ollama:

(I) Installation

    pip install deepeval

Ollama installation was covered previously with a llama3.2 base model. 

(II) Set the Ollama model in DeepEval

    # Unset the openai model - default for DeepEval
    deepeval unset-openai

    # Set ollama model for DeepEval
    deepeval set-ollama "llama3.2:1b" --base-url="http://localhost:11434"

(III) Create the JudgeLLM.py code

    from deepeval import evaluate
    from deepeval.metrics import GEval
    from deepeval.metrics.g_eval import Rubric
    from deepeval.models import OllamaModel
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    # Set up the ollama model
    model = OllamaModel(
        model="llama3.2:1b",
        base_url="http://localhost:11434",
        temperature=0.0,  # example: setting a custom temperature
    )

    # Set up the evaluation metric
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually correct based on the expected output.",
        evaluation_params=[LLMTestCaseParams.INPUT,
                           LLMTestCaseParams.ACTUAL_OUTPUT,
                           LLMTestCaseParams.EXPECTED_OUTPUT],
        model=model,  # ollama model
        rubric=[
            Rubric(score_range=(0, 2), expected_outcome="Factually incorrect."),
            Rubric(score_range=(3, 6), expected_outcome="Mostly correct."),
            Rubric(score_range=(7, 9), expected_outcome="Correct but missing minor details."),
            Rubric(score_range=(10, 10), expected_outcome="100% correct."),
        ],
        # threshold=0.1
    )

    # Define the test case
    test_case_maths = LLMTestCase(
        input="what is 80 in words? using only 1 word.",
        actual_output="eighty",
        expected_output="eighty",
    )

    # Run the evaluation
    evaluate(test_cases=[test_case_maths], metrics=[correctness_metric])

(IV) Execute the JudgeLLM.py 

    deepeval test run JudgeLLM.py

 

Friday, October 31, 2025

Codegen LLMs

One GenAI feature making headlines is coding. GenAI apps are getting better at reading and writing code (codegen) in various programming languages such as Python, Java, C++ and so on.

On the evaluation side there are all kinds of benchmarks and leaderboards that track progress on codegen. Additionally, aspects of usability, platform support, IDE integration, etc are all key factors for using codegen.

In terms of local evaluations, Ollama provides handy options. With Ollama it's easy to download and run LLMs from various providers (Llama, Gemma, etc.). Most now support codegen and readily follow instructions in a chat to churn out basic-level Python code (a quick sketch using the Ollama Python client follows the list):

  •  Llama3.2
  • Gemma3
  • Codegemma 
  • Falcon3
  • Starcoder
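For example, a quick, hedged sketch of asking one of these models for code via the Ollama Python client (the model tag and the prompt are arbitrary choices for illustration):

    import ollama   # assumes the ollama Python client and a pulled model such as codegemma

    prompt = "Write a Python function that checks whether a string is a palindrome."
    reply = ollama.chat(model="codegemma", messages=[{"role": "user", "content": prompt}])
    print(reply["message"]["content"])   # prints the generated Python snippet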