Algorithms, Design, Code and more: machine learning


Friday, December 19, 2025

Reinforcement Learning

An important ML training paradigm is Reinforcement Learning (RL). RL models rely on a reward value generated during each training run (episode) to update the parameters (weights) of the model. This differs from other ML methods such as Supervised Learning, where the model learns from labelled data/ examples. It is also different from the Unsupervised Learning approach, where the model explores inherent features of the unlabelled data during the learning phase to identify clusters, etc.

The keras-io examples include some RL implementations such as actor_critic, ppo, etc. All of them work only with the TensorFlow (tf) backend. In keras_io_examples_rl these have been ported to the Torch/ PyTorch backend. The typical changes include:

Torch Imports
Use torch specific Optimizer - torch.optim.Adam

deep_q_network_breakout_pytorch requires gradient clipping, which in torch is done before optimizer.step()

Gradient computations in torch

Replace tf GradientTape with torch autograd
Disable gradients globally with torch.set_grad_enabled(False)
Enable autograd within specific flows/ methods where needed
Call loss.backward(), optimizer.step() for backpropagation

A few torch-specific tensor & function changes/ wrappers
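
The gradient related changes listed above can be sketched roughly as below. This is only a minimal illustration and not the ported code: the tiny Linear model, the random data and the train_step() helper are all made up for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Gradients are switched off globally, as in the list above.
torch.set_grad_enabled(False)

model = nn.Linear(4, 2)  # hypothetical stand-in for a policy/ value network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)


def train_step(states, targets, max_norm=1.0):
    """One update: forward pass, backward pass, clip, then step."""
    # Re-enable autograd only inside the flow that needs it.
    with torch.enable_grad():
        loss = nn.functional.mse_loss(model(states), targets)
        optimizer.zero_grad()
        loss.backward()  # takes the place of the tf GradientTape gradient call
        # Clipping happens after backward() and before optimizer.step().
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
    return loss.item()


states = torch.randn(8, 4)
targets = torch.zeros(8, 2)
losses = [train_step(states, targets) for _ in range(50)]
print(losses[0], losses[-1])
```

Outside train_step() autograd stays disabled, so inference style calls such as picking the next action do not build a computation graph.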

The ported pytorch compatible files are:

References

http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf
https://hal.inria.fr/hal-00840470/document
https://link.springer.com/content/pdf/10.1007/BF00992698.pdf
https://www.semanticscholar.org/paper/Human-level-control-through-deep-reinforcement-Mnih-Kavukcuoglu/340f48901f72278f6bf78a04ee5b01df208cc508
Continuous control with deep reinforcement learning (Deep Deterministic Policy Gradient, DDPG): https://arxiv.org/abs/1509.02971
https://gymnasium.farama.org/
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

Thursday, December 4, 2025

Drift Detection across Distinct Reviews Datasets

Model Drift leads to invalid results from AI/ ML inference models in production. Drift can have various causes, such as concept drift, structural changes and ingestion pipeline issues with upstream data sources, domain change, prompt injections and other model exploits, etc. As a result, an AI/ ML model that was trained on certain kind(s) of data ends up running inference on a different (drifted) dataset, which leads to incorrect results. So Drift detection (periodic, near real-time, etc) is crucial for any productionized model.

As mentioned previously Evidently is a handy library to do drift detection. Evidently has features like Metrics, Descriptors, Eval, etc that can be plugged in to detect drift in the current data vis-a-vis a reference baseline data (~training data).

In DriftTextReviews.py, Drift detection is done for an existing Text Classification model in PyTorch originally trained on an Imdb movie review dataset. For Reference data a sample of the same Imdb movie data is used. For Current, data from a completely different domain of Code Reviews is used. As expected, significant drift was detected between these two review datasets from two completely different domains. Evidently reports below make the drift evidently clear!

The characteristic words have changed across the two domains. While the movie domain includes words like frame, character, minutes, etc, the coding domain has words like readable, test, method, etc.
In terms of length of the text, Imdb reviews are much longer and include many more words than the Code reviews. These text length and word count features, hooked in as Descriptors, are duly detected and shown in the reports.
Interestingly, the Labels Positive (1)/ Negative (0) show no Drift. Across both datasets an equal number of Positive/ Negative Labels is seen.
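
To get a feel for what a drift check on such a descriptor does, here is a rough standard-library sketch. It does not use Evidently or its API; it simply compares a word count feature across two toy samples with a two-sample Kolmogorov-Smirnov statistic, which is one common choice for numerical features. The sample reviews, labels and DRIFT_THRESHOLD are invented for the illustration.

```python
from bisect import bisect_right


def word_count(text):
    """Descriptor: number of whitespace-separated words in a review."""
    return len(text.split())


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


# Toy stand-ins for the two datasets (not the actual Imdb/ code review data).
reference_reviews = [
    "The characters were flat and the final forty minutes dragged on forever",
    "Every frame of this film is gorgeous and the lead character is unforgettable",
    "A slow start but the story pays off in the last few minutes of the film",
]
current_reviews = [
    "Method is readable",
    "Please add a test",
    "Rename this method",
]
ref_labels, cur_labels = [1, 0, 1, 0], [0, 1, 0, 1]

ref_counts = [word_count(r) for r in reference_reviews]
cur_counts = [word_count(r) for r in current_reviews]

DRIFT_THRESHOLD = 0.5  # assumed cut-off, chosen only for this illustration
length_drift = ks_statistic(ref_counts, cur_counts)
label_drift = ks_statistic(ref_labels, cur_labels)
print("word count drift:", length_drift, length_drift > DRIFT_THRESHOLD)
print("label drift:", label_drift, label_drift > DRIFT_THRESHOLD)
```

The word counts of the two toy samples do not overlap at all, so the statistic hits its maximum, while the evenly split labels give a statistic of zero, mirroring the pattern in the reports.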


Fig 1: Drift Review Length & Word Count	Fig 2: No Drift in Label

Fig 3: Characteristic Words - Current	Fig 4: Characteristic Words - Reference

Tuesday, December 2, 2025

Mixture of Experts and Switch Transformer

Mixture of Experts (MoE) is an innovative horizontal scaling technique applied to the basic Transformer architecture. The Feed Forward (FFN) Layer of the Transformer is replaced with a MoE layer, which is a collection of N Experts (each one a separate FFN) in parallel. The MoE also includes a Router layer with a (learnt) gating logic to decide which expert(s) to route each token to.

One of the early MoE based Transformers was the Switch Transformer with a MoE routing layer. The Switch Transformer specifically includes logic to enable balancing of token loads across the different Experts in order to prevent hot-spots, where only a few experts end up handling a majority of the tokens. This also leads to a second issue where the other experts remain untrained through training thereby rendering them useless for inference.

There are several state-of-the-art (SOTA) MoE implementations available on the different ML platforms. The Keras-io examples have an implementation of the Switch Transformer. The code text_classification_switch_transformer_pytorch.py is a PyTorch port of the same code with a couple of changes done to make the code modular and resolve issues with the super().__init__() call and position_in_expert.

Further, a much simpler SwitchRouter implementation is done in SwitchTransformerUtil.SimpleSwitchRoute(). The code flow is:

Compute gateLogits, with option to add Noise to load balance during training
Compute weights & selectedExperts indexes of the topK experts
Compute auxLoss to be minimized for balancing load across experts
Finally, for every expert, fetch weights, invoke expert to get the outputs
Also drop tokens beyond expert capacity threshold

Fairly straightforward!
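
For illustration, the flow above can be sketched in plain Python as follows. This is not the SwitchTransformerUtil code: it assumes top-1 routing without noise, leaves out the scaling coefficient usually applied to the auxiliary loss, and switch_route() is a name made up for the example.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(gate_logits, capacity):
    """Top-1 routing for a batch of tokens.

    gate_logits: one list of per-expert logits for each token.
    Returns (assignments, aux_loss); assignments[i] is (expert, weight),
    or None when the token was dropped because the expert was full.
    """
    num_tokens, num_experts = len(gate_logits), len(gate_logits[0])
    probs = [softmax(row) for row in gate_logits]

    load = [0] * num_experts    # tokens accepted by each expert
    chosen = [0] * num_experts  # tokens that picked each expert (before dropping)
    assignments = []
    for p in probs:
        expert = max(range(num_experts), key=lambda e: p[e])
        chosen[expert] += 1
        if load[expert] < capacity:
            load[expert] += 1
            assignments.append((expert, p[expert]))
        else:
            assignments.append(None)  # beyond capacity: token is dropped

    # Auxiliary loss: N * sum_i f_i * P_i, where f_i is the fraction of tokens
    # routed to expert i and P_i is its mean router probability.
    fraction = [c / num_tokens for c in chosen]
    mean_prob = [sum(p[e] for p in probs) / num_tokens for e in range(num_experts)]
    aux_loss = num_experts * sum(f * m for f, m in zip(fraction, mean_prob))
    return assignments, aux_loss


# Four tokens, two experts; the first three tokens all prefer expert 0.
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
assignments, aux_loss = switch_route(logits, capacity=2)
print(assignments)
print(round(aux_loss, 4))
```

With a capacity of 2, the third token headed for expert 0 gets dropped. The auxiliary loss equals 1.0 for a perfectly uniform router and comes out higher here because expert 0 is favoured.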

References

https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe?utm_source=publication-search
https://medium.com/@pilliudayaditya1207/understanding-mixture-of-experts-switch-transformers-load-balancing-vs-mixtral-s-natural-balance-25ed528cadfe
https://huggingface.co/blog/NormalUhr/moe-balance
https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts

Monday, December 1, 2025

Evidently - Model Drift

Evidently is a Python library to evaluate and monitor AI/ ML projects. Evidently can be used to detect Drift in models over time.

Reports from running the Evidently Metrics Cookbook give a good feel of its capabilities and features. More to follow...


Fig 1: Drift Report	Fig 2: Generator Drift Report

References

https://www.nannyml.com/blog/monitoring-computer-vision
https://www.labellerr.com/blog/computer-vision-data-drift/
https://blog.roboflow.com/monitor-data-drift-computer-vision/
https://nexla.com/ai-infrastructure/data-drift/
https://cobusgreyling.medium.com/llm-drift-prompt-drift-chaining-cascading-fa8fbf67c0fd
https://www.splunk.com/en_us/blog/learn/model-drift.html
https://en.wikipedia.org/wiki/Concept_drift
https://arize.com/model-drift/

Saturday, November 29, 2025

Fine Tuning Text Classification Model

Fine tuning takes a fully trained base (foundation) model and retrains/ repurposes it to meet some different objective(s). The key aspect of fine tuning is that it is not a complete/ full retraining of the base model. It's done on a much smaller training budget, keeping the weights of the original model intact and bringing in a much smaller additional set of trainable weights known as adapters.

These adapter weights are typically Low Rank matrices, hence the name LoRA (Low-Rank Adaptation). With every round of training only these LoRA weights get updated while the weights from the base model stay frozen. Since the final weights are additive, these can simply be added, giving us the following LoRA model:

output = f(W_base*x + b_base + B*A*x), where for any given input x

W_base, b_base: Base model weights & bias which remain fixed
B, A: Low Rank Adapter weights of a small rank (r), which are trained during fine tuning
f: Activation Function
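
A small worked example of this formula in plain Python is shown below. The shapes, the numbers and the choice of ReLU for f are made up for the illustration; starting B at zero is a common convention so that the adapter initially leaves the base output unchanged.

```python
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W_base, b_base, B, A, f=lambda z: max(z, 0.0)):
    """output = f(W_base*x + b_base + B*A*x); f defaults to ReLU here."""
    base = matvec(W_base, x)
    delta = matvec(B, matvec(A, x))  # low-rank path: project down, then up
    return [f(w + b + d) for w, b, d in zip(base, b_base, delta)]


W_base = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # frozen, shape (d_out=2, d_in=3)
b_base = [0.0, 1.0]                           # frozen
A = [[1.0, 1.0, 1.0]]                         # trainable, shape (rank=1, d_in=3)
B_start = [[0.0], [0.0]]                      # trainable, shape (d_out=2, rank=1)
B_tuned = [[0.5], [-0.25]]                    # pretend training has moved B

x = [2.0, 1.0, 1.0]
before = lora_forward(x, W_base, b_base, B_start, A)
after = lora_forward(x, W_base, b_base, B_tuned, A)
print(before)  # [1.0, 3.0] same as the base model, since B is all zeros
print(after)   # [3.0, 2.0]

# Parameter counts for a hypothetical 128 x 128 layer with rank 4.
full_params = 128 * 128
adapter_params = 4 * (128 + 128)
print(full_params, adapter_params)  # 16384 1024
```

The adapter adds r * (d_in + d_out) weights per layer instead of d_in * d_out, which is where the large drop in trainable parameters comes from.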

In the example TextClassificationFineTuningLora.py the working of the LoRA adapter for fine tuning a Text Classification model is demonstrated.

Fine Tuning Objective

The base model is a Sentiment Classifier on the Imdb dataset which labels data as either 1 (Positive) or 0 (Negative).
The base model is repurposed as an Exaggeration detector that labels the data as either 1 (Has-Exaggerations) or 0 (No Exaggerations).
Exaggerations are any reviews containing the exaggerations defined in: TextClassificationTorchUtilities.exaggerations. Also see: TextClassificationTorchUtilities.buildExaggerationDataset().

Fine Tuning Details

The base model had 2.67 Mn total parameters of which 8.86 Lakh (~886K) parameters were trainable. For fine tuning these 8.86 Lakh parameters are all frozen.
LoRA is applied to every trainable layer of the base model. Each trainable layer of the base model is set to enable_lora(rank=4). This results in total trainable parameters of just ~30.6K.
After fine tuning the model is able to identify Exaggerations with an accuracy in the high 90's.

Friday, November 28, 2025

Knowledge Distillation

Knowledge distillation from a trained large Teacher model to a smaller Student model is a very popular technique in Machine Learning (ML). Distillation helps to train a Student model which despite being much smaller and compressed shows performance comparable to the larger Teacher model.

The other advantage of Distillation is that the Student model requires a much smaller set of labelled training data (<10%) since it's essentially trying to match the output of the Teacher during training. The Distillation loss is a function of the difference between the prediction of the Student (y_pred) & the Teacher models (teacher_pred) for every training input (x). Kullback-Leibler divergence (KLDivergence) loss between student_pred (y_pred) & teacher_pred is a common pick for the Distillation loss.
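
As a quick illustration of the loss, the snippet below computes the KL divergence between a teacher's and a student's class probabilities using only the standard library. The logits are invented for the example and the helper names are not taken from TextClassificationDistillation.py.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) = sum_i t_i * log(t_i / s_i)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )


teacher_pred = softmax([2.0, -1.0])
close_student = softmax([1.8, -0.9])  # nearly agrees with the teacher
far_student = softmax([-1.0, 2.0])    # picks the opposite class

same = kl_divergence(teacher_pred, teacher_pred)
close = kl_divergence(teacher_pred, close_student)
far = kl_divergence(teacher_pred, far_student)
print(same, close, far)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two drift apart, so minimizing it pulls the student's outputs towards the teacher's.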

For a working example of Distillation refer to TextClassificationDistillation.py (Student) which is distilled from a Keras Text Classification model in Torch (Teacher). The original Text Classification Teacher model had several Convolution layers which have been replaced by a Dense layer. Also the Input Embedding layer's output dimension has been reduced from 128 to 32.

The original Text Classification model (Teacher) had ~2.67 Mn parameters (8.9 Lakh trainable) and was trained with 25K data samples. The distilled Student model has only ~1.6 Lakh parameters (~18% of the Teacher's trainable parameters) and was trained using 2.5K samples (~10%). In terms of the size of the saved models, the Teacher model is 10.2 MB vs 0.6 MB for the Student. There was only a marginal 4% drop in accuracy seen with the Student model on the held-out test data.


Fig 1: Text Classification - Teacher Model	Fig 2: Keras Text Classification - Student Model

Wednesday, November 26, 2025

Explainable AI

With widespread adoption of large Machine Learning (ML) models, there's a real need for understanding the workings of the models. The model otherwise just appears to be a black-box doing its thing, without the end user really knowing the whys/ hows behind the model's responses, choices, decisions, etc. Looking inside the model - the white-box approach - while possible is simply not practical for 99.99..9% of users.

Local Interpretable Model-Agnostic Explanations (LIME) & SHapley Additive exPlanations (SHAP) are two black-box techniques that help explain the workings of such models. The key idea behind both being:

To generate some (synthetic) input data from actual data with some of the features (such as income, age, etc) of the data altered at random.
Then to use the generated input data with the model and use the output to understand the effects of the altered features (one or more/ combinations) on the output. Thereby, understand the importance/ relevance of the features on the outputs of the model.
For example, in a loan approval/ rejection scenario, by altering two features (income level & gender) in the input and testing, one might discover that income level has an effect on the decision but gender does not.

With that background, let's look at SHAP for language models that take texts as input. Here features are the words (tokens) that comprise the input string.

For an input like: "Glad to see you",

The features are: "Glad", "to", "see", "you"

Shap would explain the impact of each word (token) on the output of the model by passing in various altered data with words MASKED:
"* to see you", "Glad to * you", ...

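The masking idea can be tried out by hand on this four word input. The sketch below does not use the shap library: it enumerates every masked variant and computes exact Shapley values against a made-up toy_model(), which is only feasible for a handful of tokens.

```python
from itertools import combinations
from math import factorial

MASK = "*"


def mask_tokens(tokens, keep):
    """Replace every token whose index is not in `keep` with the mask."""
    return [tok if i in keep else MASK for i, tok in enumerate(tokens)]


def toy_model(tokens):
    """Hypothetical scorer standing in for the real classifier."""
    weights = {"Glad": 0.4, "see": 0.1}
    return 0.5 + sum(weights.get(tok, 0.0) for tok in tokens)


def shapley_values(tokens, model):
    """Exact Shapley value per token: its marginal contribution to the
    model output, averaged over every subset of the other tokens."""
    n = len(tokens)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = model(mask_tokens(tokens, set(subset)))
                with_i = model(mask_tokens(tokens, set(subset) | {i}))
                total += weight * (with_i - without_i)
        values.append(total)
    return values


tokens = ["Glad", "to", "see", "you"]
print(" ".join(mask_tokens(tokens, {1, 2, 3})))  # * to see you

base_value = toy_model(mask_tokens(tokens, set()))  # everything masked
contributions = shapley_values(tokens, toy_model)
for tok, val in zip(tokens, contributions):
    print(tok, round(val, 3))
print(round(base_value + sum(contributions), 3), round(toy_model(tokens), 3))
```

The base value plus the per-token contributions adds up to the model's output for the unmasked text, which is the same additive reading used in the SHAP report further down.
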
TextClassificationTorchShap.py shows how SHAP works with the Text Classification Model trained using the Imdb dataset. The code requires shap to be installed:

pip3 install shap

In terms of its working it loads up the pre-trained Text Classification model and vocabulary. Next it plugs in to the shap library using a shap custom tokenizer to generate token_ids & offsets for the given input data.

masker = maskers.Text(custom_tokenizer, mask_token=SPECIAL_TOKEN_UNK)
explainer = shap.Explainer(predict,masker=masker)

Finally, shap is called with some sample input text which has words masked at random. Shap collects the outputs which can be used to generate a visual report of the impact of the different words as seen below.

The model classifies any given input text as either POSITIVE (score near 1) or NEGATIVE (score near 0). The figure is showing output for two input data: "This is a great one to watch." & "What a long drawn boring affair to the end credits."

Let's look first at "This is a great one to watch.":

There is a base value = 0.539161 which is the model's output for a completely MASKED out input, i.e. "* * * * * * *"
The words "to w..", "This is" move up the score to 0.7
In addition, the words "a great" move up the score to 0.996787, the actual output of the model for the complete input text "This is a great one to watch."
The model rightly classifies this as POSITIVE with a score of 0.996787 (close to 1)

Similarly for the text "What a long drawn boring affair to the end credits.":

Completely masked base value = 0.539161.
The key words in this case are "boring affair to the".
The text is rightly classified as NEGATIVE with a score of 0.0280297 (close to 0).

Monday, November 24, 2025

Model Quantization in Keras

Quantization is widely applied to ML models these days to reduce the numerical precision of model parameters such as weights. For context:

A typical Llm weight is a floating point number in FP32 precision, which uses 32 bits.
With quantization to a lower precision Int4, which uses 4-bits, there's 8x saving per weight.
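
The arithmetic can be made concrete with a generic symmetric quantizer in plain Python. This is a sketch of the general idea only, not the scheme Keras uses internally; the weights and helper names are invented for the example.

```python
def quantize_int4(weights):
    """Symmetric quantization: map floats to integers in [-8, 7] plus one scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


weights = [0.70, -0.30, 0.10, 0.00, -0.70, 0.52]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)

print(codes)                            # [7, -3, 1, 0, -7, 5]
print([round(r, 2) for r in restored])  # 0.52 comes back as 0.5

fp32_bits = 32 * len(weights)
int4_bits = 4 * len(weights)  # ignoring the small overhead of storing the scale
print(fp32_bits // int4_bits)  # 8
```

Each weight is stored as one of only 16 integer codes, so values like 0.52 get rounded to the nearest representable step. That rounding error is the quality cost discussed below.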

With Models having several billions to trillions of such parameters, quantization results in much lower space utilization and storage requirement for the trained model. More importantly, at inference time the lower precision parameters are loaded to the memory, register, gpu, tpu, etc much quicker than the corresponding higher precision parameters, thereby increasing the inference speed significantly and lowering costs, energy utilization, etc. So the benefits compound with every run.

But then again, there are no free lunches. The quality of the results is lower with lower precision quantized models, leading to a speed-size-cost vs quality tradeoff. There are several use cases (chat, image generation, embedded use in mobile app, etc) where the slightly lower quality outputs may be acceptable, so the quantized model wins. Similarly with object classifiers sometimes a confidence score of 88% is as good as a higher precision 88.238871%! While for deep research, thinking, planning type use cases the full/ high precision model is preferred.

The Keras library makes it very easy to quantize trained models. Training is in full/ high precision while quantization is done after the model is fully trained. To explain this we return to the trained Keras Text Classifier Model. In TestTextClassificationTorch.py -> testQuantizeAndSaveModel() the trained model is loaded, quantized to the "int4" QUANTIZATION_MODE and saved:

model=keras.models.load_model(SAVE_TO_DIR+'TextClassificationTorchModel.keras')
model.quantize(QUANTIZATION_MODE)

After that the quantized model can be saved and used for running inferences in place of the full precision model. For inference the same saved vocabulary of the full precision model is used by the quantized model and will need to be loaded as shown in TextClassificationTorchInference.py.

Saturday, November 22, 2025

Text Classification from Scratch using PyTorch

The AI/ ML development framework Keras 3x in recent times has introduced support for Torch & Jax backends in addition to Tensorflow. However, given Keras's Tensorflow legacy large sections of the code are deeply integrated with Tensorflow.

One such piece of code is the text_classification_from_scratch.py from keras-io/examples project. Without tensorflow this piece of code simply doesn't run!

Here's text_classification_torch.py a Torch/ PyTorch port of the same code. The bits that needed modification were:

Removing all tensorflow related imports
Loading the Imdb text files in "grain" format in place of "tf" format, by passing the appropriate param:

keras.utils->text_dataset_from_directory(format="grain")

Which obviously needed grain to be installed:

pip3 install grain

Using torchtext for Vocab/ Tokenizer/ Vectorizing :

pip3 install torchtext

A few other changes, such as ensuring that the max_features constraint is honoured and that the text is standardized, padded, and so on.
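
To show what honouring max_features, standardizing and padding amount to, here is a standard-library sketch. It is a stand-in for illustration only and does not use torchtext or the code in text_classification_torch.py; the special tokens and function names are made up.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def standardize(text):
    """Lower-case, then drop HTML line breaks and punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub(r"[^a-z0-9\s]", "", text)


def build_vocab(texts, max_features):
    """Keep the most frequent words so the vocabulary never exceeds max_features."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_features - 2)]
    return {tok: idx for idx, tok in enumerate([PAD, UNK] + words)}


def vectorize(text, vocab, sequence_length):
    """Map words to ids, then truncate or right-pad to a fixed length."""
    ids = [vocab.get(w, vocab[UNK]) for w in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [vocab[PAD]] * (sequence_length - len(ids))


reviews = ["A great film!<br />Great cast.", "A boring film."]
vocab = build_vocab(reviews, max_features=5)
vector = vectorize("Great film, terrible ending", vocab, sequence_length=6)
print(vocab)
print(vector)
```

Words outside the capped vocabulary map to the unknown token, and every review comes out as a list of exactly sequence_length ids.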

Saturday, November 15, 2025

Guardrails & Guard-Llm's

With wide scale adoption of Llm's & Agentic models in production, there's also a pressing need to verify/ secure the GenAI inputs & outputs. This should ideally be done in real-time before serving the response to the end user, to ensure that no invalid, harmful, hateful, confidential, etc content goes through in either direction. In the GenAI world such issues are handled via Guardrails.

The simple idea with Guardrails is to apply intelligent input/ output filters that sanitize the traffic and block both bad requests and bad responses from getting through. There are many ways of implementing Guardrails, such as pattern based filters, rule engines, etc. Though these have worked so far, in an ever changing Agentic world it's now up to the self learning guard Llm's to judge & flag inappropriate inputs/ outputs!

Guard llm's are specifically trained to flag out harmful content. One such implementation is llama-guard which flags out violations of any of the ML Commons AI Safety Taxonomies/ Categories.

An implementation of the guard-llm can be found in the ApiCaller project. More specifically the ApiCaller->invokeWithGuardrails():

First calls a local Ollama model with sanitized input to get a response
Then calls the isSafe() method with the received response

isSafe() internally makes a call to a different Ollama model llama-guard which marks the content as safe/ unsafe
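
The flow can be sketched as below. This is not the ApiCaller code: the two models are passed in as plain callables so the logic runs without Ollama, the snake_case names are invented, and the 'safe'/ 'unsafe' reply format of the guard is an assumption of this sketch.

```python
def invoke_with_guardrails(prompt, call_model, call_guard, sanitize=str.strip):
    """Call the chat model, then let a guard model vet the response.

    call_model and call_guard are plain callables (text -> text) so the
    flow can be exercised without a running Ollama instance.
    """
    response = call_model(sanitize(prompt))
    if not is_safe(response, call_guard):
        return "Response withheld: flagged as unsafe."
    return response


def is_safe(content, call_guard):
    """Assumes the guard replies with text starting with 'safe' or 'unsafe'."""
    verdict = call_guard(content).strip().lower()
    return verdict.startswith("safe")


# Stubs standing in for the two local models.
def fake_chat_model(prompt):
    return "echo: " + prompt


def fake_guard_model(content):
    return "unsafe\nS1" if "attack" in content else "safe"


ok = invoke_with_guardrails("  hello there  ", fake_chat_model, fake_guard_model)
blocked = invoke_with_guardrails("plan an attack", fake_chat_model, fake_guard_model)
print(ok)
print(blocked)
```

Swapping the stubs for real calls to the two Ollama models keeps the control flow unchanged: respond first, then vet the response before it reaches the user.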

Check the TestApiCaller.py test case for better clarity.

References

https://mlcommons.org/2024/04/mlc-aisafety-v0-5-poc/
https://www.ibm.com/think/tutorials/llm-guardrails
https://ollama.com/library/llama-guard3

Friday, November 14, 2025

LangWatch Scenario with Ollama

LangWatch Scenario is a pytest based framework for testing AI Agents. Scenario runs with Openai compatible api's. Here we show how to get LangWatch running using local Llm's with Ollama.

The code test_ollama_client.py is along the same lines as the test_azure_api_gateway.py from the scenario python examples folder.

Changes specific to Ollama being:

1. Set-up

pip3 install langwatch-scenario

Environment variables

export OPENAI_API_BASE_URL=http://localhost:11434/api/
export OPENAI_API_KEY=NOTHING

2. Create Ollama client

ollama_client() -> OpenAI(base_url=<OLLAMA_BASE_URL>)

3. Configuring the Ollama model (gemma, etc) & custom_llm_provider ("ollama") in the Agents (UserSimulatorAgent & JudgeAgent)

scenario.UserSimulatorAgent(model=OLLAMA_MODEL, client=custom_client, custom_llm_provider=CUSTOM_LLM_PROVIDER)...

For better clarity see test_ollama_client.py.

4. Offline LangWatch Scenario Reporter

For every run LangWatch uploads run results to app.langwatch.ai endpoint. For a truly offline run set the LANGWATCH_ENDPOINT location:

export LANGWATCH_ENDPOINT=<https://YOUR_REPORTING_ENDPOINT>

There's no option to disable scenario reporting for now. The only workaround is to set LANGWATCH_ENDPOINT to an invalid value (e.g. "http://localhost2333/invalid").

Wednesday, November 5, 2025

Agent2Agent (A2A) with a2a-sdk and Http2

Continuing with the A2A evaluation, next up is a2a-sdk (unrelated to the previously evaluated a2a-server). This evaluation is largely based on getting the hello world from the a2a-samples project working as per the instructions of a2a-protocol, with additional integration with other Http2 based non-Python clients.

(I) Installation

pip install a2a-sdk # uvicorn python-dotenv (packages existing)

# For Http2 support

pip install hypercorn

pip install h2==4.2.0 # See Issue 1 at the end & the bug details

git clone https://github.com/a2aproject/a2a-samples.git -b main --depth 1

(II) Replace uvicorn with hypercorn for Http2

The a2a-samples make use of the uvicorn python server. However, uvicorn is a Http1.x compliant server and doesn't support Http2. The following message keeps showing up when a client sends Http2 requests:

"WARNING: Unsupported upgrade request. "

In order to support a wider & more up to date category of clients, uvicorn is replaced with hypercorn, which is Http2 compliant.

In order to switch to hypercorn, the following changes are done to __main__.py of the helloworld python project:

#import uvicorn

# Use Hypercorn for Http2
import asyncio
from hypercorn.config import Config
from hypercorn.asyncio import serve

....

config = Config()
config.bind="127.0.0.1:8080" # Binds address/ port

asyncio.run(serve(server.build(), config))
# uvicorn.run(server.build(), host='127.0.0.1', port=8080, log_level='debug')

(III) Run helloworld

python a2a-samples/samples/python/agents/helloworld/__main__.py

(IV) View AgentCard

Open in the browser or via curl:

curl http://127.0.0.1:8080/.well-known/agent-card.json

Response:

{"capabilities":{"streaming":true},"defaultInputModes":["text"],"defaultOutputModes":["text"],"description":"Just a hello world agent","name":"Hello World Agent","preferredTransport":"JSONRPC","protocolVersion":"0.3.0","skills":[{"description":"just returns hello world","examples":["hi","hello world"],"id":"hello_world","name":"Returns hello world","tags":["hello world"]}],"supportsAuthenticatedExtendedCard":true,"url":"http://127.0.0.1:8080/","version":"1.0.0"}

For the Authenticated Extended Agent Card:

curl -H "Authorization: Bearer dummy-token-for-extended-card" --http2 http://127.0.0.1:8080/agent/authenticatedExtendedCard

Response:

{"capabilities":{"streaming":true},"defaultInputModes":["text"],"defaultOutputModes":["text"],"description":"The full-featured hello world agent for authenticated users.","name":"Hello World Agent - Extended Edition","preferredTransport":"JSONRPC","protocolVersion":"0.3.0","skills":[{"description":"just returns hello world","examples":["hi","hello world"],"id":"hello_world","name":"Returns hello world","tags":["hello world"]},{"description":"A more enthusiastic greeting, only for authenticated users.","examples":["super hi","give me a super hello"],"id":"super_hello_world","name":"Returns a SUPER Hello World","tags":["hello world","super","extended"]}],"supportsAuthenticatedExtendedCard":true,"url":"http://127.0.0.1:8080/","version":"1.0.1"}

(V) Send/ Receive message to Agent

curl -H "Content-Type: application/json" http://127.0.0.1:8080 -d '{"jsonrpc":"2.0","id":"ee22f765-0253-40a0-a29f-c786b090889d","method":"message/send","params":{"message":{"role":"user","parts":[{"text":"hello there!","kind":"text"}],"messageId":"ccaf4715-712e-40c6-82bc-634a7a7136f2","kind":"message"},"configuration":{"blocking":false}}}'

Response:

{"id":"ee22f765-0253-40a0-a29f-c786b090889d","jsonrpc":"2.0","result":{"kind":"message","messageId":"d813fed8-58cd-4337-8295-6282930d4d4e","parts":[{"kind":"text","text":"Hello World"}],"role":"agent"}}
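
The same request body can be put together programmatically. The sketch below only builds and parses the JSON shown above using the standard library (it does not use a2a-sdk and sends nothing over the network); build_message_send() and extract_text() are names made up for the example.

```python
import json
import uuid


def build_message_send(text, blocking=False):
    """Build the JSON-RPC 2.0 body for the A2A "message/send" method."""
    return {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": str(uuid.uuid4()),
                "kind": "message",
            },
            "configuration": {"blocking": blocking},
        },
    }


def extract_text(response):
    """Collect the text parts of a "message" result."""
    return [p["text"] for p in response["result"]["parts"] if p["kind"] == "text"]


request = build_message_send("hello there!")
print(json.dumps(request))

# The response shown above, parsed the same way a client would.
reply = json.loads(
    '{"id":"ee22f765-0253-40a0-a29f-c786b090889d","jsonrpc":"2.0",'
    '"result":{"kind":"message","messageId":"d813fed8-58cd-4337-8295-6282930d4d4e",'
    '"parts":[{"kind":"text","text":"Hello World"}],"role":"agent"}}'
)
print(extract_text(reply))
```

The serialized request can be posted with any Http client, curl included, to the agent's base url.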

(VI) Send/ Receive via Http2

curl -iv --http2 http://127.0.0.1:8080/.well-known/agent-card.json

curl -iv --http2 -H "Content-Type: application/json" http://127.0.0.1:8080 -d '{"jsonrpc":"2.0","id":"ee22f765-0253-40a0-a29f-c786b090889d","method":"message/send","params":{"message":{"role":"user","parts":[{"text":"dragons and wizards","kind":"text"}],"messageId":"ccaf4715-712e-40c6-82bc-634a7a7136f2","kind":"message"},"configuration":{"blocking":false}}}'

(The responses are the same as shown above)

(VII) Send/ Receive from Java client

TBD

(VIII) Issues

Issue 1: Compatibility issue with hypercorn (ver=0.17.3) & latest h2 (ver=4.3.0)

Ran into the issue mentioned here:

| File "../lib/python3.13/site-packages/hypercorn/protocol/h2.py", line 138, in initiate
| event = h2.events.RequestReceived()
| TypeError: RequestReceived.__init__() missing 1 required keyword-only argument: 'stream_id'

Issue was resolved by downgrading to h2 (ver=4.2.0).

Tuesday, November 4, 2025

Agent2Agent (A2A) with a2a-server

Agent2Agent (A2A) is a protocol for AI agents to communicate amongst themselves. These Agents, though built by different vendors, get a standardized way of invoking/ operating with each other by subscribing to the common a2a protocol. To get going with A2A:

(I) Install a2a-server

pip install a2a-server

Issue 1: Compatibility issue between latest a2a-server & a2a-json-rpc:

a2a-server also installs a2a-json-rpc, but there're compatibility issues between the latest a2a-json-rpc (ver. 0.4.0) & a2a-server (ver. 0.6.1)

ImportError: cannot import name 'TaskSendParams' from 'a2a_json_rpc.spec' (.../python3.13/site-packages/a2a_json_rpc/spec.py)

Downgrading a2a-json-rpc to ver 0.3.0 fixes it:

pip install a2a-json-rpc==0.3.0

(II) Set up agent.yaml: a2a-server agent.yaml file needs to be built with the configs like host, port, handler, provider, model, etc:

server:
  host: 127.0.0.1
  port: 8080

handlers:
  use_discovery: false
  default_handler: chuk_pirate
  chuk_pirate:
    type: a2a_server.tasks.handlers.chuk.chuk_agent_handler.ChukAgentHandler
    agent: a2a_server.sample_agents.chuk_pirate.create_pirate_agent
    name: chuk_pirate
    enable_sessions: false
    enable_tools: false
    provider: "ollama"
    model: "llama3.2:1b"
    version: "1.0.1"

    agent_card:
      name: Pirate Agent
      description: "Captain Blackbeard's Ghost with conversation memory"
      capabilities:
        streaming: false
        sessions: false
        tools: false

(III) Start a2a-server

a2a-server -c agent.yaml --log-level debug

(IV) Test a2a-server endpoint from browser

Open http://127.0.0.1:8080/ which lists the different Agents/ Agent Cards:

http://127.0.0.1:8080/chuk_pirate/.well-known/agent.json

However, this didn't quite work. It rather led to uncovering issues with a2a-server.

Issue 2: Agent Card endpoint url

Firstly, the Agent Card endpoint shown above is no longer a valid endpoint. As per the latest Agent Card protocol the Agent Card needs to be served from the location: http://<base_url>/.well-known/agent-card.json.

agent-card.json (& not agent.json)
Without the agent's name (i.e. without chuk_pirate)

The valid one would look like:

http://127.0.0.1:8080/.well-known/agent-card.json

Issue 3: Error message/send not found

The other issue is that there seems to be a lack of support for the method "message/send" used to send messages and chat with the agent. The curl request fails with an error:

curl -iv -H "Content-Type: application/json" http://127.0.0.1:8080/chuk_pirate -d '{"jsonrpc":"2.0","id":"ee22f765-0253-40a0-a29f-c786b090889d","method":"message/send","params":{"message":{"role":"user","parts":[{"text":"hello there!","kind":"text"}],"messageId":"ccaf4715-712e-40c6-82bc-634a7a7136f2","kind":"message"},"configuration":{"blocking":false}}}'

{"jsonrpc":"2.0","id":"ee22f765-0253-40a0-a29f-c786b090889d","result":null,"error":{"code":-32601,"message":"message/send not found"}}

Due to all these issues with a2a-server and its lack of documentation there's no clarity on the library. So a no-go for the moment!

Sunday, November 2, 2025

DeepEval

DeepEval helps to test and verify the correctness of LLMs. DeepEval is a framework with a suite of Metrics and Synthetic Data generation, with integrations across leading AI/ ML libraries.

DeepEval can be used to set-up one LLM to judge the output of another LLM. This JudgeLLM set-up can be used at both the training as well as live inference stage for MLOps scenarios.

Getting started with DeepEval is simple with Ollama.

(I) Installation

pip install deepeval

Ollama installation was covered previously with a llama3.2 base model.

(II) Set Ollama model in DeepEval

# Unset the openai model - default for DeepEval

deepeval unset-openai

# Set ollama model for DeepEval

deepeval set-ollama "llama3.2:1b" --base-url="http://localhost:11434"

(III) JudgeLlmDeepEval.py using DeepEval with Ollama

# Set up ollama model

model = OllamaModel(
model="llama3.2:1b",
base_url="http://localhost:11434",
..

# Set up evaluation metrics

correctness_metric = GEval(
name="Correctness",
criteria="Determine whether the actual output is factually correct based on the ... ...

answer_relevancy = AnswerRelevancyMetric(threshold=0.5, model=model)

# define the test case

test_case_maths = LLMTestCase(
input="what is 80 in words? using only 1 word.",
actual_output="eighty",
expected_output="eighty"
)

# Run the evaluation

evaluate(test_cases=[test_case_maths], metrics=[answer_relevancy])

(IV) Run JudgeLlmDeepEval.py

deepeval test run JudgeLlmDeepEval.py

Friday, October 31, 2025

Codegen LLMs

One GenAI feature making headlines is coding. GenAI apps are getting better with reading and writing code (codegen) in various programming languages such as Python, Java, C++ and so on.

On the evaluation side there are all kinds of benchmarks and leaderboards that track progress on codegen. Additionally, aspects of usability, platform support, IDE integration, etc are all key factors for using codegen.

In terms of local evaluations, Ollama provides handy options. With Ollama it's easy to download and run LLMs from various providers (Llama, Gemma, etc). Most now support codegen and readily follow instructions in a chat to churn out basic level Python code:

Llama3.2
Gemma3
Codegemma
Falcon3
Starcoder

Thursday, October 30, 2025

Langchain4j

LangChain is one of the leading python based AI/ ML, agentic modelling and integration frameworks. Langchain (and allied frameworks like LangGraph) allows integration with almost all LLMs, python libraries and tools out there.

Langchain4j is its Java counterpart. Langchain4j allows LLM integrations and workflows to be built using pure Java constructs. It primarily operates as a Java client to the various api's exposed by the different LLM providers such as OpenAi, Azure, Bedrock, Gemini and so on.

Langchain4j has covered a lot of ground in terms of the supported modules from both the Python and the Java ecosystems. It's actively supported and should be one for the long run..

To get a feel for Langchain4j on a local LLM try out langchain4j-ollama which gets:

Java langchain4j-ollama to talk to

-> Ollama (deployed locally)

-> Hosting the llama3.2:1b model

(I) Get a local Ollama up & running

Refer to the previous post regarding installing/ getting Ollama running locally. Once done, you should have a llama3.2:1b model running & ready to chat locally on:

http://127.0.0.1:11434

(II) Download & build langchain4j-ollama project

Clone langchain4j-ollama project & build:

cd </download/folder/langchain4j-ollama>

mvn install

(III) Run langchain4j-ollama tests

Run a couple of the langchain4j-ollama integration tests. Start with OllamaChatModelIT.java. Make sure to update the MODEL_NAME value to the llama3.2:1b model downloaded in step (I) above:

static final String MODEL_NAME = "llama3.2:1b";

That's about it for getting the three pieces integrated & chatting!

Wednesday, October 29, 2025

Ollama for Local Llm

Continuing with the theme of running LLMs locally, it was ideal to go with Ollama, which has gained significant ground over the last year or so. While tools like Gpt4All, Llamafile, etc are all based on llama.cpp, Ollama has its own model format.

More importantly, Ollama supports older CPUs (non-AVX), something that llama.cpp and the others have stopped supporting.

(I) Installation

Ollama installation is on a box with Ubuntu 25.10 with the latest libraries installed.

sudo snap install ollama

(II) Pull Ollama model

ollama pull llama3.2:1b

To view the downloaded models, use the 'ollama list' command:

ollama list

|__

NAME           ID              SIZE      MODIFIED
llama3.2:1b    baf6a787fdff    1.3 GB    ...

Models by default get downloaded to: /var/snap/ollama/common/models, verified via:

sudo snap get ollama

|__

Key                   Value
context-length
cuda-visible-devices
debug                 0
flash-attention       0
host                  127.0.0.1:11434
models                /var/snap/ollama/common/models
origins               *://localhost

To change the default location, the models config key needs to be updated:

sudo snap set ollama models=/UPDATED/MODELS/LOCATION/PATH

(III) Run/ chat with downloaded model:

ollama run llama3.2:1b

|__

>>> bye
Bye.

>>> what is 80 + 10?
80 + 10 = 90.

(IV) Install/ Run any locally downloaded GGUF model

Ollama also provides the option to run any downloaded GGUF model locally. These are models not downloaded via ollama pull (ref step (II)) but models downloaded from Hugging Face, etc.

A simple modelfile needs to be prepared with a one-line instruction:

FROM </GGUF/FILE/DOWNLOAD/LOCATION>

Next the Ollama create command is to be run using the modelfile:

ollama create <CUSTOM_MODEL_NAME> -f </modelfile/LOCATION>

With that, your downloaded model (GGUF) file will be available for running from Ollama and will show up in:

ollama list

(Note: There's a known issue with the template of downloaded model.)

(V) Ollama API

The Ollama server by default listens on the endpoint: http://127.0.0.1:11434

Through this endpoint, various Ollama APIs are available for working with the installed models: chatting, generating completions, listing models, showing model details, running models, version, push, pull, etc.
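As a minimal sketch of calling one of these APIs from Java, the snippet below posts a prompt to the /api/generate endpoint using only the JDK's java.net.http client (Java 11+). The class name OllamaApiSketch, the helper methods and the hand-rolled JSON escaping are assumptions made for illustration; a real client would likely use a JSON library or Langchain4j instead.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper class, not part of Ollama or any library
class OllamaApiSketch {
    // Default local endpoint of the Ollama server
    static final String BASE_URL = "http://127.0.0.1:11434";

    // Minimal escaping for illustration only: backslashes, quotes and newlines
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    // JSON body for a single, non-streamed completion
    static String generateBody(String model, String prompt) {
        return "{\"model\": \"" + escapeJson(model) + "\", \"prompt\": \""
                + escapeJson(prompt) + "\", \"stream\": false}";
    }

    // POST request against the /api/generate endpoint
    static HttpRequest generateRequest(String model, String prompt) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/api/generate"))
                .header("Content-Type", "application/json")
                .timeout(Duration.ofSeconds(120))
                .POST(HttpRequest.BodyPublishers.ofString(generateBody(model, prompt)))
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        try {
            HttpResponse<String> response = client.send(
                    generateRequest("llama3.2:1b", "what is 80 + 10?"),
                    HttpResponse.BodyHandlers.ofString());
            // The raw JSON reply is printed as-is; no parsing is attempted here
            System.out.println(response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            System.out.println("Ollama not reachable at " + BASE_URL + ": " + e);
        }
    }
}
```

With "stream" set to false the server replies with one JSON document rather than a stream of chunks, which keeps the sketch simple.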

(VI) Remove Models

To remove any downloaded models run the 'ollama rm' command:

ollama rm llama3.2:1b

(VII) Stop Ollama

Stopping/ unloading of just the running model can be effected via an Ollama API call with keep_alive=0, along with an empty message:

curl http://127.0.0.1:11434/api/chat -d '{"model": "llama3.2:1b","messages": [],"keep_alive": 0}'

On the other hand, stopping the Ollama service is unimplemented, so a hard kill is the only option that works (it will also unload all running models):

sudo pkill -9 ollama

Snap will, however, restart the Ollama snap the moment it is killed, as part of its recovery/ restart protocol.

To completely shut down/ disable Ollama:

sudo snap disable ollama

Tuesday, October 28, 2025

Gpt4All on Ubuntu-20

Notes from a rather tough, yet futile, attempt at getting Gpt4All to run locally on an old Ubuntu 20.04 box, with Python 3.8.

* Pip Install: First up, the Gpt4All installed via pip (ver 2.8.2) has changes incompatible with current/ recent model files & gguf (llama3.2, etc), causing type, value, keyword, attribute errors, etc at different stages of installation & execution.

* Custom Build: The alternative is to download the latest Gpt4All & build it.

This leads to issues with the Ubuntu 20.04 libraries being outdated/ missing & the hardware being outdated:

GLIBC_2.32 not found
GLIBCXX_3.4.29 not found
CMake 3.23 or higher is required. You are running version 3.16.3
Vulkan not found, version incompatible with Gpu, etc
CUDA Toolkit not found.
CPU does not support AVX

Anyway, after a lot of false steps the build did succeed with the following flags set:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLMODEL_CUDA=OFF -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON

Build files have been written to: .../gpt4all/gpt4all-backend/build

Even after all that, there were issues popping up with getting LLMs to run from libraries like langchain, pygpt4all and so on, clearly indicating that it was time to bid adieu to Ubuntu 20.04 & upgrade to a more recent and better supported version.

References

https://python.langchain.com/docs/how_to/local_llms/
https://askubuntu.com/questions/1393285/how-to-install-glibcxx-3-4-29-on-ubuntu-20-04
https://stackoverflow.com/questions/71940179/error-lib-x86-64-linux-gnu-libc-so-6-version-glibc-2-34-not-found

Sunday, October 26, 2025

Mlflow Java client

Mlflow is a leading open source framework for managing AI/ ML workflows. Mlflow allows tracking, monitoring and generally visualizing the e2e ML project lifecycle. It's a handy ops-side tool that improves the overall interpretability of AI/ ML projects.

Key Mlflow concepts are intuitively named: ML Projects contain Models, on which Runs of Experiments are done, which in turn get Tagged with meaningful, human-readable labels, and so on.

While Mlflow is a Python native library with integrations with all the leading Python AI/ ML frameworks such as OpenAI, Langchain, Llamaindex, etc, there are also Mlflow API endpoints for wider portability.

There's also a Mlflow Java API for use from the Java ecosystem. The corresponding Mlflow Java client (maven plugin, etc) works well with the API. To get started with mlflow using Java:

(I) Install mlflow (Getting started guide)

$ pip install mlflow

This installs mlflow to the user's .local folder:

~/.local/bin/mlflow

(II) Start Local mlflow server (simple without authentication)

$ mlflow server --host 127.0.0.1 --port 8080

mlflow server should be running on

http://127.0.0.1:8080

(III) Download mlflower repo (sample Java client code)

Next clone the mlflower repo which has some sample code showing working of the mlflow Java client.

The class Mlfclient shows a simple use case of Creating an Experiment:

client.createExperiment(experimentName);

Followed by a few runs of logging some Parameters, Metrics, Artifacts:

run.logParam();
run.logMetric();
run.logArtifact();

Run Hierarchy: Class NestedMlfClient shows nesting hierarchy of Mlflow runs

Parent Run -> Child Run -> Grand Child Run ->.... & so on

(IV) Start Local mlflow server (with Basic Authentication)

While authentication is crucial for managing workflows, Mlflow only provided Basic Auth till very recently. Version 3.5 onwards has better support for various auth providers, SSO, etc. For now only the Mlflow Basic Auth integration is shown.

# Start server with Basic Auth
mlflow server --host 127.0.0.1 --port 8080 --app-name basic-auth

Like previously, mlflow server should start running on

http://127.0.0.1:8080

This time, however, login credentials are required to access the page. The default admin credentials are mentioned on mlflow basic-auth-http.

The class BasicAuthMlfclient shows the Java client using BasicMlflowHostCreds to connect to Mlflow with basic auth.

new MlflowClient(new BasicMlflowHostCreds(TRACKING_URI, USERNAME, PASSWORD));
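For context, HTTP Basic auth boils down to an Authorization header carrying the Base64 encoding of "username:password". The sketch below shows that encoding with the JDK only. It illustrates the general scheme, not how BasicMlflowHostCreds is implemented internally; the class name, method name and the sample credentials are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, for illustration only
class BasicAuthHeaderSketch {
    // Builds the value of the HTTP Authorization header for Basic auth
    static String basicAuthHeader(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Sample credentials, not the mlflow defaults
        System.out.println("Authorization: " + basicAuthHeader("user", "secret"));
    }
}
```

Since Base64 is an encoding and not encryption, Basic auth is only reasonable over HTTPS or on a trusted local setup like the one used here.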

(V) Deletes Soft/ Hard

Experiments, Runs, etc created within mlflow can be deleted from the UI (& client). The deletes are however only Soft, and get stored somewhere in a Recycle Bin that is not visible on the UI.
Hard/ permanent deletes can be effected from the mlflow cli:

# Set mlflow server tracking uri

export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

# Clear garbage

mlflow gc

(VI) Issues

MlflowContext.withActiveRun() absorbs exceptions without any logs, & sets the run status to RunStatus.FAILED.

To debug runs shown as failed on the mlflow UI, it's best to put an explicit try-catch on the client to find the cause.
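One way to add such an explicit try-catch without repeating it in every run body is a small wrapper, sketched below with the JDK only. It assumes the run body is passed as a java.util.function.Consumer (which is how withActiveRun is expected to take it); the class LoggingConsumerSketch and the method logged are hypothetical names, not part of the Mlflow client.

```java
import java.util.function.Consumer;

// Hypothetical helper class, not part of the Mlflow Java client
class LoggingConsumerSketch {
    // Wraps a run body so that any exception is printed before it propagates
    static <T> Consumer<T> logged(Consumer<T> body) {
        return arg -> {
            try {
                body.accept(arg);
            } catch (RuntimeException e) {
                System.err.println("Run body failed: " + e);
                e.printStackTrace();
                throw e; // rethrow so the caller can still mark the run as failed
            }
        };
    }

    public static void main(String[] args) {
        // Stand-in for a run body; a String is used in place of an active run
        Consumer<String> failing = logged(run -> {
            throw new IllegalStateException("boom in " + run);
        });
        try {
            failing.accept("demo-run");
        } catch (IllegalStateException e) {
            System.out.println("caller saw: " + e.getMessage());
        }
    }
}
```

Under that assumption, the run body handed to withActiveRun would simply be wrapped as logged(run -> { ... }), so the cause shows up on the client console even though the run is only reported as failed on the UI.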

Unable to upload artifacts since the cli looks for python (& not python3) on the path to run.

Error message: Failed to exec 'python -m mlflow.store.artifact.cli', needed to access artifacts within the non-Java-native artifact store at 'mlflow-artifacts:
The dev box (Ubuntu ver 20.04) has python3 (& not python) installed.
Without changing the dev box, a simple fix is to set/ export the environment variable MLFLOW_PYTHON_EXECUTABLE (within the IDE, shell, etc) to whichever python executable is installed on the box:

MLFLOW_PYTHON_EXECUTABLE=/usr/bin/python3

So with that, keep the AI/ ML projects flowing!

Wednesday, October 8, 2025

AI/ML '25

• GenAI
    - Text: Chat, Q&A, Compose, Summarize, Think, Search, Insights, Research
    - Image: Gen, Identify, Search (Image-Image, Text-Image, etc), Label, Multimodal
    - Code gen
    - Research: Projects, Science, Breakthroughs
    - MoE

• Agentic
    - Workflows: GenAI, DNN, Scripts, Tools, etc combined to fulfil Objectives
        -- Auto-Generated Plans & Objectives
    - Standardization: MCP (API), Interoperability, Protocols
    - RAG
    - Tools: Websearch, DB, Invoke API/ Tools/ LLM, etc

• Context
    - Fresh/ Updated
    - Length: Cost vs Speed trade-off
    - RAG
    - VectorDB (Similarity/ Relevance)
    - Memory enhanced

• Fine Tune
    - Foundation models (generalists) -> Specialists
    - LoRA
    - Inference time scaling (compute, tuning, etc)
    - Prompts

• Multimodal: Text, Audio, Video, Image, Graph, Sensors

• Safety/ Security
    - Output Quality: Relevance, Accuracy, Correctness, Evaluation (Automated Rating, Ranking, JudgeLLM, etc)
        -- Hallucination
    - Privacy, Data Leak, Backdoor, Jailbreak
    - Guard Rails