Friday, October 31, 2025

Codegen LLMs

One GenAI feature making headlines is coding. GenAI apps are getting better at reading and writing code (codegen) in various programming languages such as Python, Java, C++ and so on.

On the evaluation side there are all kinds of benchmarks and leaderboards that track progress on codegen. Additionally, aspects of usability, platform support, IDE integration, etc. are all key factors when adopting codegen.

In terms of local evaluations, Ollama provides handy options. With Ollama it's easy to download and run LLMs from various providers (Llama, Gemma, etc). Most now support codegen and readily follow instructions in a chat to churn out basic-level Python code (a quick API sketch follows the list below):

  • Llama3.2
  • Gemma3
  • Codegemma
  • Falcon3
  • Starcoder
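
As a quick illustration, the sketch below asks one of these models for Python code through Ollama's /api/chat endpoint from Java (a minimal sketch; it assumes a local Ollama instance on the default port with the codegemma model already pulled, and just prints the raw JSON response):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OllamaCodegenDemo {
        public static void main(String[] args) throws Exception {
            // Assumes Ollama is running locally and 'codegemma' has been pulled
            String body = "{\"model\": \"codegemma\", \"stream\": false, "
                    + "\"messages\": [{\"role\": \"user\", "
                    + "\"content\": \"Write a Python function that checks if a number is prime.\"}]}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://127.0.0.1:11434/api/chat"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The generated Python code sits in the "message.content" field of the JSON reply
            System.out.println(response.body());
        }
    }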

 

Thursday, October 30, 2025

Langchain4j

LangChain is one of the leading Python-based AI/ ML, agentic modelling and integration frameworks. LangChain (and allied frameworks like LangGraph) allows integration with almost all the LLMs, Python libraries and tools out there.

Langchain4j is its Java counterpart. Langchain4j allows LLM integrations and workflows to be built using pure Java constructs. It primarily operates as a Java client to the various APIs exposed by the different LLM providers such as OpenAI, Azure, Bedrock, Gemini and so on.

Langchain4j has covered a lot of ground in terms of the supported modules from both the Python and the Java ecosystems. It's actively supported and should be one for the long run.

To get a feel for Langchain4j on a local LLM, try out langchain4j-ollama.

This wires up:

    Java (langchain4j-ollama) to talk to

        -> Ollama (deployed locally)

                -> hosting the llama3.2:1b model

(I) Get a local Ollama up & running

Refer to the previous post on installing and getting Ollama running locally. Once done, you should have a llama3.2:1b model running & ready to chat locally on:

    http://127.0.0.1:11434 

(II) Download & build langchain4j-ollama project

Clone langchain4j-ollama project & build:

    cd </download/folder/langchain4j-ollama> 

    mvn install 

(III) Run langchain4j-ollama tests

Run a couple of the langchain4j-ollama integration tests. Start with OllamaChatModelIT.java. Make sure to update the MODEL_NAME value to the llama3.2:1b model downloaded in step (I) above:

         static final String MODEL_NAME = "llama3.2:1b";
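
Alternatively, for a quick standalone check outside the test suite, a minimal snippet along these lines should work (a sketch only; the builder and method names follow the langchain4j-ollama API and may differ slightly across versions):

    import dev.langchain4j.model.ollama.OllamaChatModel;

    public class OllamaHello {
        public static void main(String[] args) {
            // Points at the local Ollama endpoint from step (I)
            OllamaChatModel model = OllamaChatModel.builder()
                    .baseUrl("http://127.0.0.1:11434")
                    .modelName("llama3.2:1b")
                    .build();

            // Older langchain4j versions expose generate(String); newer 1.x versions use chat(String)
            String answer = model.generate("what is 80 + 10?");
            System.out.println(answer);
        }
    }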

That's about it for getting the three pieces integrated & chatting! 

Wednesday, October 29, 2025

Ollama for Local Llm

Continuing with the theme of running LLMs locally, it was time to go with Ollama, which has gained significant ground over the past year or so. While tools like Gpt4All, Llamafile, etc. are all based on llama.cpp, Ollama has its own model format.

More importantly, Ollama supports older CPUs (non-AVX), something no longer supported by llama.cpp and the others.

(I) Installation

The Ollama installation here is on a box running Ubuntu 25.10 with the latest libraries installed.

    sudo snap install ollama

(II) Pull Ollama model

        ollama pull llama3.2:1b

  • To view the downloaded models, use the 'ollama list' command:

        ollama list  

            |__ 

NAME                          ID              SIZE      MODIFIED   
llama3.2:1b                   baf6a787fdff    1.3 GB    ... 

  • Models by default get downloaded to /var/snap/ollama/common/models, verified via:

        sudo snap get ollama 

        |__

Key                   Value
context-length        
cuda-visible-devices  
debug                 0
flash-attention       0
host                  127.0.0.1:11434
models                /var/snap/ollama/common/models
origins               *://localhost 

To change the default location, the 'models' config needs to be updated:

        sudo snap set ollama models=/UPDATED/MODELS/LOCATION/PATH

(III) Run/ chat with the downloaded model:

    ollama run llama3.2:1b

        |__ 

 >>> bye
Bye.

>>> what is 80 + 10?
80 + 10 = 90.

(IV) Install/ Run any locally downloaded GGUF model

Ollama also provides the option to run any locally downloaded GGUF model. These are models not downloaded via ollama pull (ref step (II)) but downloaded from Hugging Face, etc.

A simple modelfile needs to be prepared with a one-line instruction:

        FROM </GGUF/FILE/DOWNLOAD/LOCATION> 

Next, run the Ollama create command using the modelfile:

        ollama create <CUSTOM_MODEL_NAME> -f </MODELFILE/LOCATION>

With that, your downloaded GGUF model would be available to run from Ollama and will show up in: 

        ollama list    

(Note: There's a known issue with the template of the downloaded model.)

(V) Ollama API

The Ollama server by default listens on the endpoint: http://127.0.0.1:11434.

Through this endpoint, various Ollama APIs are available for chatting, generating completions, listing models, showing details, running models, version info, push, pull, etc. against the installed models.
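
As a small illustration, the sketch below hits two of the read-only endpoints from Java: /api/version (server version) and /api/tags (locally available models, the same info as 'ollama list'). It assumes the default local endpoint and simply prints the raw JSON:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OllamaApiCheck {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();

            for (String path : new String[]{"/api/version", "/api/tags"}) {
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("http://127.0.0.1:11434" + path))
                        .GET()
                        .build();
                HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
                System.out.println(path + " -> " + resp.body());
            }
        }
    }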

(VI) Remove Models

To remove any downloaded models run the 'ollama rm' command:

        ollama rm llama3.2:1b

(VII) Stop Ollama

  • Stopping/ unloading of just the running model can be effected via an Ollama API call with keep_alive=0, along with an empty message: 

        curl http://127.0.0.1:11434/api/chat -d '{"model": "llama3.2:1b","messages": [],"keep_alive": 0}'

  • Killing the Ollama process directly does not shut it down for good:

        sudo pkill -9 ollama

     Snap will restart the Ollama service the moment it is killed (recovery/ restart).

  • To completely shutdown/ disable Ollama:

        sudo snap disable ollama 

 

Tuesday, October 28, 2025

Gpt4All on Ubuntu-20

Notes from a rather tough, yet futile, attempt at getting Gpt4All to run locally on an old Ubuntu20.04 box, with Python-3.8.

* Pip Install: First up, the Gpt4All installed via pip (ver 2.8.2) has changes incompatible with current/ recent model files & GGUFs (llama3.2, etc), causing type, value, keyword, attribute errors, etc. at different stages of installation & execution.

* Custom Build: The alternative is to download the latest Gpt4All source & build it.

This leads to issues with Ubuntu 20.04 libraries being outdated/ missing & the hardware being outdated:

  • GLIBC_2.32 not found 
  • GLIBCXX_3.4.29 not found
  • CMake 3.23 or higher is required.  You are running version 3.16.3
  • Vulkan not found, version incompatible with Gpu, etc
  • CUDA Toolkit not found. 
  • CPU does not support AVX 

Anyway, after a lot of false steps the build did succeed with the following flags set:

    cmake -B build -DCMAKE_BUILD_TYPE=Rel -DLLMODEL_CUDA=OFF -DKOMPUTE_OPT_DISABLE_VULKAN_VERSION_CHECK=ON 

     Build files have been written to: .../gpt4all/gpt4all-backend/build 

 

Even after all that, there were issues popping up with getting LLMs to run from libraries like langchain, pygpt4all and so on, clearly indicating that it was time to bid adieu to Ubuntu 20.04 & upgrade to more recent and better supported versions.

References

  • https://python.langchain.com/docs/how_to/local_llms/
  • https://askubuntu.com/questions/1393285/how-to-install-glibcxx-3-4-29-on-ubuntu-20-04
  • https://stackoverflow.com/questions/71940179/error-lib-x86-64-linux-gnu-libc-so-6-version-glibc-2-34-not-found 

Sunday, October 26, 2025

Mlflow Java client

Mlflow is a leading open source framework for managing AI/ ML workflows. Mlflow allows tracking, monitoring and generally visualizing end-to-end ML project lifecycles. A handy ops-side tool that improves the interpretability of AI/ ML projects.

Key Mlflow concepts include Projects, Models and Experiments, on which several Runs are conducted, to name a few. Experiments can also be Tagged with meaningful, human-relevant labels.

While Mlflow is a Python-native library with integrations for all the leading Python AI/ ML frameworks such as OpenAI, LangChain, LlamaIndex, etc., there are also Mlflow REST API endpoints for wider portability.

There is also a specific Mlflow Java API for use from the Java ecosystem. The corresponding Mlflow Java client (available as a Maven artifact, etc.) works well with the API. To get started with Mlflow using Java:

(I) Install mlflow (Getting started guide)

        $ pip install mlflow 

 This installs mlflow to the user's .local folder:

        ~/.local/bin/mlflow 

(II) Start Local mlflow server (simple without authentication)

        $ mlflow server --host 127.0.0.1 --port 8080

mlflow server should be running on 

        http://127.0.0.1:8080

(III) Download mlflower repo (sample Java client code)

Next, clone the mlflower repo, which has some sample code demonstrating the mlflow Java client.

  • The class Mlfclient shows a simple use case of Creating an Experiment:

            client.createExperiment(experimentName);

Followed by a few runs logging some Parameters, Metrics & Artifacts (pieced together below):

        run.logParam();

        run.logMetric();

        run.logArtifact();
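
Pieced together, an equivalent end-to-end flow using the lower-level MlflowClient directly would look roughly like this (a sketch; method names follow the org.mlflow:mlflow-client library and may vary by version, and the artifact file name is just a placeholder):

    import org.mlflow.api.proto.Service.RunInfo;
    import org.mlflow.api.proto.Service.RunStatus;
    import org.mlflow.tracking.MlflowClient;

    public class MlfClientSketch {
        public static void main(String[] args) {
            // Tracking server from step (II)
            MlflowClient client = new MlflowClient("http://127.0.0.1:8080");

            // Create an Experiment & start a Run under it
            String experimentId = client.createExperiment("java-client-demo");
            RunInfo run = client.createRun(experimentId);
            String runId = run.getRunUuid();

            // Log a Parameter, a Metric and an Artifact against the Run
            client.logParam(runId, "alpha", "0.5");
            client.logMetric(runId, "rmse", 0.86);
            client.logArtifact(runId, new java.io.File("model-summary.txt")); // placeholder file

            client.setTerminated(runId, RunStatus.FINISHED);
        }
    }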

 

  • Run Hierarchy: Class NestedMlfClient shows nesting hierarchy of Mlflow runs

        Parent Run -> Child Run -> Grand Child Run ->.... & so on

(IV) Start Local mlflow server (with Basic Authentication)

While authentication is crucial for managing workflows, Mlflow only provided Basic Auth till very recently. Version 3.5 onwards has better support for various auth providers, SSO, etc. For now, only the mlflow Basic Auth integration is shown.

           # Start server with Basic Auth
            mlflow server --host 127.0.0.1 --port 8080 --app-name basic-auth

Like previously, mlflow server should start running on

            http://127.0.0.1:8080

This time a login credential is required to access the page. The default admin credentials are mentioned on the mlflow basic-auth-http page.

  • The class BasicAuthMlfclient shows the Java client using BasicMlflowHostCreds to connect to Mlflow with basic auth. 

            new MlflowClient(new BasicMlflowHostCreds(TRACKING_URI, USERNAME, PASSWORD));

(V) Deletes Soft/ Hard

  • Experiments, Runs, etc. created within mlflow can be deleted from the UI (& client). The deletes are however only Soft; the deleted items get moved to a recycle area that is not visible on the UI.
  • Hard/ permanent deletes can be effected from the mlflow cli:

    # Set mlflow server tracking uri 

    export MLFLOW_TRACKING_URI=http://127.0.0.1:8080

    # Clear garbage

    mlflow gc

  (VI) Issues

  • MlflowContext.withActiveRun() absorbs exceptions without any logs and simply sets the run status to RunStatus.FAILED.
    • So in case runs show failure on the mlflow UI, it's best to put an explicit try-catch on the client to find the cause.
  • Unable to upload artifacts since the cli looks for python (& not python3) on the path to run. 
    • Error message: Failed to exec 'python -m mlflow.store.artifact.cli', needed to access artifacts within the non-Java-native artifact store at 'mlflow-artifacts:
    • The dev box (Ubuntu ver 20.04) has python3 (& not python) installed.
    • Without changing the dev box a simple fix is to set/ export the environment variable MLFLOW_PYTHON_EXECUTABLE (within the IDE, shell, etc) to whichever python lib is installed on the box:
               MLFLOW_PYTHON_EXECUTABLE=/usr/bin/python3 
 
So with that, keep the AI/ ML projects flowing!

Wednesday, October 8, 2025

AI/ML '25

• GenAI
    - Text: Chat, Q&A, Compose, Summarize, Think, Search, Insights, Research
     - Image: Gen, Identify, Search (Image-Image, Text-Image, etc), Label, Multimodal
    - Code gen
    - Research: Projects, Science, Breakthroughs
    - MoE

• Agentic
    - Workflows: GenAI, DNN, Scripts, Tools, etc combined to fulfil Objectives
        -- Auto-Generated Plans & Objectives  
    - Standardization: MCP (API), Interoperability, Protocols
    - RAG
    - Tools: Websearch, DB, Invoke API/ Tools/ LLM, etc

• Context
    - Fresh/ Updated
    - Length: Cost vs Speed trade-off
    - RAG
    - VectorDB (Similarity/ Relevance)
    - Memory enhanced

• Fine Tune
    - Foundation models (generalists) -> Specialists
    - LoRA
    - Inference time scaling (compute, tuning, etc)
    - Prompts

• Multimodal: Text, Audio, Video, Image, Graph, Sensors

• Safety/ Security
    - Output Quality: Relevance, Accuracy, Correctness, Evaluation (Automated Rating, Ranking, JudgeLLM, etc)
        -- Hallucination
    - Privacy, Data Leak, Backdoor, Jailbreak
    - Guard Rails

Friday, April 18, 2025

AI Agentic Frameworks

With the proliferation of AI Agents, it's only logical that there will be attempts at standardization and at building protocols & frameworks.

Thursday, April 17, 2025

On Quantization

  • Speed vs Accuracy trade-off.
  • Reduce costs on storage, compute, operations.
  • Speed up output generation, inference, etc.
  • Work with lower precision data.
  • Cast/ map data from Int32, Float32, etc. 32-bit or higher precision to lower precision data types such as 16-bit Brain Float (BFloat16) or 4-bit (NFloat)/ int4 or int8, etc.
    • Easy mapping: Float32 (1-bit Sign, 8-bit Exponent, 23-bit Mantissa) => BFloat16 (1-bit Sign, 8-bit Exponent, 7-bit Mantissa). Just truncate the lower 16 bits of the mantissa. No overflow!
    • Straightforward mapping: work out max, min, data distribution, mean, variance, etc. & then sub-divide the range into equally sized buckets based on the bit size of the lower precision data type. E.g. int4 (4-bit) => 2^4 = 16 buckets (see the sketch below).
    • Handle outliers & data skew, which can mess up the mapping, yet lead to loss of useful info if discarded randomly.
    • Work out Bounds wrt Loss of Accuracy.
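
A toy illustration of the straightforward (scalar, symmetric) mapping to int8, with the scale derived from the max absolute value (a sketch for intuition only, not a library implementation):

    public class ScalarQuantizationToy {
        public static void main(String[] args) {
            float[] x = { -3.2f, -0.5f, 0.0f, 0.7f, 2.9f, 6.1f };

            // Symmetric int8 mapping: a single scale derived from max |x|
            float maxAbs = 0f;
            for (float v : x) maxAbs = Math.max(maxAbs, Math.abs(v));
            float scale = maxAbs / 127f;                 // int8 symmetric range [-127, 127]

            byte[] q = new byte[x.length];
            for (int i = 0; i < x.length; i++) {
                q[i] = (byte) Math.round(x[i] / scale);  // quantize
            }
            for (int i = 0; i < x.length; i++) {
                float back = q[i] * scale;               // dequantize (lossy)
                System.out.printf("%6.2f -> %4d -> %6.2f%n", x[i], q[i], back);
            }
        }
    }

Outliers stretch maxAbs and hence the scale, which is exactly why the outlier/ skew handling called out above matters.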

LLMs, AI/ ML side:

  • https://newsletter.theaiedge.io/p/reduce-ai-model-operational-costs

Lucene, Search side:

  • https://www.elastic.co/search-labs/blog/scalar-quantization-101
  • https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene

Wednesday, April 16, 2025

Speculative Decoding

  • Ensemble of a Weak + Strong model.
  • The Weak (draft) model has a quick first go at generating candidate tokens (potentials).
  • The Strong, but slow, model then catches up & verifies the outputs of the weak model, sampling/ grading them and accepting/ rejecting them to generate the final output (toy sketch below).
  • Overall this makes inference via LLMs quicker and cheaper.
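
A toy sketch of the accept/ reject loop (the simplified greedy-verification variant, with deterministic stand-in functions instead of real draft/ target LLMs; real speculative sampling accepts/ rejects probabilistically against the two models' token distributions):

    import java.util.ArrayList;
    import java.util.List;

    public class SpeculativeDecodingToy {
        // Stand-in "models": deterministic next-token functions over integer tokens
        static int targetNext(List<Integer> ctx) {
            return (ctx.get(ctx.size() - 1) * 3 + 1) % 10;
        }
        static int draftNext(List<Integer> ctx) {
            int t = targetNext(ctx);
            return ctx.size() % 3 == 0 ? (t + 1) % 10 : t;   // draft is wrong every 3rd position
        }

        public static void main(String[] args) {
            List<Integer> output = new ArrayList<>(List.of(1));   // "prompt"
            int k = 4;                                            // draft lookahead

            while (output.size() < 12) {
                int base = output.size();

                // 1. Draft model cheaply proposes k candidate tokens
                List<Integer> draft = new ArrayList<>(output);
                for (int i = 0; i < k; i++) draft.add(draftNext(draft));

                // 2. Target model verifies left to right: accept while it agrees,
                //    replace the first token it disagrees on with its own choice
                int accepted = 0;
                Integer correction = null;
                for (int i = 0; i < k; i++) {
                    int target = targetNext(draft.subList(0, base + i));
                    if (target == draft.get(base + i)) accepted++;
                    else { correction = target; break; }
                }

                for (int i = 0; i < accepted; i++) output.add(draft.get(base + i));
                output.add(correction != null ? correction : targetNext(output)); // bonus token if all accepted
                System.out.println("accepted " + accepted + "/" + k + " -> " + output);
            }
        }
    }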

More to follow..

  • https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/ 
  • https://www.baseten.co/blog/a-quick-introduction-to-speculative-decoding/
  • https://research.google/blog/looking-back-at-speculative-decoding/
  • https://medium.com/ai-science/speculative-decoding-make-llm-inference-faster-c004501af120

Tuesday, April 8, 2025

Revisiting the Bitter Lesson

Richard Sutton's The Bitter Lesson(s) continue to hold true. Scaling/ data walls could pose challenges to scaling AI general purpose methods (like searching and learning) beyond a point. And that's where human innovation & ingenuity would be needed. But hang on, wouldn't that violate the "..by our methods, not by us.." lesson?

Perhaps then something akin to human innovation/ discovery/ ingenuity/ creativity might be the next frontier of meta-methods. Machines in their typical massively parallel & distributed, brute-force, systematic trial & error fashion would auto ideate/ innovate/ discover solutions quicker, cheaper, better. Over & over again.

So machine discoveries shall abound, just not of Archimedes's Eureka kind, but in Edison's 100-different-ways style!

Sunday, April 6, 2025

Model Context Protocol (MCP)

A standardization protocol for AI agents. It enables them to act, inter-connect, process, parse & invoke functions; in other words, to Crawl, Browse, Search, Click, etc.

MCP re-uses the well-known client-server architecture, using JSON-RPC.

Apps use MCP Clients -> MCP Servers (which abstract the underlying service)

Kind of API++ for an AI world!

Saturday, April 5, 2025

Open Weight AI

Inspired by Open Source Software (OSS), yet not fully open...

With Open Weight (OW), typically the final model weights (& the fully trained model) are made available under a liberal free-to-reuse, modify, distribute, non-discriminating, etc. licence. This helps anyone wanting to start with the fully trained Open Weight model & apply it, fine-tune it, or adapt it (LoRA, RAG, etc.) for custom use-cases. To that extent, OW has a share & reuse philosophy.
 
On the other hand, wrt training data, data sources, detailed architecture, optimization details, and so on, OW diverges from OSS by not making it compulsory to share any of these. So these remain closed source with the original devs, with a bunch of pros & cons. Copyright material, IP protection, commercial gains, etc. are some stated advantages for the original devs/ org. But lack of visibility for the wider community, no white-box evaluation of model internals and biases, and fewer checks & balances are among the downsides of not allowing a full peek into the model.

Anyway, that's the present, a time of great flux. As models stabilize over time OW may tend towards OSS...

References

  • https://openweight.org/    
  • https://www.oracle.com/artificial-intelligence/ai-open-weights-models/
  • https://medium.com/@aruna.kolluru/exploring-the-world-of-open-source-and-open-weights-ai-aa09707b69fc
  • https://www.forbes.com/sites/adrianbridgwater/2025/01/22/open-weight-definition-adds-balance-to-open-source-ai-integrity/
  • https://promptengineering.org/llm-open-source-vs-open-weights-vs-restricted-weights/
  • https://promptmetheus.com/resources/llm-knowledge-base/open-weights-model
  • https://www.agora.software/en/llm-open-source-open-weight-or-proprietary/

Wednesday, April 2, 2025

The Big Book of LLM

A book by Damien Benveniste of AIEdge. Though a work in progress, chapters 2 - 4 available for preview are fantastic. 

Look forward to a paperback edition, which I certainly hope to own...

Tuesday, April 1, 2025

Mozilla.ai

Mozilla pedigree, AI focus, Open-source, Dev oriented.

Blueprint Hub: Mozilla.ai's hub of open-source, templatized, customizable AI solutions for developers.

Lumigator: Platform for model evaluation and selection. Consists of a Python FastAPI backend for AI lifecycle management & capturing workflow data useful for evaluation.

Friday, March 28, 2025

Streamlit

Streamlit is a web wrapper for Data Science projects in pure Python. It's a lightweight, simple, rapid prototyping web app framework for sharing scripts.

  • https://streamlit.io/playground
  • https://www.restack.io/docs/streamlit-knowledge-streamlit-vs-flask-vs-django
  • https://docs.streamlit.io/develop/concepts/architecture/architecture
  • https://docs.snowflake.com/en/developer-guide/streamlit/about-streamlit

Saturday, March 15, 2025

Scaling Laws

Quick notes around Chinchilla Scaling Law/ Limits & beyond for DeepLearning and LLMs.

Factors

  • Model size (N)
  • Dataset size (D)
  • Training Cost (aka Compute) (C)
  • Test Cross-entropy loss (L)

The intuitive way,

  • Larger data will need a larger model, and will have a higher training cost. In other words N, D, C all increase together, not necessarily linearly; the relationship could be exponential, log-linear, etc.
  • Likewise, Loss is likely to decrease as the dataset (and model) grows. So an inverse relationship between L & D (& the rest).
  • Tying them into equations would be some constants (scaling, exponential, alpha, beta, etc.), unknown for now but identified later empirically (see the fitted form below).
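
For reference, the parametric form fitted empirically in the Chinchilla paper (Hoffmann et al., 2022) ties the factors together roughly as:

    L(N, D) = E + A / N^alpha + B / D^beta

    with approximate fitted constants E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, alpha ≈ 0.34, beta ≈ 0.28

Minimizing this under a fixed compute budget (C ≈ 6ND) is what yields the often-quoted rule of thumb of roughly 20 training tokens per model parameter.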

Beyond common sense, the theoretical foundations linking the factors aren't available right now. Perhaps the problem is by nature hard (NP).

The next best thing, then, is to somehow work out the relationships/ bounds empirically: work with existing Deep Learning models, LLMs, etc. using large datasets spanning TBs/ PBs of data and trillions of parameters, with compute budgets cumulatively spanning years.

Papers by Hestness & Narang, Kaplan, and Chinchilla are all attempts along the empirical route. So are more recent papers like Mosaic, DeepSeek, MoE, Llama3, Microsoft among many others.

The key takeaways being:

  • The scale & bounds are getting larger over time.
  • Models from a couple of years back are found to be grossly under-trained in terms of the volume of training data used. They should have been trained on an order of magnitude more training data for optimal training, without risk of overfitting.
  • Conversely, the previously used data volumes are suited to much smaller models (SLMs), with inference capabilities similar to those older LLMs.

References

  • https://en.wikipedia.org/wiki/Neural_scaling_law
  • https://lifearchitect.ai/chinchilla/
  • https://medium.com/@raniahossam/chinchilla-scaling-laws-for-large-language-models-llms-40c434e4e1c1
  • https://bigscience.huggingface.co/blog/what-language-model-to-train-if-you-have-two-million-gpu-hours
  • https://medium.com/nlplanet/two-minutes-nlp-scaling-laws-for-neural-language-models-add6061aece7
  • https://lifearchitect.ai/the-sky-is-bigger/

Friday, February 28, 2025

Diffusion Models

Diffusion

  •     Forward, Backward (Learning), Sampling (Random)    
  •     Continuous Diffusion
  •     VAE, Denoising Autoencoder
  •     Markov Chains
  •     U-Net
  •     DALL-E (OpenAI), Stable Diffusion,
  •     Imagen, Muse, VEO (Google)
  •     LLaDa, Mercury Coder (Inception)

Non-equilibrium Thermodynamics

  •     Langevin dynamics
  •     Thermodynamic Equilibrium - Boltzmann Distribution
  •     Wiener Process - Multidimensional Brownian Motion
  •     Energy Based Models

Gaussian Noise

  •     Denoising
  •     Noise/ Variance Schedule
  •     Derivation by Reparameterization

Variational Inference    

  •     Denoising Diffusion Probabilistic Model (DDPM)
  •     Noise Prediction Networks    
  •     Denoising Diffusion Implicit Model (DDIM)

Loss Functions

  •     Variational Lower Bound (VLB)
  •     Evidence Lower Bound (ELBO)
  •     Kullback-Leibler divergence (KL divergence)
  •     Mean Squared Error (MSE)

Score Based Generative Model

  •     Annealing
  •     Noise conditional score network (NCSN)
  •     Equivalence: DDPM and Score Based Generative Models

Conditional (Guided) Generation

  •     Classifier Guidance    
  •     Classifier Free Guidance (CFG)

Latent Variable Generative Model

  •     Latent Diffusion Model (LDM)
  •     Lower Dimension (Latent) Space

References:

  • https://en.wikipedia.org/wiki/Diffusion_model
  • https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction
  • https://www.ibm.com/think/topics/diffusion-models
  • https://hackernoon.com/what-is-a-diffusion-llm-and-why-does-it-matter
  • Large Language Diffusion Models (LLaDA): https://arxiv.org/abs/2502.09992



Sunday, January 26, 2025

Mechanistic Interpretability

  • A clearer, better understanding of how Neural Networks work (white box).
  • Strong grounds for Superposition: n-dimensions (neurons) represent more than n-features

References

  • https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=EuO4CLwSIzX7AEZA1ZOsnwwF
  • https://www.neelnanda.io/mechanistic-interpretability/glossary
  • https://transformer-circuits.pub/2022/toy_model/index.html
  • https://www.anthropic.com/research/superposition-memorization-and-double-descent
  • https://transformer-circuits.pub/2023/toy-double-descent/index.html 

Friday, January 24, 2025

State Space Models

  • Vector Space of States (of the System)
  • Alt. to Transformers, reducible to one another 
 
        (Image source: https://en.wikipedia.org/wiki/State-space_representation)
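
For reference, the canonical (continuous-time, linear) state-space form from the Wikipedia page referenced in this post, with state vector x, input u and output y:

    x'(t) = A x(t) + B u(t)     (state equation)
    y(t)  = C x(t) + D u(t)     (output equation)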

References

  • https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state
  • https://huggingface.co/blog/lbourdois/ssm-2022
  • https://huggingface.co/blog/lbourdois/get-on-the-ssm-train
  • https://en.wikipedia.org/wiki/State-space_representation

Monday, January 6, 2025

Spark API Categorization

A way to categorize Spark API features:

  • Flow of data is generally across the category swim lanes, from creation of a new Spark Context, to reading data using I/O, to Filter, Map/ Transform, Reduce/ Agg, etc. Actions.
  • Processing is lazy up to the Transformation stage.
  • Steps only get executed once an Action is invoked.
  • Post Actions (Reduce, Collect, etc.) there could again be I/O, thus the reverse flow from Action.
  • Partition is a cross-cutting concern across all layers. I/O, Transformations and Actions could operate across all or a few Partitions.
  • forEach on the Stream could be at either the Transform or the Action level.

The diagram is based on code within various Spark test suites.
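
A small sketch of the lazy Transformation vs Action behaviour using the Java API (assumes the Spark dependencies are on the classpath; names are illustrative):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LazyEvalDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LazyEvalDemo").setMaster("local[2]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // I/O + Transformations: nothing executes yet (lazy), 2 partitions
                JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);
                JavaRDD<Integer> evensSquared = nums.filter(n -> n % 2 == 0)
                                                    .map(n -> n * n);

                // Actions: these trigger the actual computation across the partitions
                long count = evensSquared.count();
                int sum = evensSquared.reduce(Integer::sum);
                System.out.println("count=" + count + ", sum=" + sum);
            }
        }
    }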

Thursday, January 2, 2025

Mocked Kinesis (Localstack) with PySpark Streaming

Continuing with the same PySpark (ver 2.1.0, Python3.5, etc.) setup explained in an earlier post. In order to connect to the mocked Kinesis stream on Localstack from PySpark, use the kinesis_wordcount_asl.py script located in the Spark external/ (connector/) folder.

(a) Update value of master in kinesis_wordcount_asl.py

Update value of master(local[n], spark://localhost:7077, etc) in SparkContext in kinesis_wordcount_asl.py:
    sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl",master="local[2]")

(b) Add aSpark compiled jars to Spark Driver/ Executor Classpath

As explained in step (III) of an earlier post, to work with Localstack a few changes were done to the KinesisReceiver.scala onStart() to explicitly set endPoint on kinesis, dynamoDb, cloudWatch clients. Accordingly the compiled aSpark jars with the modifications need to be added to Spark Driver/ Executor classpath.

     export aSPARK_PROJ_HOME="/Download/Location/aSpark"
    export SPARK_CLASSPATH="${aSPARK_PROJ_HOME}/target/original-aSpark_1.0-2.1.0.jar:${aSPARK_PROJ_HOME}/target/scala-2.11/classes:${aSPARK_PROJ_HOME}/target/scala-2.11/jars/*"

  •  For Spark Standalone mode: "spark.executor.extraClassPath" needs to be set in either spark-defaults.conf or added as a SparkConf to SparkContext (see (II)(a))

(c) Ensure SPARK_HOME, PYSPARK_PYTHON & PYTHONPATH variables are exported.

(d) Run kinesis_wordcount_asl

    python3.5 ${SPARK_HOME}/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py SampleKinesisApplication myFirstStream http://localhost:4566/ us-east-1

    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"

  • Count of the words streamed (put) will show up on the kinesis_wordcount_asl console
 

Wednesday, January 1, 2025

Spark Streaming with Kinesis mocked on Localstack

In this post we get a Spark streaming application working with an AWS Kinesis stream, a mocked version of Kinesis running locally on Localstack. In earlier posts we have explained how to get Localstack running and various AWS services up on Localstack. The client connections to the AWS services (Localstack) are done using the AWS cli and AWS Java-Sdk v1.

Environment: This set-up continues on Ubuntu 20.04, with Java-8, Maven-3.6x, Docker-24.0x, Python3.5, PySpark/ Spark-2.1.0, Localstack-3.8.1, AWS Java-Sdk-v1 (ver. 1.12.778).

Once the Localstack installation is done, steps to follow are:

(I) Start Localstack
    # Start locally
    localstack start

    That should get Localstack running on: http://localhost:4566

(II) Check Kinesis services from CLI on Localstack

    # List Streams
    aws --endpoint-url=http://localhost:4566 kinesis list-streams

    # Create Stream
    aws --endpoint-url=http://localhost:4566 kinesis create-stream --stream-name myFirstStream --shard-count 1

    # List Streams
    aws --endpoint-url=http://localhost:4566 kinesis list-streams

    # describe-stream-summary
    aws --endpoint-url=http://localhost:4566 kinesis describe-stream-summary --stream-name myFirstStream

    # Put Record
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"

(III) Connect to Kinesis from Spark Streaming

    # Build
    mvn install -DskipTests=true -Dcheckstyle.skip

    # Run JavaKinesisWordCountASL with Localstack

  • JavaKinesisWordCountASL SampleKinesisApplication myFirstStream http://localhost:4566/

(IV) Add Data to Localstack Kinesis & View Counts on Console
    a) Put Record from cli
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"

    b) Alternatively Put records from Java Kinesis application
    Download, build & run AmazonKinesisRecordProducerSample.java
    
    c) Now check the output console of JavaKinesisWordCountASL run in step (III) above. Counts of the words streamed from Localstack Kinesis will be displayed on the console.

Saturday, December 28, 2024

Debugging Spark Scala/ Java components

In continuation of the earlier post on debugging Pyspark, here we show how to debug the Spark Scala/ Java side. Spark is a distributed processing environment, is built in Scala, and has APIs for connecting from different languages like Python & Java. The high level Pyspark Architecture is shown here.

For debugging the Spark Scala/ Java components, as these run within the JVM, it's easy to make use of Java Tooling Options for remote debugging from any compatible IDE such as Idea (Eclipse no longer supports Scala). A few points to remember:

  • Multiple JVMs in Spark: Since Spark is a distributed application, it involves several components like the Master/ Driver, Slave/ Worker, Executor. In a real world, truly distributed setting, each of the components runs in its own separate JVM on separate physical machines. So be clear about which component exactly you want to debug & set up the Tooling options accordingly, targeting that specific JVM instance.

  • Two-way connectivity between IDE & JVM: At the same time, there should be two-way network connectivity between the IDE (debugger) & the running JVM instance.

  • Debugging Locally: Debugging is mostly a dev stage activity & done locally. So it may be better to debug on a Spark cluster running locally. This could be either a Spark Standalone cluster or Spark run in local mode (master=local[n]/ local[*]).

Steps:

Environment: Ubuntu-20.04 having Java-8, Spark/Pyspark (ver 2.1.0), Python3.5, Idea-Intelli (ver 2024.3), Maven3.6

(I) Idea Remote JVM Debugger
In Idea > Run/ Debug Config > Edit > Remote JVM Debug.

  • Start Debugger in Listen to Remote JVM Mode
  • Enable Auto Restart

(II)(a) Debug Spark Standalone cluster
Key features of the Spark Standalone cluster are:

  • Separate JVMs for Master, Slave/ Worker, Executor
  • All could run on a single dev box, provided enough resources (Mem, CPU) are available
  • Scripts inside SPARK_HOME/sbin folder like start-master.sh, start-slave.sh (start-worker.sh), etc to start the services

In order to debug, let's say, an Executor, a Spark Standalone cluster could be started off with 1 Master, 1 Worker & 1 Executor.   

    # Start Master (Check http://localhost:8080/ to get Master URL/ PORT)
    ./sbin/start-master.sh 

    # Start Slave/ Worker
    ./sbin/start-slave.sh spark://MASTER_URL:<MASTER_PORT>

    # Add Jvm tooling to extraJavaOption to spark-defaults.conf
    spark.executor.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=n

    # The value could instead be passed as a conf to SparkContext in Python script:
    from pyspark.conf import SparkConf
    confVals = SparkConf()
    confVals.set("spark.executor.extraJavaOptions","-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=y")
    sc = SparkContext(master="spark://localhost:7077",appName="PythonStreamingStatefulNetworkWordCount1",conf=confVals)

(II)(b) Debug locally with master="local[n]"

  • In this case a local Spark cluster is spun up via scripts like spark-shell, spark-submit, etc. located inside the bin/ folder
  • The different components Master, Worker, Executor all run within one JVM as threads, where the value n is the number of threads (set n=2)
  • Export JAVA_TOOL_OPTIONS before in the terminal from which the Pyspark script will be run

        export JAVA_TOOL_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,address=5005"

(III) Execute PySpark Python script
    python3.5 ${SPARK_HOME}/examples/src/main/python/streaming/network_wordcount.py localhost 9999

This should start off the Pyspark & connect the Executor JVM to the waiting Idea Remote debugger instance for debugging.

Thursday, December 26, 2024

Debugging Pyspark in Eclipse with PyDev

An earlier post shows how to run Pyspark (Spark 2.1.0) in Eclipse (ver 2024-06 (4.32)) using the PyDev (ver 12.1) plugin. The OS is Ubuntu-20.04 with Java-8, & an older version of Python3.5 compatible with PySpark (2.1.0).

While the Pyspark code runs fine within Eclipse, when trying to Debug an error is thrown:

    Pydev: Unexpected error setting up the debugger: Socket closed". 

This is due to a higher Python requirement (>3.6) for pydevd debugger module within PyDev. Details from the PyDev installations page clearly state that Python3.5 is compatible only with PyDev9.3.0. So it's back to square one.

Install/ replace Pydev 12.1 with PyDev 9.3 in Eclipse

  • Uninstall Pydev 12.1 (Help > About > Installation details > Installed software > Uninstall PyDev plugin)
  • Also manually remove all Pydev folders from eclipse/plugins folder (com.python.pydev.* & org.python.pydev.*)
  • Unzip PyDev 9.3.0 into the eclipse/dropins folder
  • Restart eclipse & check (Help > About > Installation details > Installed software)

Test debugging Pyspark
Refer to the steps to Run Pyspark on PyDev in Eclipse, & ensure the PyDev Interpreter is python3.5, PYSPARK_PYTHON variable and PYTHONPATH are correctly setup.

Finally, right click on network_wordcount.py > Debug as > Python run
(Set up Debug Configurations > Arguments & provide program arguments, e.g. "localhost 9999", & any breakpoints in the python code to test).

 

Wednesday, December 25, 2024

Pyspark in Eclipse with PyDev

This post captures the steps to get Spark (ver 2.1) working within Eclipse (ver 2024-06 (4.32)) using the PyDev (ver 12.1) plugin. The OS is Ubuntu-20.04 with Java-8, Python 3.x & Maven 3.6.

(I) Compile Spark code

The Spark code is downloaded & compiled from a location "SPARK_HOME".

    export SPARK_HOME="/SPARK/DOWNLOAD/LOCATION"

    cd ${SPARK_HOME}

    mvn install -DskipTests=true -Dcheckstyle.skip -o

(Issue: For a "Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.8.0:check" error,
copy scalastyle-config.xml to the sub-project (next to pom.xml) having the error.)

(II) Compile Pyspark
    (a) Install Pyspark dependencies

  • Install Pandoc

        sudo apt-get install pandoc

  • Install a compatible older Pypandoc (ver 1.5)

        pip3 install pypandoc==1.5

  • Install Python 3.5 (from the deadsnakes PPA)

        sudo add-apt-repository ppa:deadsnakes/ppa

        sudo apt-get install python3.5
    
    (b) Build Pyspark

        cd ${SPARK_HOME}/python

        export PYSPARK_PYTHON=python3.5

        # Build - creates ${SPARK_HOME}/python/build
        python3.5 setup.py build

        # Dist - creates ${SPARK_HOME}/python/dist
        python3.5 setup.py sdist

    (c) export PYTHON_PATH

    export PYTHONPATH=$PYTHONPATH:${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${SPARK_HOME}/python/pyspark/shell.py;
    
(III) Run Pyspark from console
    Pyspark setup is done & the standalone examples code should run. Ensure variables ${SPARK_HOME}, ${PYSPARK_PYTHON} & ${PYTHONPATH} are all correctly exported (steps (I), (II)(b) & (II)(c) above):

    python3.5 ${SPARK_HOME}/python/build/lib/pyspark/examples/src/main/python/streaming/network_wordcount.py localhost 9999

(IV) Run Pyspark on PyDev in Eclipse

    (a) Eclipse with PyDev plugin installed:
    Set-up tested on Eclipse (ver 2024-06 (4.32.0)) and PyDev plugin (ver 12.1x).
 
    (b) Import the spark project in Eclipse
    There would be compilation errors due to missing Spark Scala classes.

    (c) Add Target jars for Spark Scala classes
    Eclipse no longer has support for Scala so the corresponding Spark Scala classes are missing. A work around is to add the Scala target jars compiled using mvn (in step (I) above) manually to: 

        spark-example > Properties > Java Build Path > Libraries
   

Eclipse Build Path Add Libraries

    (d) Add PyDev Interpreter for Python3.5
    Go to: spark-example > Properties > PyDev - Interpreter/ Grammar > Click to configure an Interpreter not listed > Open Interpreter Preferences Page > New > Choose from List:  

    & Select /usr/bin/python3.5

 Eclipse - Pydev Interpreter Python3.5

 On the same page, under the Environment tab add a variable named "PYSPARK_PYTHON" having value "python3.5"

Eclipse - Pydev Interpreter Python3.5 variable

    (e) Set up PYTHONPATH for PyDev

    spark-example > Properties > PyDev - PYTHONPATH

  • Under String Substitution Variables add a variable with name "SPARK_HOME" & value "/SPARK/DOWNLOAD/LOCATION" (same location added in Step (I)). 
 
Eclipse - Pydev PYTHONPATH variable

  • Under External Libraries, Choose Add based on variable, add these entries:

                 ${SPARK_HOME}/python/

                 ${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip

   With that Pyspark should be properly set-up within PyDev.

    (f) Run Pyspark from Eclipse

    Right click on network_wordcount.py > Run as > Python run
    (You can further change Run Configurations > Arguments & provide program arguments, e.g. "localhost 9999")


Saturday, November 30, 2024

Scala IDE no more

Sad that Scala IDE for Eclipse is no longer supported. While it was great to have Scala integrated within Eclipse, guess the headwinds were too strong!

Thursday, November 28, 2024

Working with Moto & Lambci Lambda Docker Images

Next up on Mock for clouds is Moto. Moto is primarily for running tests within the Python ecosystem.

Moto does offer a standalone server mode for other languages. The general sense was that the standalone Moto server would offer the AWS services which will be accessible from the cli & non-Python SDKs. Gave Moto a shot with the same AWS services tried with Localstack.

(I) Set-up

While installing Moto, ran into a couple of dependency conflicts across moto, boto3, botocore, requests, s3transfer & in turn with the installed awscli. With some effort reached a sort of dynamic equilibrium with (installed via pip):

  • awscli        1.36.11
  • boto3         1.35.63
  • botocore      1.35.70
  • moto          5.0.21
  • requests      2.32.2
  • s3transfer    0.10.4


(II) Start Moto Server

    # Start Moto
    moto_server -p3000

    # Start Moto as Docker (Sticking to this option)
    docker run --rm -p 5000:5000 --name moto motoserver/moto:latest

(III) Invoke services on Moto

    (a) S3
    # Create bucket
    aws --endpoint-url=http://localhost:5000 s3 mb s3://test-buck

    # Copy item to bucket
    aws --endpoint-url=http://localhost:5000 s3 cp a1.txt s3://test-buck

    # List bucket
    aws --endpoint-url=http://localhost:5000 s3 ls s3://test-buck

--
    (b) SQS
    # Create queue
    aws --endpoint-url=http://localhost:5000 sqs create-queue --queue-name test-q

    # List queues
    aws --endpoint-url=http://localhost:5000 sqs list-queues

    # Get queue attribute
    aws --endpoint-url=http://localhost:5000 sqs get-queue-attributes --queue-url http://localhost:5000/123456789012/test-q --attribute-names All

--
    (c) IAM
    ## Issue: Moto does a basic check of user role & gives an AccessDeniedException when calling Lambda CreateFunction operation
    ## So have to create a specific IAM role (https://github.com/getmoto/moto/issues/3944#issuecomment-845144036) in Moto for the purpose.

    aws iam --region=us-east-1 --endpoint-url=http://localhost:5000 create-role --role-name "lambda-test-role" --assume-role-policy-document "some policy" --path "/lambda-test/"

--
    (d) Lambda
    # Create Java function

    aws --endpoint-url=http://localhost:5000 lambda create-function --function-name test-j-div --zip-file fileb://original-java-basic-1.0-SNAPSHOT.jar --handler example.HandlerDivide::handleRequest --runtime java8.al2 --role arn:aws:iam::123456789012:role/lambda-test/lambda-test-role

    # List functions
    aws --endpoint-url=http://localhost:5000 lambda list-functions

    # Invoke function (Fails!)
    aws --endpoint-url=http://localhost:5000 lambda invoke --function-name test-j-div --payload '[235241,17]' outputJ.txt

    The invoke function fails with the message:
    "WARNING - Unable to parse Docker API response. Defaulting to 'host.docker.internal'
    <class 'json.decoder.JSONDecodeError'>::Expecting value: line 1 column 1 (char 0)
    error running docker: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))".
    
    Retried this from the AWS Java-SDK & for other nodejs & python functions, but nothing worked. While this remains unsolved for now, check out the Lambci docker option next.

(IV) Invoke services on Lambci Lambda Docker Images:

    Moto Lambda docs also mention its dependent docker images from the lambci/lambda & mlupin/docker-lambda (for new ones). Started off with a slightly older java8.al2 docker image from lambci/lambda.

    # Download lambci/lambda:java8.al2
    docker pull lambci/lambda:java8.al2
    
    # Run lambci/lambda:java8.al2.   
    ## Ensure to run from the location which has the unzipped (unjarred) Java code
    ## Here it's run from a folder called data_dir_java which has the unzipped (unjarred) class file folders: com/, example/, META-INF/, net/ 

    docker run -e DOCKER_LAMBDA_STAY_OPEN=1 -p 9001:9001 -v "$PWD":/var/task:ro,delegated --name lambcijava8al2 lambci/lambda:java8.al2 example.HandlerDivide::handleRequest

    # Invoke Lambda
    aws --endpoint-url=http://localhost:9001 lambda invoke --function-name test-j-div --payload '[235241,17]' outputJ.txt

    This works!
 

Tuesday, November 26, 2024

AWS Lambda on Localstack using Java-SdK-v1

Continuing with Localstack, next is a closer look into the code to deploy and execute AWS Lambda code on Localstack from AWS Java-Sdk-v1. The localstack-lambda-java-sdk-v1 code uses the same structure used in localstack-aws-sdk-examples & fills in for the missing AWS Lambda bit.

The LambdaService class has 3 primary methods - listFunctions(), createFunction() & invokeFunction(). The static AWSLambda client is set up with mock credentials and points to the Localstack endpoint.
 
The main() method first creates the function (createFunction()), if it does not exist.

  • It builds a CreateFunctionRequest object with the handler, runtime, role, etc specified
  • It also reads the jar file of the Java executable from the resources folder into a FunctionCode object & adds it to the CreateFunctionRequest
  • Next a call is made to the AWSLambda client createFunction() with the CreateFunctionRequest which hits the running Localstack instance (Localstack set-up explained earlier).


If all goes well, control returns to main(), which invokes listFunctions() to show details of the created Lambda function (& all other existing functions).

Finally, there is a call from main() to the invokeFunction() method.

  • It invokes the recently created function with an InvokeRequest object filled with some test values as the payload.
  • The response from the invoked function is an InvokeResult object whose payload contains the result of the lambda function computation (see the sketch below).
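
A condensed version of that invocation path looks roughly like this (a sketch using the AWS Java-Sdk-v1 classes; the endpoint, credentials, function name and payload simply mirror the Localstack set-up from the earlier posts):

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.lambda.AWSLambda;
    import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
    import com.amazonaws.services.lambda.model.InvokeRequest;
    import com.amazonaws.services.lambda.model.InvokeResult;

    import java.nio.charset.StandardCharsets;

    public class InvokeOnLocalstack {
        public static void main(String[] args) {
            // Mock credentials + Localstack endpoint
            AWSLambda lambda = AWSLambdaClientBuilder.standard()
                    .withEndpointConfiguration(new EndpointConfiguration("http://localhost:4566", "us-east-1"))
                    .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("test", "test")))
                    .build();

            InvokeRequest request = new InvokeRequest()
                    .withFunctionName("test-j-div")
                    .withPayload("[235241,17]");

            InvokeResult result = lambda.invoke(request);
            System.out.println("Status: " + result.getStatusCode());
            System.out.println("Payload: " + new String(result.getPayload().array(), StandardCharsets.UTF_8));
        }
    }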

Comments welcome, localstack-lambda-java-sdk-v1 is available to play around!

 

Monday, November 25, 2024

Getting Localstack Up and Running

In continuation to the earlier post on mocks for clouds, this article does a deep dive into getting up & running with Localstack. This is a consolidation of the steps & best practices shared here, here & here. The Localstack set-up is on a Ubuntu-20.04, with Java-8x, Maven-3.8x, Docker-24.0x. 

(I) Set-up

    # Install awscli
     sudo apt-get install awscli

    # Install localstack ver 3.8
        ## Issue1: By default pip pulls in version 4.0, which gives an error:
        ## ERROR: Could not find a version that satisfies the requirement localstack-ext==4.0.0 (from localstack) 


        python3 -m pip install localstack==3.8.1

--

    # Add to /etc/hosts
    127.0.0.1    localhost.localstack.cloud
    127.0.0.1    s3.localhost.localstack.cloud

--

    # Configure AWS from cli
    aws configure
    aws configure set default.region us-east-1
    aws configure set aws_access_key_id test
    aws configure set aws_secret_access_key test

    ## Manually configure AWS
    Add to ~/.aws/config
    endpoint_url = http://localhost:4566

    ## Add mock credentials
    Add to ~/.aws/credentials
    aws_access_key_id = test
    aws_secret_access_key = test

--

    # Download docker images needed by the Lambda function
        ## Issue 2: Do this before hand, Localstack gets stuck
        ## at the download image stage unless it's already available

    ## Pull java:8.al2
    docker pull public.ecr.aws/lambda/java:8.al2

    ## Pull nodejs (required for other nodejs Lambda functions)
    docker pull public.ecr.aws/lambda/nodejs:18

    ## Check images downloaded
    docker image ls

(II) Start Localstack

    # Start locally
    localstack start

    # Start as docker (add '-d' for daemon)
       ## Issue 3: Local directory's mount should be as per sample docker-compose

    docker-compose -f docker-compose-localstack.yaml up

    # Localstack up on URL's
    http://localhost:4566
    http://localhost.localstack.cloud:4566

    # Check Localstack Health
    curl http://localhost:4566/_localstack/info
    curl http://localhost:4566/_localstack/health

(III) AWS services on Localstack from CLI

  (a) S3
    # Create bucket named "test-buck"
    aws --endpoint-url=http://localhost:4566 s3 mb s3://test-buck

    # Copy item to bucket
    aws --endpoint-url=http://localhost:4566 s3 cp a1.txt s3://test-buck

    # List bucket
    aws --endpoint-url=http://localhost:4566 s3 ls s3://test-buck

--

  (b) Sqs
    # Create queue named "test-q"

    aws --endpoint-url=http://localhost:4566 sqs create-queue --queue-name test-q

    # List queues

    aws --endpoint-url=http://localhost:4566 sqs list-queues

    # Get queue attribute

    aws --endpoint-url=http://localhost:4566 sqs get-queue-attributes --queue-url http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/test-q --attribute-names All

--

  (c) Lambda
    aws --endpoint-url=http://localhost:4566 lambda list-functions

    # Create Java function
    aws --endpoint-url=http://localhost:4566 lambda create-function --function-name test-j-div --zip-file fileb://original-java-basic-1.0-SNAPSHOT.jar --handler example.HandlerDivide::handleRequest --runtime java8.al2 --role arn:aws:iam::000000000000:role/lambda-test

    # List functions
    aws --endpoint-url=http://localhost:4566 lambda list-functions

    # Invoke Java function
    aws --endpoint-url=http://localhost:4566 lambda invoke --function-name test-j-div --payload '[200,9]' outputJ.txt

    # Delete function
    aws --endpoint-url=http://localhost:4566 lambda delete-function --function-name test-j-div

(IV) AWS services on Localstack from Java-SDK

    # For S3 & Sqs - localstack-aws-sdk-examples, java sdk

    # For Lambda - localstack-lambda-java-sdk-v1

 

Thursday, November 21, 2024

Killing me softly

With your air. With your smog. With your AQIs. With your chart topping PM levels. Delhi this annual event of yours, wish we could skip!

Familiar noises echoing from the four estates are no balm to the troubled sinuses. They shout at the top of their lungs, we cough & sneeze from the bottom of ours.

Solution, now what's that? From whom, when, where & why? Since one can't really run away, perhaps we need to just hibernate or hide. Better still, grin and bear this way of lieF (sic). 

Saturday, November 16, 2024

Mutable Argument Capture with Mockito

There are well known scenarios like caching, pooling, etc. wherein object reuse is common. Testing these cases using a framework like Mockito could run into problems, especially if there's a need to verify the arguments sent by the Caller of a Service, where the Service is mocked.

ArgumentCaptor (mockito) fails because it keeps references to the argument objects, which due to reuse by the caller hold only the last/ latest updated value.
The discussion here led to using a Void Answer as one possible way to solve the issue. The following (junit-3+, mockito-1.8+, commons-lang-2.5) code explains the details.

1. Service: 

public class Service {

    public void serve(MutableInt value) {
        System.out.println("Service.serve(): " + value);
    }

    // ...
}

 

2. Caller:

public class Caller {

    public void callService(Service service) {
        MutableInt value = new MutableInt();
        value.setValue(1);
        service.serve(value);

        value.setValue(2);
        service.serve(value);
    }

    // ...
}

 

3. Tests:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import junit.framework.TestCase;

import org.apache.commons.lang.math.MutableInt;
import org.mockito.ArgumentCaptor;
import org.mockito.Mock;
import org.mockito.MockitoAnnotations;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

import static org.mockito.Matchers.any;
import static org.mockito.Mockito.doAnswer;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;

public class MutableArgsTest extends TestCase {

    List<MutableInt> multiValuesWritten;

    @Mock
    Service service;

    @Override
    protected void setUp() {
        MockitoAnnotations.initMocks(this);
        multiValuesWritten = new ArrayList<MutableInt>();
    }

    /**
     * Failure with ArgumentCaptor
     */
    public void testMutableArgsWithArgCaptorFail() {
        Caller caller = new Caller();
        ArgumentCaptor<MutableInt> valueCaptor = ArgumentCaptor.forClass(MutableInt.class);

        caller.callService(service);
        verify(service, times(2)).serve(valueCaptor.capture());

        // AssertionFailedError: expected:<[1, 2]> but was:<[2, 2]>
        assertEquals(Arrays.asList(new MutableInt(1), new MutableInt(2)),
                valueCaptor.getAllValues());
    }

    /**
     * Success with Answer
     */
    public void testMutableArgsWithDoAnswer() {
        Caller caller = new Caller();
        doAnswer(new CaptureArgumentsWrittenAsMutableInt<Void>())
                .when(service).serve(any(MutableInt.class));

        caller.callService(service);
        verify(service, times(2)).serve(any(MutableInt.class));

        // Works!
        assertEquals(new MutableInt(1), multiValuesWritten.get(0));
        assertEquals(new MutableInt(2), multiValuesWritten.get(1));
    }

    /**
     * Captures Arguments to the Service.serve() method:
     * - Multiple calls to serve() happen from the same caller
     * - Along with reuse of the MutableInt argument objects by the caller
     * - The argument value is copied to a new MutableInt object & that copy is captured
     */
    public class CaptureArgumentsWrittenAsMutableInt<Void> implements Answer<Void> {
        public Void answer(InvocationOnMock invocation) {
            Object[] args = invocation.getArguments();
            multiValuesWritten.add(new MutableInt(args[0].toString()));
            return null;
        }
    }
}