Algorithms, Design, Code and more: Python

Wednesday, November 26, 2025

Explainable AI

With the widespread adoption of large Machine Learning (ML) models, there's a real need to understand how these models work. Otherwise the model appears to be a black box doing its thing, without the end user really knowing the whys/ hows behind the model's responses, choices, decisions, etc. Looking inside the model - the white-box approach - while possible, is simply not practical for almost all users.

Local Interpretable Model-Agnostic Explanations (LIME) & Shapley Additive Explanations (SHAP) are two black-box techniques that help explain the workings of such models. The key idea behind both is:

To generate some (synthetic) input data from actual data, with some of the features (such as income, age, etc.) altered at random.
Then to run the model on the generated input data and use the output to understand the effects of the altered features (individually or in combination). This reveals the importance/ relevance of each feature to the outputs of the model.
E.g. in a loan approval/ rejection scenario, by altering two features in the input, income level & gender, and testing, one might discover that income level has an effect on the decision but gender does not.
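
The idea above can be sketched in a few lines of plain Python. This is only a simplified perturbation test, not LIME or SHAP themselves: the approve_loan "model", the applicant records and the candidate values are all made up for illustration.

```python
import random

def approve_loan(applicant):
    """Hypothetical black-box model: it decides purely on income."""
    return 1.0 if applicant["income"] >= 50_000 else 0.0

def perturbation_importance(model, rows, feature, candidates, trials=200, seed=0):
    """Fraction of random alterations of `feature` that change the model's output."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(trials):
        row = rng.choice(rows)
        altered = dict(row)                      # copy, so the real data stays intact
        altered[feature] = rng.choice(candidates)
        if model(altered) != model(row):
            changed += 1
    return changed / trials

# Made-up applicants standing in for the actual data.
applicants = [
    {"income": 30_000, "gender": "F"},
    {"income": 80_000, "gender": "M"},
    {"income": 55_000, "gender": "F"},
    {"income": 20_000, "gender": "M"},
]

income_effect = perturbation_importance(
    approve_loan, applicants, "income", [10_000, 40_000, 60_000, 90_000]
)
gender_effect = perturbation_importance(approve_loan, applicants, "gender", ["F", "M"])
print(f"income: {income_effect:.2f}, gender: {gender_effect:.2f}")
```

Altering income flips a share of the decisions, while altering gender never does, which is the kind of signal both techniques build on.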

With that background, let's look at SHAP for language models that take texts as input. Here features are the words (tokens) that comprise the input string.

For an input like: "Glad to see you",

The features are: "Glad", "to", "see", "you"

SHAP explains the impact of each word (token) on the output of the model by passing in various altered inputs with words MASKED:
"* to see you", "Glad to * you", ...

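To get a feel for the masked inputs, here is a small sketch that enumerates every masked variant of a short text. The function name masked_variants is made up for this illustration, and a real library would most likely evaluate only a sample of these combinations rather than all of them.

```python
from itertools import product

def masked_variants(text, mask="*"):
    """Yield every variant of `text` with a non-empty subset of its words masked."""
    words = text.split()
    for keep_flags in product([True, False], repeat=len(words)):
        if all(keep_flags):
            continue  # skip the unaltered input
        yield " ".join(w if keep else mask for w, keep in zip(words, keep_flags))

variants = list(masked_variants("Glad to see you"))
for variant in variants:
    print(variant)
```

The number of variants doubles with every extra word, which is why sampling becomes necessary for longer texts.
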
TextClassificationTorchShap.py shows how SHAP works with the Text Classification Model trained using the Imdb dataset. The code requires shap to be installed:

pip3 install shap

In terms of its working, it loads up the pre-trained Text Classification model and vocabulary. Next it plugs into the shap library using a custom tokenizer that generates token_ids & offsets for the given input data.

masker = maskers.Text(custom_tokenizer, mask_token=SPECIAL_TOKEN_UNK)
explainer = shap.Explainer(predict,masker=masker)
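
A custom tokenizer for this purpose could look roughly like the sketch below. It is not the code from TextClassificationTorchShap.py: the toy VOCAB is invented, and the returned keys "input_ids" and "offset_mapping" follow the transformers-style convention that, as far as I understand, the shap Text masker expects from a tokenizer.

```python
import re

# Hypothetical toy vocabulary; id 0 stands in for unknown words.
VOCAB = {"<unk>": 0, "this": 1, "is": 2, "a": 3, "great": 4, "one": 5, "to": 6, "watch": 7}

def custom_tokenizer(text, return_offsets_mapping=True):
    """Split on word characters and return token ids plus (start, end) character offsets."""
    input_ids, offsets = [], []
    for match in re.finditer(r"\w+", text):
        input_ids.append(VOCAB.get(match.group().lower(), VOCAB["<unk>"]))
        offsets.append(match.span())
    out = {"input_ids": input_ids}
    if return_offsets_mapping:
        out["offset_mapping"] = offsets
    return out

encoded = custom_tokenizer("This is a great one to watch.")
print(encoded["input_ids"])
print(encoded["offset_mapping"])
```

The offsets let the masker map each token back to its position in the original string, so that individual words can be replaced by the mask token. Note that this sketch simply drops punctuation.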

Finally, shap is called with some sample input text, which it evaluates with words masked at random. SHAP collects the outputs, which can be used to generate a visual report of the impact of the different words, as seen below.

The model classifies any given input text as either POSITIVE (score near 1) or NEGATIVE (score near 0). The figure shows the output for two inputs: "This is a great one to watch." & "What a long drawn boring affair to the end credits."

Let's look first at "This is a great one to watch.":

There is a base value = 0.539161 which is the model's output for a completely MASKED out input, i.e. "* * * * * * *"
The words "to w..", "This is" move up the score to 0.7
In addition, the words "a great" move up the score to 0.996787, the actual output of the model for the complete input text "This is a great one to watch."
The model rightly classifies this as POSITIVE with a score of 0.996787 (close to 1)

Similarly for the text "What a long drawn boring affair to the end credits.":

Completely masked base value = 0.539161.
The key words in this case are "boring affair to the".
The text is rightly classified as NEGATIVE with a score of 0.0280297 (close to 0).
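
The way the base value and the per-word contributions add up to the model's output can be reproduced on a toy example. The sketch below computes exact Shapley values by brute force for a made-up scorer (WORD_WEIGHTS and score are invented and have nothing to do with the trained Imdb model); the shap library itself uses faster approximations.

```python
from itertools import combinations
from math import factorial

WORD_WEIGHTS = {"great": 0.4, "boring": -0.4}  # hypothetical toy "model"

def score(words):
    """Toy scorer: 0.5 for a fully masked input, plus a fixed weight per known word."""
    return 0.5 + sum(WORD_WEIGHTS.get(w, 0.0) for w in words)

def shapley_values(words, model, mask="*"):
    """Return (base value, per-word Shapley values), enumerating all subsets of words."""
    n = len(words)

    def value(kept):
        # Model output when only the word positions in `kept` are left unmasked.
        return model([w if i in kept else mask for i, w in enumerate(words)])

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                kept = set(subset)
                phi += weight * (value(kept | {i}) - value(kept))
        phis.append(phi)
    return value(set()), phis

words = ["This", "is", "a", "great", "one", "to", "watch"]
base, phis = shapley_values(words, score)
for word, phi in zip(words, phis):
    print(f"{word:>6}: {phi:+.3f}")
print("base + contributions:", base + sum(phis), "model output:", score(words))
```

Here the base value is the score of the fully masked input, and the base value plus all the contributions equals the score of the complete text, which is how the figures above are meant to be read.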

Saturday, November 22, 2025

Text Classification from Scratch using PyTorch

The AI/ ML development framework Keras 3.x has recently introduced support for Torch & Jax backends in addition to Tensorflow. However, given Keras's Tensorflow legacy, large sections of the code remain deeply integrated with Tensorflow.

One such piece of code is the text_classification_from_scratch.py from keras-io/examples project. Without tensorflow this piece of code simply doesn't run!

Here's text_classification_torch.py, a Torch/ PyTorch port of the same code. The bits that needed modification were:

Removing all tensorflow related imports
Loading the Imdb text files in "grain" format in place of "tf" format, by passing the appropriate param:

keras.utils.text_dataset_from_directory(format="grain")

Which obviously needed grain to be installed:

pip3 install grain

Using torchtext for Vocab/ Tokenizer/ Vectorizing :

pip3 install torchtext

A few other changes, such as ensuring that the max_features constraint is honoured, the text is standardized and padded, and so on.
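
To illustrate what those last changes involve, here is a library-free sketch of standardizing text, building a vocabulary capped at max_features and padding to a fixed length. It is not the code from text_classification_torch.py: the function names, the reserved ids PAD/ UNK and the sample texts are assumptions made for this example.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown words

def standardize(text):
    """Lowercase the text, drop <br /> tags and strip punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub(r"[^\w\s]", "", text)

def build_vocab(texts, max_features):
    """Map the most frequent words to ids 2..max_features-1 (0 and 1 are reserved)."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    most_common = counts.most_common(max_features - 2)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}

def vectorize(text, vocab, sequence_length):
    """Convert text to exactly `sequence_length` ids, truncating or padding as needed."""
    ids = [vocab.get(w, UNK) for w in standardize(text).split()][:sequence_length]
    return ids + [PAD] * (sequence_length - len(ids))

texts = ["Great, great movie!<br />Great fun.", "A boring movie."]
vocab = build_vocab(texts, max_features=4)
print(vocab)
print(vectorize("A great movie.", vocab, sequence_length=5))
```

Words outside the capped vocabulary fall back to the UNK id, so no id ever reaches max_features.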

Thursday, October 30, 2025

Langchain4j

LangChain is one of the leading Python-based AI/ ML, agentic modelling and integration frameworks. LangChain (and allied frameworks like LangGraph) allows integration with almost all LLMs, Python libraries and tools out there.

Langchain4j is its Java counterpart. Langchain4j allows LLM integrations and workflows to be built using pure Java constructs. It primarily operates as a Java client to the various APIs exposed by the different LLM providers such as OpenAI, Azure, Bedrock, Gemini and so on.

Langchain4j has covered a lot of ground in terms of the supported modules from both the Python and the Java ecosystems. It's actively supported and should be one for the long run.

To get a feel for Langchain4j on a local LLM try out langchain4j-ollama which gets:

Java langchain4j-ollama to talk to

-> Ollama (deployed locally)

-> Hosting the llama3.2:1b model

(I) Get a local Ollama up & running

Refer to the previous post regarding installing/ getting Ollama running locally. Once done, you should have a llama3.2:1b model running & ready to chat locally on:

http://127.0.0.1:11434
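
Before moving on to the Java side, a quick check that the model is really available can save some head scratching. The sketch below (in Python, to stay within the standard library) assumes Ollama's /api/tags endpoint, which as far as I know lists the locally pulled models; the function names are made up.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://127.0.0.1:11434"

def model_names(tags_payload):
    """Extract the model names from the JSON document returned by /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]

def fetch_model_names(base_url=OLLAMA_URL, timeout=5):
    """Ask a running Ollama server which models it has available locally."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
        return model_names(json.load(response))
```

With the server up, "llama3.2:1b" in fetch_model_names() should evaluate to True; a connection error means Ollama isn't listening on that address.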

(II) Download & build langchain4j-ollama project

Clone langchain4j-ollama project & build:

cd </download/folder/langchain4j-ollama>

mvn install

(III) Run langchain4j-ollama tests

Run a couple of the langchain4j-ollama integration tests. Start with OllamaChatModelIT.java. Make sure to update the MODEL_NAME value to the llama3.2:1b model downloaded in step (I) above:

static final String MODEL_NAME = "llama3.2:1b";

That's about it for getting the three pieces integrated & chatting!

Friday, March 28, 2025

Streamlit

Streamlit is a web wrapper for Data Science projects in pure Python. It's a lightweight, simple, rapid prototyping web app framework for sharing scripts.

https://streamlit.io/playground
https://www.restack.io/docs/streamlit-knowledge-streamlit-vs-flask-vs-django
https://docs.streamlit.io/develop/concepts/architecture/architecture
https://docs.snowflake.com/en/developer-guide/streamlit/about-streamlit

Thursday, January 2, 2025

Mocked Kinesis (Localstack) with PySpark Streaming

This post continues with the same PySpark (ver 2.1.0, Python3.5, etc.) setup explained in an earlier post. In order to connect to the mocked Kinesis stream on Localstack from PySpark, use the kinesis_wordcount_asl.py script located in Spark's external/ (connector/) folder.

(a) Update value of master in kinesis_wordcount_asl.py

Update the value of master (local[n], spark://localhost:7077, etc.) in SparkContext in kinesis_wordcount_asl.py:
sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl",master="local[2]")

(b) Add aSpark compiled jars to Spark Driver/ Executor Classpath

As explained in step (III) of an earlier post, to work with Localstack a few changes were done to the KinesisReceiver.scala onStart() to explicitly set endPoint on kinesis, dynamoDb, cloudWatch clients. Accordingly the compiled aSpark jars with the modifications need to be added to Spark Driver/ Executor classpath.

For Spark local mode (master="local[n]"): additions to classpath can be exported in the SPARK_CLASSPATH variable.

export aSPARK_PROJ_HOME="/Download/Location/aSpark"
export SPARK_CLASSPATH="${aSPARK_PROJ_HOME}/target/original-aSpark_1.0-2.1.0.jar:${aSPARK_PROJ_HOME}/target/scala-2.11/classes:${aSPARK_PROJ_HOME}/target/scala-2.11/jars/*"

For Spark Standalone mode: "spark.executor.extraClassPath" needs to be set in either spark-defaults.conf or added as a SparkConf to SparkContext (see (II)(a))

(c) Ensure SPARK_HOME, PYSPARK_PYTHON & PYTHONPATH variables are exported.

(d) Run kinesis_wordcount_asl

python3.5 ${SPARK_HOME}/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py SampleKinesisApplication myFirstStream http://localhost:4566/ us-east-1

Put records to the Localstack Kinesis stream:

aws --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"

Count of the words streamed (put) will show up on the kinesis_wordcount_asl console
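
As a sanity check on what to expect on the console, the per-batch counting can be mimicked in plain Python. This sketch assumes, as I recall from the example script, that each record is split on single spaces and the occurrences are summed per word; it doesn't involve Spark or Kinesis at all, and count_words is a made-up name.

```python
from collections import Counter

def count_words(records):
    """Count words across one batch of records, splitting each record on spaces."""
    counts = Counter()
    for record in records:
        counts.update(record.split(" "))
    return dict(counts)

batch = ["testdata abcd", "testdata"]
print(count_words(batch))
```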

Saturday, December 28, 2024

Debugging Spark Scala/ Java components

In continuation of the earlier post on debugging Pyspark, here we show how to debug the Spark Scala/ Java side components. Spark is a distributed processing environment and has Scala APIs for connecting from different languages like Python & Java. The high level Pyspark Architecture is shown here.

For debugging the Spark Scala/ Java components, which run within the JVM, it's easy to make use of Java Tooling Options for remote debugging from any compatible IDE such as Idea (Eclipse no longer supports Scala). A few points to remember:

Multiple JVMs in Spark: Spark is a distributed application and involves several components like the Master/ Driver, Slave/ Worker, Executor. In a real world, truly distributed setting, each of the components runs in its own separate JVM on separate physical machines. So be clear about the component that you want to debug & set up the Tooling options accordingly, targeting the specific JVM instance.

Two-way connectivity between IDE & JVM: There must also be two-way network connectivity between the IDE (debugger) & the running JVM instance.

Debugging Locally: Debugging is mostly a dev stage activity & done locally. So it may be better to debug on a Spark cluster running locally. This could be either on a Spark Standalone cluster or a Spark instance run locally (master=local[n]/ local[*]).

Steps:

Environment: Ubuntu-20.04 having Java-8, Spark/Pyspark (ver 2.1.0), Python3.5, IntelliJ IDEA (ver 2024.3), Maven3.6

(I) Idea Remote JVM Debugger
In Idea > Run/ Debug Config > Edit > Remote JVM Debug.

Start Debugger in Listen to Remote JVM Mode
Enable Auto Restart

(II)(a) Debug Spark Standalone cluster
Key features of the Spark Standalone cluster are:

Separate JVMs for Master, Slave/ Worker, Executor
All of them can be run from a single dev box, provided enough resources (Mem, CPU) are available
Scripts inside SPARK_HOME/sbin folder like start-master.sh, start-slave.sh (start-worker.sh), etc are used to start these Spark services

In order to debug, let's say, an Executor, a Spark Standalone cluster could be started off with 1 Master, 1 Worker, 1 Executor.

   # Start Master (Check http://localhost:8080/ to get Master URL/ PORT)
   ./sbin/start-master.sh

   # Start Slave/ Worker
   ./sbin/start-slave.sh spark://MASTER_URL:<MASTER_PORT>

   # Add the JVM tooling options as spark.executor.extraJavaOptions in spark-defaults.conf
   spark.executor.extraJavaOptions -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=n

   # The value could instead be passed as a conf to SparkContext in the Python script
   # (here with suspend=y, which pauses the Executor JVM at startup until the debugger resumes it):
   from pyspark import SparkContext
   from pyspark.conf import SparkConf
   confVals = SparkConf()
   confVals.set("spark.executor.extraJavaOptions","-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=y")
   sc = SparkContext(master="spark://localhost:7077",appName="PythonStreamingStatefulNetworkWordCount1",conf=confVals)

(II)(b) Debug locally with master="local[n]"

In this case a local Spark cluster is spun up via scripts like spark-shell, spark-submit, etc. located inside the bin/ folder
The different components Master, Slave/ Worker, Executor all run within one JVM as threads, where the value n is the number of threads (set n=2)
Export JAVA_TOOL_OPTIONS beforehand in the terminal from which the Pyspark script will be run

export JAVA_TOOL_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,address=5005"
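
Since the jdwp option string is easy to get wrong by hand, a tiny helper can assemble it. This is just an illustrative sketch: jdwp_options is a made-up function and only covers the dt_socket fields used in this post.

```python
def _yn(flag):
    """Render a boolean the way the jdwp agent expects it."""
    return "y" if flag else "n"

def jdwp_options(host="localhost", port=5005, server=False, suspend=False):
    """Build the -agentlib:jdwp option string for a socket based debug session."""
    # server=n: the JVM dials out to host:port, where the IDE debugger is listening.
    # server=y: the JVM itself listens on the port and waits for the IDE to attach.
    address = str(port) if server else f"{host}:{port}"
    return (
        "-agentlib:jdwp=transport=dt_socket,"
        f"server={_yn(server)},address={address},suspend={_yn(suspend)}"
    )

print(jdwp_options())              # Executor connects to a listening debugger
print(jdwp_options(suspend=True))  # same, but the JVM pauses at startup
```

The first string matches the spark-defaults.conf value in (II)(a); the result could equally be exported as JAVA_TOOL_OPTIONS for the local mode.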

(III) Execute PySpark Python script
python3.5 ${SPARK_HOME}/examples/src/main/python/streaming/network_wordcount.py localhost 9999

This should start off Pyspark & connect the Executor JVM to the waiting Idea Remote debugger instance for debugging.

Thursday, December 26, 2024

Debugging Pyspark in Eclipse with PyDev

An earlier post shows how to run Pyspark (Spark 2.1.0) in Eclipse (ver 2024-06 (4.32)) using the PyDev (ver 12.1) plugin. The OS is Ubuntu-20.04 with Java-8, & an older version of Python3.5 compatible with PySpark (2.1.0).

While the Pyspark code runs fine within Eclipse, when trying to Debug an error is thrown:

"Pydev: Unexpected error setting up the debugger: Socket closed".

This is due to a higher Python requirement (>3.6) for the pydevd debugger module within PyDev. Details on the PyDev installations page clearly state that Python3.5 is compatible only with PyDev 9.3.0. So it's back to square one!

Install/ replace Pydev 12.1 with PyDev 9.3 in Eclipse

Uninstall Pydev 12.1 (Help > About > Installation details > Installed software > Uninstall PyDev plugin)

Also manually remove all Pydev folders from eclipse/plugins folder (com.python.pydev.* & org.python.pydev.*)

Download & Install PyDev 9.3 from Download location as per instructions

Unzip to eclipse/dropins folder

Restart eclipse & check (Help > About > Installation details > Installed software)

Test debugging Pyspark
Refer to the steps to Run Pyspark on PyDev in Eclipse, & ensure the PyDev Interpreter is python3.5, PYSPARK_PYTHON variable and PYTHONPATH are correctly setup.

Finally, right click on network_wordcount.py > Debug as > Python run
(Set up Debug Configurations > Arguments & provide program arguments, e.g. "localhost 9999", & any breakpoints in the python code to test).

Wednesday, December 25, 2024

Pyspark in Eclipse with PyDev

This post captures the steps to get Spark (ver 2.1) working within Eclipse (ver 2024-06 (4.32)) using the PyDev (ver 12.1) plugin. The OS is Ubuntu-20.04 with Java-8, Python 3.x & Maven 3.6.

(I) Compile Spark code

The Spark code is downloaded & compiled from a location "SPARK_HOME".

export SPARK_HOME="/SPARK/DOWNLOAD/LOCATION"

cd ${SPARK_HOME}

mvn install -DskipTests=true -Dcheckstyle.skip -o

(Issue: For a "Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.8.0:check" error, copy scalastyle-config.xml to the sub-project (next to pom.xml) having the error.)

(II) Compile Pyspark
(a) Install Pyspark dependencies

Install Pandoc

sudo apt-get install pandoc

Install a compatible older Pypandoc (ver 1.5)

pip3 install pypandoc==1.5

Install a compatible older Python 3.x (ver 3.5)

sudo add-apt-repository ppa:deadsnakes/ppa

sudo apt-get install python3.5

(b) Build Pyspark

cd ${SPARK_HOME}/python

export PYSPARK_PYTHON=python3.5

# Build - creates ${SPARK_HOME}/python/build
python3.5 setup.py build

# Dist - creates ${SPARK_HOME}/python/dist
python3.5 setup.py sdist

(c) Export PYTHONPATH

export PYTHONPATH=$PYTHONPATH:${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${SPARK_HOME}/python/pyspark/shell.py
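
If exporting PYTHONPATH is inconvenient, the same entries could in principle be added from within a script before importing pyspark. The helper names below are made up, the py4j file name is taken from the export above, and the shell.py entry is left out of this sketch.

```python
import os
import sys

def pyspark_path_entries(spark_home, py4j_zip="py4j-0.10.4-src.zip"):
    """Return the path entries Pyspark needs, relative to the given SPARK_HOME."""
    return [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", py4j_zip),
    ]

def add_pyspark_to_sys_path(spark_home):
    """Append the entries that exist on disk to sys.path; return the missing ones."""
    missing = []
    for entry in pyspark_path_entries(spark_home):
        if not os.path.exists(entry):
            missing.append(entry)
        elif entry not in sys.path:
            sys.path.append(entry)
    return missing
```

A non-empty result from add_pyspark_to_sys_path(os.environ["SPARK_HOME"]) points at a wrong SPARK_HOME or a different py4j version in python/lib.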

(III) Run Pyspark from console
    Pyspark setup is done & the standalone examples code should run. Ensure variables ${SPARK_HOME}, ${PYSPARK_PYTHON} & ${PYTHONPATH} are all correctly exported (steps (I), (II)(b) & (II)(c) above):

python3.5 ${SPARK_HOME}/python/build/lib/pyspark/examples/src/main/python/streaming/network_wordcount.py localhost 9999

(IV) Run Pyspark on PyDev in Eclipse

    (a) Eclipse with PyDev plugin installed:
    Set-up tested on Eclipse (ver 2024-06 (4.32.0)) and PyDev plugin (ver 12.1x).

    (b) Import the spark project in Eclipse
    There would be compilation errors due to missing Spark Scala classes.

    (c) Add Target jars for Spark Scala classes
    Eclipse no longer has support for Scala so the corresponding Spark Scala classes are missing. A work around is to add the Scala target jars compiled using mvn (in step (I) above) manually to:

spark-example > Properties > Java Build Path > Libraries

(d) Add PyDev Interpreter for Python3.5
Go to: spark-example > Properties > PyDev - Interpreter/ Grammar > Click to configure an Interpreter not listed > Open Interpreter Preferences Page > New > Choose from List:

& Select /usr/bin/python3.5

On the same page, under the Environment tab add a variable named "PYSPARK_PYTHON" having value "python3.5"

Eclipse - Pydev Interpreter Python3.5 variable

(e) Set up PYTHONPATH for PyDev

spark-example > Properties > PyDev - PYTHONPATH

Under String Substitution Variables add a variable with name "SPARK_HOME" & value "/SPARK/DOWNLOAD/LOCATION" (same location added in Step (I)).

Under External Libraries, Choose Add based on variable, add 3 entries:

${SPARK_HOME}/python/

${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip

With that Pyspark should be properly set-up within PyDev.

(f) Run Pyspark from Eclipse

Right click on network_wordcount.py > Run as > Python run
(You can further change Run Configurations > Arguments & provide program arguments, e.g. "localhost 9999")

Monday, August 19, 2024

Pygradle for Python-3

Gradle, the build workhorse from the Java ecosystem, extends its support to Python through Pygradle. A recent attempt to build a Python-3.x project using Pygradle, though, didn't work as expected.

The gap between the supported Python-2.x and Python-3.x is hard to bridge, with many issues like:

Need for a specific, old version of Java (ver.8), Gradle (ver. 5.0), etc
Dependencies on old versions of Python modules without backwards compatibility

Hard to figure out which exact version will work
A rule of thumb is to pick the highest version dependency module around some cut-off year like 2018/19, post which they don't seem to build

Downloading of the correct dependencies & creating ivy files

Includes identifying the right version, name, dependencies-within-dependencies (that no longer work on Python-3.x), etc.

Using a local file system based repo to download & build modules & ivy files

With some effort though, I have been able to complete a successful build with Python-3.8 on Ubuntu-20.04 with Java-8 & Gradle-5.0. More details are available on the pygradle_python3_example repo. Hope this helps!