
Monday, January 6, 2025

Spark API Categorization

A way to categorize Spark API features:

  • Flow of data is generally across the category swim lanes: from creation of a new Spark Context, to reading data using I/O, to Filter/ Map/ Transform, and on to Reduce/ Agg & other Actions.
  • Processing is lazy up to the Transformation stage.
  • Steps only get executed once an Action is invoked (see the sketch below).
  • Post Actions (Reduce, Collect, etc.) there could again be I/O, hence the reverse flow from Action back to I/O.
  • Partitioning is a cross-cutting concern across all layers: I/O, Transformations & Actions could each run across all or only a few Partitions.
  • forEach on the Stream could be at either the Transform or the Action level.

The diagram is based on code within various Spark test suites.
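
To make the lazy-transformation vs. Action behaviour above concrete, here is a minimal sketch using the Spark Java API (Spark 2.x; class, app & data names are illustrative and not taken from the test suites behind the diagram):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LazyVsAction {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("LazyVsAction").setMaster("local[2]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Creation + Transformations: filter & map are only recorded, nothing executes yet
                JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
                JavaRDD<Integer> doubledEvens = nums.filter(n -> n % 2 == 0).map(n -> n * 2);

                // Action: reduce triggers execution of the whole lineage across the partitions
                int sum = doubledEvens.reduce(Integer::sum);
                System.out.println("sum = " + sum);   // prints 12
            }
        }
    }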

Thursday, January 2, 2025

Mocked Kinesis (Localstack) with PySpark Streaming

Continuing with the same PySpark (ver 2.1.0, Python3.5, etc.) setup explained in an earlier post. In order to connect to the mocked Kinesis stream on Localstack from PySpark, use the kinesis_wordcount_asl.py script located in the Spark external/ (connector/) folder.

(a) Update value of master in kinesis_wordcount_asl.py

Update the value of master (local[n], spark://localhost:7077, etc.) in the SparkContext in kinesis_wordcount_asl.py:
    sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl",master="local[2]")

(b) Add aSpark compiled jars to Spark Driver/ Executor Classpath

As explained in step (III) of an earlier post, to work with Localstack a few changes were made to KinesisReceiver.scala's onStart() to explicitly set the endpoint on the Kinesis, DynamoDB & CloudWatch clients. Accordingly, the compiled aSpark jars with the modifications need to be added to the Spark Driver/ Executor classpath.

    export aSPARK_PROJ_HOME="/Download/Location/aSpark"
    export SPARK_CLASSPATH="${aSPARK_PROJ_HOME}/target/original-aSpark_1.0-2.1.0.jar:${aSPARK_PROJ_HOME}/target/scala-2.11/classes:${aSPARK_PROJ_HOME}/target/scala-2.11/jars/*"

  •  For Spark Standalone mode: "spark.executor.extraClassPath" needs to be set in either spark-defaults.conf or added as a SparkConf to SparkContext (see (II)(a))
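
For instance, a single line along these lines in ${SPARK_HOME}/conf/spark-defaults.conf should do (jar path as per the export above; adjust to the actual download location):

    spark.executor.extraClassPath /Download/Location/aSpark/target/original-aSpark_1.0-2.1.0.jar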

(c) Ensure SPARK_HOME, PYSPARK_PYTHON & PYTHONPATH variables are exported.

(d) Run kinesis_wordcount_asl

    python3.5 ${SPARK_HOME}/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py SampleKinesisApplication myFirstStream http://localhost:4566/ us-east-1

    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"

  • Count of the words streamed (put) will show up on the kinesis_wordcount_asl console
 

Wednesday, January 1, 2025

Spark Streaming with Kinesis mocked on Localstack

In this post we get a Spark Streaming application working against an AWS Kinesis stream, with a mocked version of Kinesis running locally on Localstack. Earlier posts explained how to get Localstack running and various AWS services up on it. The client connections to the AWS services (on Localstack) are made using the AWS CLI and AWS Java-Sdk v1.

Environment: This set-up continues on Ubuntu 20.04 with Java-8, Maven-3.6x, Docker-24.0x, Python3.5, PySpark/ Spark-2.1.0, Localstack-3.8.1 & AWS Java-Sdk-v1 (ver. 1.12.778).

Once the Localstack installation is done, steps to follow are:

(I) Start Localstack
    # Start locally
    localstack start

    That should get Localstack running on: http://localhost:4566

(II) Check Kinesis services from CLI on Localstack

    # List Streams
    aws --endpoint-url=http://localhost:4566 kinesis list-streams

    # Create Stream
    aws --endpoint-url=http://localhost:4566 kinesis create-stream --stream-name myFirstStream --shard-count 1

    # List Streams
    aws --endpoint-url=http://localhost:4566 kinesis list-streams

    # describe-stream-summary
    aws --endpoint-url=http://localhost:4566 kinesis describe-stream-summary --stream-name myFirstStream

    # Put Record
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"

(III) Connect to Kinesis from Spark Streaming

    # Build
    mvn install -DskipTests=true -Dcheckstyle.skip

    # Run JavaKinesisWordCountASL with Localstack

  • JavaKinesisWordCountASL SampleKinesisApplication myFirstStream http://localhost:4566/
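
The core of that example against the Spark 2.1.0 kinesis-asl API looks roughly like the sketch below (with the Localstack endpoint & argument values from the run above; this is not the exact example source):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
    import org.apache.spark.SparkConf;
    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kinesis.KinesisUtils;

    public class KinesisOnLocalstackSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("SampleKinesisApplication").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));

            // Receiver reading from the Localstack Kinesis stream created in step (II)
            JavaReceiverInputDStream<byte[]> records = KinesisUtils.createStream(
                    jssc, "SampleKinesisApplication", "myFirstStream",
                    "http://localhost:4566/", "us-east-1",
                    InitialPositionInStream.LATEST, new Duration(2000),
                    StorageLevel.MEMORY_AND_DISK_2());

            // Split each record into words & print the per-batch counts
            JavaDStream<String> words = records.flatMap(bytes ->
                    Arrays.asList(new String(bytes, StandardCharsets.UTF_8).split(" ")).iterator());
            words.countByValue().print();

            jssc.start();
            jssc.awaitTermination();
        }
    }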

(IV) Add Data to Localstack Kinesis & View Counts on Console
    a) Put Record from cli
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"

    b) Alternatively, put records from the Java Kinesis application
    Download, build & run AmazonKinesisRecordProducerSample.java
    
    c) Now check the output console of JavaKinesisWordCountASL run in step (III) above. Counts of the words streamed from Localstack Kinesis will be displayed on the console.

Saturday, December 28, 2024

Debugging Spark Scala/ Java components

In continuation of the earlier post on debugging Pyspark, here we show how to debug the Spark Scala/ Java side. Spark is a distributed processing environment and has Scala APIs which different languages like Python & Java connect to. The high-level Pyspark architecture is shown here.

For debugging the Spark Scala/ Java components, since these run within the JVM, it's easy to make use of the Java tooling options for remote debugging from any compatible IDE such as Idea (Eclipse no longer supports Scala). A few points to remember:

  • Multiple JVMs in Spark: Since Spark is a distributed application, it involves several components like the Master/ Driver, Slave/ Worker & Executor. In a real-world, truly distributed setting, each of these components runs in its own separate JVM on separate physical machines. So be clear about exactly which component you want to debug & set up the tooling options accordingly, targeting that specific JVM instance.

  • Two-way connectivity between IDE & JVM: There should also be two-way network connectivity between the IDE (debugger) & the running JVM instance.

  • Debugging Locally: Debugging is mostly a dev-stage activity & done locally. So it may be better to debug on a Spark cluster running locally: either a Spark Standalone cluster or Spark run in local mode (master=local[n]/ local[*]).

Steps:

Environment: Ubuntu-20.04 with Java-8, Spark/ Pyspark (ver 2.1.0), Python3.5, Idea-Intelli (ver 2024.3), Maven-3.6.

(I) Idea Remote JVM Debugger
In Idea > Run/ Debug Config > Edit > Remote JVM Debug.

  • Start Debugger in Listen to Remote JVM Mode
  • Enable Auto Restart

(II)(a) Debug Spark Standalone cluster
Key features of the Spark Standalone cluster are:

  • Separate JVMs for Master, Slave/ Worker, Executor
  • All could run on a single dev box, provided enough resources (Mem, CPU) are available
  • Scripts inside the SPARK_HOME/sbin folder like start-master.sh, start-slave.sh (start-worker.sh), etc. to start the services

In order to debug, let's say, an Executor, a Spark Standalone cluster could be started off with 1 Master, 1 Worker & 1 Executor.

    # Start Master (Check http://localhost:8080/ to get Master URL/ PORT)
    ./sbin/start-master.sh 

    # Start Slave/ Worker
    ./sbin/start-slave.sh spark://MASTER_URL:<MASTER_PORT>

    # Add JVM tooling to spark.executor.extraJavaOptions in spark-defaults.conf
    spark.executor.extraJavaOptions  -agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=n

    # The value could instead be passed as a conf to SparkContext in the Python script:
    from pyspark import SparkContext
    from pyspark.conf import SparkConf

    confVals = SparkConf()
    confVals.set("spark.executor.extraJavaOptions", "-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=y")
    sc = SparkContext(master="spark://localhost:7077", appName="PythonStreamingStatefulNetworkWordCount1", conf=confVals)

(II)(b) Debug locally with master="local[n]"

  • In this case a local Spark cluster is spun up via scripts like spark-shell, spark-submit, etc. located inside the bin/ folder
  • The different components (Master, Worker, Executor) all run as threads within one JVM, where n is the number of threads (set n=2)
  • Export JAVA_TOOL_OPTIONS beforehand in the terminal from which the Pyspark script will be run:

        export JAVA_TOOL_OPTIONS="-agentlib:jdwp=transport=dt_socket,server=n,suspend=n,address=5005"

(III) Execute PySpark Python script
    python3.5 ${SPARK_HOME}/examples/src/main/python/streaming/network_wordcount.py localhost 9999

This should start off Pyspark & connect the Executor JVM to the waiting Idea Remote debugger instance for debugging.

Wednesday, December 25, 2024

Pyspark in Eclipse with PyDev

This post captures the steps to get Spark (ver 2.1) working within Eclipse (ver 2024-06 (4.32)) using the PyDev (ver 12.1) plugin. The OS is Ubuntu-20.04 with Java-8, Python 3.x & Maven 3.6.

(I) Compile Spark code

The Spark code is downloaded & compiled from a location "SPARK_HOME".

    export SPARK_HOME="/SPARK/DOWNLOAD/LOCATION"

    cd ${SPARK_HOME}

    mvn install -DskipTests=true -Dcheckstyle.skip -o

(Issue: For a "Failed to execute goal org.scalastyle:scalastyle-maven-plugin:0.8.0:check":
Copy scalastyle-config.xml to the sub-project (next to pom.xml) having the error.

(II) Compile Pyspark
    (a) Install Pyspark dependencies

  • Install Pandoc

        sudo apt-get install pandoc

  • Install a compatible older Pypandoc (ver 1.5)

        pip3 install pypandoc==1.5

  • Install Python3.5 (from the deadsnakes PPA)

        sudo add-apt-repository ppa:deadsnakes/ppa

        sudo apt-get install python3.5
    
    (b) Build Pyspark

        cd ${SPARK_HOME}/python

        export PYSPARK_PYTHON=python3.5

        # Build - creates ${SPARK_HOME}/python/build
        python3.5 setup.py build

        # Dist - creates ${SPARK_HOME}/python/dist
        python3.5 setup.py sdist

    (c) export PYTHON_PATH

    export PYTHONPATH=$PYTHONPATH:${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${SPARK_HOME}/python/pyspark/shell.py;
    
(III) Run Pyspark from console
    Pyspark setup is done & the standalone example code should run. Ensure the variables ${SPARK_HOME}, ${PYSPARK_PYTHON} & ${PYTHONPATH} are all correctly exported (steps (I), (II)(b) & (II)(c) above):

    python3.5 ${SPARK_HOME}/python/build/lib/pyspark/examples/src/main/python/streaming/network_wordcount.py localhost 9999

(IV) Run Pyspark on PyDev in Eclipse

    (a) Eclipse with PyDev plugin installed:
    Set-up tested on Eclipse (ver 2024-06 (4.32.0)) and PyDev plugin (ver 12.1x).
 
    (b) Import the spark project in Eclipse
    There would be compilation errors due to missing Spark Scala classes.

    (c) Add Target jars for Spark Scala classes
    Eclipse no longer has support for Scala, so the corresponding Spark Scala classes are missing. A workaround is to add the Scala target jars compiled using mvn (in step (I) above) manually to:

        spark-example > Properties > Java Build Path > Libraries
   

Eclipse Build Path Add Libraries

    (d) Add PyDev Interpreter for Python3.5
    Go to: spark-example > Properties > PyDev - Interpreter/ Grammar > Click to configure an Interpreter not listed > Open Interpreter Preferences Page > New > Choose from List:

    & Select /usr/bin/python3.5

 Eclipse - Pydev Interpreter Python3.5

 On the same page, under the Environment tab, add a variable named "PYSPARK_PYTHON" having the value "python3.5".

Eclipse - Pydev Interpreter Python3.5 variable

    (e) Set up PYTHONPATH for PyDev

    spark-example > Properties > PyDev - PYTHONPATH

  • Under String Substitution Variables add a variable with name "SPARK_HOME" & value "/SPARK/DOWNLOAD/LOCATION" (same location added in Step (I)). 
 
Eclipse - Pydev PYTHONPATH variable

  • Under External Libraries, choose "Add based on variable" & add these entries:

                 ${SPARK_HOME}/python/

                 ${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip


   With that Pyspark should be properly set-up within PyDev.

    (f) Run Pyspark from Eclipse

    Right click on network_wordcount.py > Run as > Python run
    (You can further change Run Configurations > Arguments & provide program arguments, e.g. "localhost 9999")


Thursday, November 28, 2024

Working with Moto & Lambci Lambda Docker Images

Next up on Mock for clouds is Moto. Moto is primarily for running tests within the Python ecosystem.

Moto does offer a standalone server mode for other languages. The general sense was that the standalone Moto server would offer the AWS services accessible from the CLI & non-Python SDKs. Gave Moto a shot with the same AWS services tried earlier with Localstack.

(I) Set-up

While installing Moto, ran into a couple of dependency conflicts across moto, boto3, botocore, requests, s3transfer & in turn with the installed awscli. With some effort, reached a sort of dynamic equilibrium with (installed via pip, pinned as shown after the list):

  • awscli 1.36.11
  • boto3 1.35.63
  • botocore 1.35.70
  • moto 5.0.21
  • requests 2.32.2
  • s3transfer 0.10.4
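
One way to pin that combination in a single install (versions as listed above; exact resolution may still vary by environment):

    pip3 install awscli==1.36.11 boto3==1.35.63 botocore==1.35.70 moto==5.0.21 requests==2.32.2 s3transfer==0.10.4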


(II) Start Moto Server

    # Start Moto
    moto_server -p3000

    # Start Moto as Docker (Sticking to this option)
    docker run --rm -p 5000:5000 --name moto motoserver/moto:latest

(III) Invoke services on Moto

    (a) S3
    # Create bucket
    aws --endpoint-url=http://localhost:5000 s3 mb s3://test-buck

    # Copy item to bucket
    aws --endpoint-url=http://localhost:5000 s3 cp a1.txt s3://test-buck

    # List bucket
    aws --endpoint-url=http://localhost:5000 s3 ls s3://test-buck

--
    (b) SQS
    # Create queue
    aws --endpoint-url=http://localhost:5000 sqs create-queue --queue-name test-q

    # List queues
    aws --endpoint-url=http://localhost:5000 sqs list-queues

    # Get queue attribute
    aws --endpoint-url=http://localhost:5000 sqs get-queue-attributes --queue-url http://localhost:5000/123456789012/test-q --attribute-names All

--
    (c) IAM
    ## Issue: Moto does a basic check of the user role & gives an AccessDeniedException when calling the Lambda CreateFunction operation.
    ## So a specific IAM role has to be created in Moto for the purpose (https://github.com/getmoto/moto/issues/3944#issuecomment-845144036).

    aws iam --region=us-east-1 --endpoint-url=http://localhost:5000 create-role --role-name "lambda-test-role" --assume-role-policy-document "some policy" --path "/lambda-test/"

--
    (d) Lambda
    # Create Java function

    aws --endpoint-url=http://localhost:5000 lambda create-function --function-name test-j-div --zip-file fileb://original-java-basic-1.0-SNAPSHOT.jar --handler example.HandlerDivide::handleRequest --runtime java8.al2 --role arn:aws:iam::123456789012:role/lambda-test/lambda-test-role

    # List functions
    aws --endpoint-url=http://localhost:5000 lambda list-functions

    # Invoke function (Fails!)
    aws --endpoint-url=http://localhost:5000 lambda invoke --function-name test-j-div --payload '[235241,17]' outputJ.txt

    The invoke function fails with the message:
    "WARNING - Unable to parse Docker API response. Defaulting to 'host.docker.internal'
    <class 'json.decoder.JSONDecodeError'>::Expecting value: line 1 column 1 (char 0)
    error running docker: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))".
    
    Retried this from the AWS Java-SDK & with other Node.js & Python functions, but nothing worked. While this remains unsolved for now, the Lambci docker option is checked out next.

(IV) Invoke services on Lambci Lambda Docker Images:

    The Moto Lambda docs also mention its dependent docker images from lambci/lambda & mlupin/docker-lambda (for newer runtimes). Started off with the slightly older java8.al2 docker image from lambci/lambda.

    # Download lambci/lambda:java8.al2
    docker pull lambci/lambda:java8.al2
    
    # Run lambci/lambda:java8.al2.   
    ## Ensure to run from the location which has the unzipped (unjarred) Java code
    ## Here it's run from a folder called data_dir_java which has the unzipped (unjarred) class file folders: com/, example/, META-INF/, net/ 

    docker run -e DOCKER_LAMBDA_STAY_OPEN=1 -p 9001:9001 -v "$PWD":/var/task:ro,delegated --name lambcijava8al2 lambci/lambda:java8.al2 example.HandlerDivide::handleRequest

    # Invoke Lambda
    aws --endpoint-url=http://localhost:9001 lambda invoke --function-name test-j-div --payload '[235241,17]' outputJ.txt

    This works!
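
    For reference, the handler wired in above (example.HandlerDivide::handleRequest from the java-basic sample jar) would be along these lines. This is a hypothetical sketch, not the exact sample source:

    package example;

    import java.util.List;

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    // Takes a payload like [235241, 17] & returns the integer division result
    public class HandlerDivide implements RequestHandler<List<Integer>, Integer> {
        @Override
        public Integer handleRequest(List<Integer> event, Context context) {
            if (event == null || event.size() != 2 || event.get(1) == 0) {
                throw new IllegalArgumentException("expected payload [dividend, divisor] with a non-zero divisor");
            }
            return event.get(0) / event.get(1);
        }
    }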
 

Tuesday, November 26, 2024

AWS Lambda on Localstack using Java-SdK-v1

Continuing with Localstack, next is a closer look into the code to deploy and execute AWS Lambda code on Localstack from AWS Java-Sdk-v1. The localstack-lambda-java-sdk-v1 code uses the same structure used in localstack-aws-sdk-examples & fills in for the missing AWS Lambda bit.

The LambdaService class has 3 primary methods - listFunctions(), createFunction() & invokeFunction(). The static AWSLambda client is set up with mock credentials & points to the Localstack endpoint.
 
The main() method first creates the function (createFunction()), if it does not exist.

  • It builds a CreateFunctionRequest object with the handler, runtime, role, etc. specified
  • It also reads the jar file of the Java executable from the resources folder into a FunctionCode object & adds it to the CreateFunctionRequest
  • Next, a call is made to the AWSLambda client's createFunction() with the CreateFunctionRequest, which hits the running Localstack instance (Localstack set-up explained earlier)


If all goes well, control returns to main(), which invokes listFunctions() to show details of the created Lambda function (& all other existing functions).

Finally, there is a call from main() to the invokeFunction() method.

  • It invokes the recently created function with an InvokeRequest object filled with some test values as the payload.
  • The response from the invoked function is an InvokeResult object whose payload contains the result of the lambda function's computation (sketched below).
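
A minimal sketch of that create/ list/ invoke flow with AWS Java-Sdk-v1 against Localstack (the endpoint, credentials, role ARN & jar path below are illustrative; the actual repo code differs in structure):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.lambda.AWSLambda;
    import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
    import com.amazonaws.services.lambda.model.CreateFunctionRequest;
    import com.amazonaws.services.lambda.model.FunctionCode;
    import com.amazonaws.services.lambda.model.InvokeRequest;
    import com.amazonaws.services.lambda.model.InvokeResult;
    import com.amazonaws.services.lambda.model.ListFunctionsRequest;

    public class LambdaOnLocalstackSketch {

        // Static AWSLambda client with mock credentials, pointed at the Localstack endpoint
        private static final AWSLambda lambda = AWSLambdaClientBuilder.standard()
                .withEndpointConfiguration(new EndpointConfiguration("http://localhost:4566", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("mock", "mock")))
                .build();

        public static void main(String[] args) throws IOException {
            // createFunction(): jar from resources read into a FunctionCode, sent with the CreateFunctionRequest
            ByteBuffer jar = ByteBuffer.wrap(
                    Files.readAllBytes(Paths.get("src/main/resources/original-java-basic-1.0-SNAPSHOT.jar")));
            lambda.createFunction(new CreateFunctionRequest()
                    .withFunctionName("test-j-div")
                    .withRuntime("java8.al2")
                    .withHandler("example.HandlerDivide::handleRequest")
                    .withRole("arn:aws:iam::000000000000:role/lambda-test-role")
                    .withCode(new FunctionCode().withZipFile(jar)));

            // listFunctions(): details of the created function (& any others)
            lambda.listFunctions(new ListFunctionsRequest()).getFunctions()
                    .forEach(f -> System.out.println(f.getFunctionName() + " : " + f.getFunctionArn()));

            // invokeFunction(): payload with test values, result read off the InvokeResult's payload
            InvokeResult result = lambda.invoke(new InvokeRequest()
                    .withFunctionName("test-j-div")
                    .withPayload("[235241,17]"));
            System.out.println("Result: " + StandardCharsets.UTF_8.decode(result.getPayload()));
        }
    }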

Comments welcome, localstack-lambda-java-sdk-v1 is available to play around with!

 

Saturday, November 16, 2024

Mutable Argument Capture with Mockito

There are well-known scenarios like caching, pooling, etc. wherein object reuse is common. Testing these cases using a framework like Mockito could run into problems, especially if there's a need to verify the arguments sent by the Caller of a Service, where the Service is mocked.

ArgumentCaptor (Mockito) fails here because it keeps references to the argument objects, which due to reuse by the caller all end up holding only the last/ latest updated value.
The discussion here led to using a custom Answer (via doAnswer) as one possible way to solve the issue. The following (junit-3+, mockito-1.8+, commons-lang-2.5) code explains the details.

1. Service: 

public class Service {

    public void serve(MutableInt value) {
        System.out.println("Service.serve(): " + value);
    }

    // ...
}

2. Caller:

public class Caller {

    public void callService(Service service) {
        MutableInt value = new MutableInt();
        value.setValue(1);
        service.serve(value);

        value.setValue(2);
        service.serve(value);
    }

    // ...
}

3. Tests:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import junit.framework.TestCase;

import org.apache.commons.lang.mutable.MutableInt;
import org.mockito.ArgumentCaptor;
import org.mockito.Mock;
import org.mockito.MockitoAnnotations;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

import static org.mockito.Matchers.any;
import static org.mockito.Mockito.doAnswer;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;

public class MutableArgsTest extends TestCase {

    List<MutableInt> multiValuesWritten;

    @Mock
    Service service;

    protected void setUp() {
        // junit-3 style set-up: initialize @Mock fields & reset the captured values per test
        MockitoAnnotations.initMocks(this);
        multiValuesWritten = new ArrayList<MutableInt>();
    }

    /**
     * Failure with ArgumentCaptor
     */
    public void testMutableArgsWithArgCaptorFail() {
        Caller caller = new Caller();
        ArgumentCaptor<MutableInt> valueCaptor = ArgumentCaptor.forClass(MutableInt.class);

        caller.callService(service);
        verify(service, times(2)).serve(valueCaptor.capture());

        // AssertionFailedError: expected:<[1, 2]> but was:<[2, 2]>
        assertEquals(Arrays.asList(new MutableInt(1), new MutableInt(2)),
                valueCaptor.getAllValues());
    }

    /**
     * Success with Answer
     */
    public void testMutableArgsWithDoAnswer() {
        Caller caller = new Caller();
        doAnswer(new CaptureArgumentsWrittenAsMutableInt<Void>())
                .when(service).serve(any(MutableInt.class));

        caller.callService(service);
        verify(service, times(2)).serve(any(MutableInt.class));

        // Works!
        assertEquals(new MutableInt(1), multiValuesWritten.get(0));
        assertEquals(new MutableInt(2), multiValuesWritten.get(1));
    }

    /**
     * Captures arguments to the Service.serve() method:
     * - Multiple calls to serve() happen from the same caller
     * - Along with reuse of the MutableInt argument objects by the caller
     * - The argument value is copied to a new MutableInt object & that copy is captured
     * @param <Void>
     */
    public class CaptureArgumentsWrittenAsMutableInt<Void> implements Answer<Void> {
        public Void answer(InvocationOnMock invocation) {
            Object[] args = invocation.getArguments();
            multiValuesWritten.add(new MutableInt(args[0].toString()));
            return null;
        }
    }
}

Sunday, August 25, 2024

Java Versions & Features

Visual summary of Java Features added since Java 9. Feature clusters show the focus areas over the years. 

  • Initially (ver 9+) the focus was on adding some scripting-type features & stabilizing the big-ticket features added previously.
  • GC & Performance were in focus through the next several versions.
  • Patterns with Switch, InstanceOf, Type, etc. have come in since ver 11+ (see the sketch below).
  • From 14+ onwards, various performance-oriented direct host I/O & parallel computing features like Foreign Memory, the Vector API, Unix Sockets, etc. have made it in.
  • Some syntactic additions like Module import, the When clause, etc. are part of the more recent releases.
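
For instance, a small illustration of the pattern-matching features mentioned above (instanceof patterns finalized in Java 16, switch patterns & when guards in Java 21; the method & values are illustrative):

    // Illustrative only: instanceof pattern matching plus pattern matching for switch with a when guard
    static String describe(Object obj) {
        if (obj instanceof Integer i) {              // binds the narrowed value to 'i'
            return "an int: " + i;
        }
        return switch (obj) {
            case String s when s.isEmpty() -> "an empty string";
            case String s -> "a string of length " + s.length();
            case null, default -> "something else";
        };
    }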

References:

  • Diagram's datasheet
  • https://medium.com/@chandantechie/comprehensive-list-of-java-versions-with-key-features-and-upcoming-releases-54be35646cca
  • https://docs.oracle.com/en/java/javase/23/language/java-language-changes.html#GUID-6459681C-6881-45D8-B0DB-395D1BD6DB9B
  • https://en.wikipedia.org/wiki/Java_version_history
  • https://www.marcobehler.com/guides/a-guide-to-java-versions-and-features#_java_features_8_20
  • https://www.javatpoint.com/java-versions
  • https://howtodoinjava.com/series/java-versions-features/


Monday, August 19, 2024

Pygradle for Python-3

Gradle, the build workhorse from the Java ecosystem, extends its support to Python through Pygradle. A recent attempt to build a Python-3.x project using Pygradle, though, didn't work as expected.

The delta between the supported Python-2.x & Python-3.x is hard to reconcile, with many issues like:

  • Need for a specific, old version of Java (ver.8), Gradle (ver. 5.0), etc
  • Dependencies on old versions of Python modules without backwards compatibility
    • Hard to figure out which exact version will work
    • A rule of thumb is to pick the highest version dependency module around some cut-off year like 2018/19, post which they don't seem to build
  • Downloading of the correct dependencies & creating ivy files
    • Includes identifying the right version, name, dependencies-within-dependencies (that no longer work on Python-3.x), etc.
  • Using a local file system based repo to download & build modules & ivy files

With some effort though, a successful build was completed with Python-3.8 on Ubuntu-20.04 with Java-8 & Gradle-5.0. More details are available in the pygradle_python3_example repo. Hope this helps!