
Thursday, January 2, 2025

Mocked Kinesis (Localstack) with PySpark Streaming

This post continues with the same PySpark (ver 2.1.0, Python 3.5, etc.) set-up explained in an earlier post. To connect to the mocked Kinesis stream on Localstack from PySpark, use the kinesis_wordcount_asl.py script located in Spark's external/ (connector/) folder.

(a) Update value of master in kinesis_wordcount_asl.py

Update the value of master (local[n], spark://localhost:7077, etc.) in the SparkContext in kinesis_wordcount_asl.py:
    sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl",master="local[2]")

(b) Add the compiled aSpark jars to the Spark Driver/Executor classpath

As explained in step (III) of an earlier post, a few changes were made to KinesisReceiver.scala's onStart() to work with Localstack, explicitly setting the endpoint on the Kinesis, DynamoDB & CloudWatch clients (a sketch of the idea appears at the end of this section). Accordingly, the compiled aSpark jars with these modifications need to be added to the Spark Driver/Executor classpath.

     export aSPARK_PROJ_HOME="/Downlaod/Location/aSpark"
    export SPARK_CLASSPATH="${aSPARK_PROJ_HOME}/target/original-aSpark_1.0-2.1.0.jar:${aSPARK_PROJ_HOME}/target/scala-2.11/classes:${aSPARK_PROJ_HOME}/target/scala-2.11/jars/*"

  •  For Spark Standalone mode: "spark.executor.extraClassPath" needs to be set either in spark-defaults.conf or added as a SparkConf on the SparkContext (see (II)(a))
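
For context, the KinesisReceiver change mentioned above boils down to pointing each AWS client at the Localstack endpoint rather than the real AWS one. The actual modification is in Scala inside onStart(); the snippet below is only a rough illustration of the same idea using the AWS Java SDK v1 builder, with the Localstack endpoint, region and the usual "test" dummy credentials as assumed values:

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;

    public class LocalstackKinesisClient {
        public static AmazonKinesis build() {
            // Assumed values: Localstack endpoint, region & dummy credentials.
            // The modified KinesisReceiver sets the endpoint on the DynamoDB
            // & CloudWatch clients in the same way.
            return AmazonKinesisClientBuilder.standard()
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials("test", "test")))
                    .withEndpointConfiguration(
                            new EndpointConfiguration("http://localhost:4566", "us-east-1"))
                    .build();
        }
    }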

(c) Ensure SPARK_HOME, PYSPARK_PYTHON & PYTHONPATH variables are exported.

(d) Run kinesis_wordcount_asl

    python3.5 ${SPARK_HOME}/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py SampleKinesisApplication myFirstStream http://localhost:4566/ us-east-1

    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"

  • Count of the words streamed (put) will show up on the kinesis_wordcount_asl console
 

Wednesday, January 1, 2025

Spark Streaming with Kinesis mocked on Localstack

In this post we get a Spark streaming application working with an AWS Kinesis stream, using a mocked version of Kinesis running locally on Localstack. Earlier posts explained how to get Localstack running and how to bring up various AWS services on it. The client connections to the AWS services (Localstack) are made using the AWS CLI and AWS Java-Sdk v1.

Environment: This set-up continues on Ubuntu 20.04, with Java 8, Maven 3.6.x, Docker 24.0.x, Python 3.5, PySpark/Spark 2.1.0, Localstack 3.8.1 & AWS Java-Sdk-v1 (ver. 1.12.778).

Once the Localstack installation is done, steps to follow are:

(I) Start Localstack
    # Start locally
    localstack start

    That should get Localstack running on: http://localhost:4566

(II) Check Kinesis services from CLI on Localstack

    # List Streams
    aws --endpoint-url=http://localhost:4566 kinesis list-streams

    # Create Stream
    aws --endpoint-url=http://localhost:4566 kinesis create-stream --stream-name myFirstStream --shard-count 1

    # List Streams
    aws --endpoint-url=http://localhost:4566 kinesis list-streams

    # describe-stream-summary
    aws --endpoint-url=http://localhost:4566 kinesis describe-stream-summary --stream-name myFirstStream

    # Put Record
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"

(III) Connect to Kinesis from Spark Streaming

    # Build
    mvn install -DskipTests=true -Dcheckstyle.skip

    # Run JavaKinesisWordCountASL with Localstack

  • JavaKinesisWordCountASL SampleKinesisApplication myFirstStream http://localhost:4566/

(IV) Add Data to Localstack Kinesis & View Counts on Console
    a) Put Record from cli
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
    aws  --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"

    b) Alternatively, put records from a Java Kinesis application
    Download, build & run AmazonKinesisRecordProducerSample.java (a minimal put-record sketch is shown at the end of this section)
    
    c) Now check the output console of JavaKinesisWordCountASL run in step (III) above. Counts of the words streamed from Localstack Kinesis will be displayed on the console.
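
    For reference, a minimal sketch of what such a producer does: build a Kinesis client pointed at Localstack and push a few records via PutRecordRequest. The stream name, endpoint & credentials below are just the ones used in the CLI examples above; this is not the actual AmazonKinesisRecordProducerSample code.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
    import com.amazonaws.services.kinesis.model.PutRecordRequest;

    public class LocalstackKinesisProducer {
        public static void main(String[] args) {
            // Kinesis client pointed at Localstack (assumed endpoint, region & dummy credentials)
            AmazonKinesis kinesis = AmazonKinesisClientBuilder.standard()
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials("test", "test")))
                    .withEndpointConfiguration(
                            new EndpointConfiguration("http://localhost:4566", "us-east-1"))
                    .build();

            // Put a couple of test records onto the stream created earlier
            String[] lines = {"testdata abcd", "testdata efgh"};
            for (String line : lines) {
                PutRecordRequest request = new PutRecordRequest()
                        .withStreamName("myFirstStream")
                        .withPartitionKey("123")
                        .withData(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
                kinesis.putRecord(request);
            }
        }
    }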

Thursday, November 28, 2024

Working with Moto & Lambci Lambda Docker Images

Next up on mocks for clouds is Moto. Moto is primarily for running tests within the Python ecosystem.

Moto does offer a standalone server mode for other languages. The general sense was that the standalone Moto server would offer the AWS services, accessible from the CLI & non-Python SDKs. Gave Moto a shot with the same AWS services tried with Localstack.

(I) Set-up

While installing Moto, ran into a couple of dependency conflicts across moto, boto3, botocore, requests, s3transfer & in turn with the installed awscli. With some effort, reached a sort of dynamic equilibrium with the following versions (installed via pip):

  • awscli      1.36.11
  • boto3       1.35.63
  • botocore    1.35.70
  • moto        5.0.21
  • requests    2.32.2
  • s3transfer  0.10.4


(II) Start Moto Server

    # Start Moto
    moto_server -p3000

    # Start Moto as Docker (Sticking to this option)
    docker run --rm -p 5000:5000 --name moto motoserver/moto:latest

(III) Invoke services on Moto

    (a) S3
    # Create bucket
    aws --endpoint-url=http://localhost:5000 s3 mb s3://test-buck

    # Copy item to bucket
    aws --endpoint-url=http://localhost:5000 s3 cp a1.txt s3://test-buck

    # List bucket
    aws --endpoint-url=http://localhost:5000 s3 ls s3://test-buck
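
    Since the point of the standalone server is access from the CLI & non-Python SDKs, the same S3 calls can also be attempted from the AWS Java SDK v1. A minimal sketch follows; the endpoint, region, dummy credentials & path-style access flag are assumptions, not verified beyond the CLI runs above.

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class MotoS3Check {
        public static void main(String[] args) {
            // S3 client pointed at the standalone Moto server (port 5000, as in the docker run above)
            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials("test", "test")))
                    .withEndpointConfiguration(
                            new EndpointConfiguration("http://localhost:5000", "us-east-1"))
                    .withPathStyleAccessEnabled(true)   // path-style URLs tend to behave better with local mocks
                    .build();

            // Mirror the CLI calls: create bucket, put an object, list the bucket
            s3.createBucket("test-buck");
            s3.putObject("test-buck", "a1.txt", "hello moto");
            s3.listObjectsV2("test-buck").getObjectSummaries()
                    .forEach(o -> System.out.println(o.getKey()));
        }
    }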

--
    (b) SQS
    # Create queue
    aws --endpoint-url=http://localhost:5000 sqs create-queue --queue-name test-q

    # List queues
    aws --endpoint-url=http://localhost:5000 sqs list-queues

    # Get queue attribute
    aws --endpoint-url=http://localhost:5000 sqs get-queue-attributes --queue-url http://localhost:5000/123456789012/test-q --attribute-names All

--
    (c) IAM
    ## Issue: Moto does a basic check of user role & gives an AccessDeniedException when calling Lambda CreateFunction operation
    ## So have to create a specific IAM role (https://github.com/getmoto/moto/issues/3944#issuecomment-845144036) in Moto for the purpose.

    aws iam --region=us-east-1 --endpoint-url=http://localhost:5000 create-role --role-name "lambda-test-role" --assume-role-policy-document "some policy" --path "/lambda-test/"

--
    (d) Lambda
    # Create Java function

    aws --endpoint-url=http://localhost:5000 lambda create-function --function-name test-j-div --zip-file fileb://original-java-basic-1.0-SNAPSHOT.jar --handler example.HandlerDivide::handleRequest --runtime java8.al2 --role arn:aws:iam::123456789012:role/lambda-test/lambda-test-role

    # List functions
    aws --endpoint-url=http://localhost:5000 lambda list-functions

    # Invoke function (Fails!)
    aws --endpoint-url=http://localhost:5000 lambda invoke --function-name test-j-div --payload '[235241,17]' outputJ.txt

    The invoke function fails with the message:
    "WARNING - Unable to parse Docker API response. Defaulting to 'host.docker.internal'
    <class 'json.decoder.JSONDecodeError'>::Expecting value: line 1 column 1 (char 0)
    error running docker: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))".
    
    Retried this from the AWS Java-SDK & for other Node.js & Python functions, but nothing worked. While this remains unsolved for now, check out the Lambci docker option next.

(IV) Invoke services on Lambci Lambda Docker Images:

    The Moto Lambda docs also mention its dependent docker images from lambci/lambda & mlupin/docker-lambda (for the newer ones). Started off with a slightly older java8.al2 docker image from lambci/lambda.

    # Download lambci/lambda:java8.al2
    docker pull lambci/lambda:java8.al2
    
    # Run lambci/lambda:java8.al2
    ## Ensure to run from the location which has the unzipped (unjarred) Java code
    ## Here it's run from a folder called data_dir_java which has the unzipped (unjarred) class file folders: com/, example/, META-INF/, net/ 

    docker run -e DOCKER_LAMBDA_STAY_OPEN=1 -p 9001:9001 -v "$PWD":/var/task:ro,delegated --name lambcijava8al2 lambci/lambda:java8.al2 example.HandlerDivide::handleRequest

    # Invoke Lambda
    aws --endpoint-url=http://localhost:9001 lambda invoke --function-name test-j-div --payload '[235241,17]' outputJ.txt

    This works!
 

Tuesday, November 26, 2024

AWS Lambda on Localstack using Java-SdK-v1

Continuing with Localstack, next is a closer look into the code to deploy and execute AWS Lambda code on Localstack from AWS Java-Sdk-v1. The localstack-lambda-java-sdk-v1 code uses the same structure used in localstack-aws-sdk-examples & fills in for the missing AWS Lambda bit.

The LambdaService class has 3 primary methods: listFunctions(), createFunction() & invokeFunction(). The static AWSLambda client is set up with mock credentials and points to the Localstack endpoint.
 
The main() method first creates the function (createFunction()), if it does not exist.

  •  It builds a CreateFunctionRequest object with the handler, runtime, role, etc. specified
  •  It also reads the jar file of the Java executable from the resources folder into a FunctionCode object & adds it to the CreateFunctionRequest
  •  Next, a call is made to the AWSLambda client's createFunction() with the CreateFunctionRequest, which hits the running Localstack instance (Localstack set-up explained earlier).


If all goes well, control returns to main(), which invokes listFunctions() to show details of the created Lambda function (& all other existing functions).

Finally, there is a call from main() to the invokeFunction() method.

  •  It invokes the recently created function with an InvokeRequest object filled with some test values as the payload.
  •  The response from the invoked function is an InvokeResult object whose payload contains the result of the Lambda function's computation (see the sketch below).
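
A condensed sketch of that createFunction()/listFunctions()/invoke flow is below. It assumes the Localstack endpoint, dummy credentials and the same function name, handler, role & jar used in the CLI examples; the actual LambdaService code in the repo may differ in structure and naming.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.lambda.AWSLambda;
    import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
    import com.amazonaws.services.lambda.model.CreateFunctionRequest;
    import com.amazonaws.services.lambda.model.FunctionCode;
    import com.amazonaws.services.lambda.model.InvokeRequest;
    import com.amazonaws.services.lambda.model.InvokeResult;
    import com.amazonaws.services.lambda.model.ListFunctionsRequest;

    public class LambdaOnLocalstackSketch {

        // AWSLambda client with mock credentials, pointed at the Localstack endpoint
        private static final AWSLambda LAMBDA = AWSLambdaClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("test", "test")))
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://localhost:4566", "us-east-1"))
                .build();

        public static void main(String[] args) throws Exception {
            // createFunction(): jar read into a FunctionCode, added to the CreateFunctionRequest
            // (the jar path here is an assumed location, adjust as needed)
            ByteBuffer jar = ByteBuffer.wrap(Files.readAllBytes(
                    Paths.get("src/main/resources/original-java-basic-1.0-SNAPSHOT.jar")));
            LAMBDA.createFunction(new CreateFunctionRequest()
                    .withFunctionName("test-j-div")
                    .withRuntime("java8.al2")
                    .withHandler("example.HandlerDivide::handleRequest")
                    .withRole("arn:aws:iam::000000000000:role/lambda-test")
                    .withCode(new FunctionCode().withZipFile(jar)));

            // listFunctions(): show details of the functions now present on Localstack
            LAMBDA.listFunctions(new ListFunctionsRequest()).getFunctions()
                    .forEach(f -> System.out.println(f.getFunctionName()));

            // invokeFunction(): payload carries test values; the InvokeResult payload holds the output
            InvokeResult result = LAMBDA.invoke(new InvokeRequest()
                    .withFunctionName("test-j-div")
                    .withPayload("[200,9]"));
            System.out.println(StandardCharsets.UTF_8.decode(result.getPayload()));
        }
    }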

Comments welcome; localstack-lambda-java-sdk-v1 is available to play around with!

 

Monday, November 25, 2024

Getting Localstack Up and Running

In continuation of the earlier post on mocks for clouds, this article does a deep dive into getting up & running with Localstack. It consolidates the steps & best practices shared here, here & here. The Localstack set-up is on Ubuntu 20.04, with Java 8, Maven 3.8.x & Docker 24.0.x.

(I) Set-up

    # Install awscli
     sudo apt-get install awscli

    # Install localstack ver 3.8
        ## Issue 1: By default pip pulls in version 4.0, which gives an error:
        ## ERROR: Could not find a version that satisfies the requirement localstack-ext==4.0.0 (from localstack)

        python3 -m pip install localstack==3.8.1

--

    # Add to /etc/hosts
    127.0.0.1    localhost.localstack.cloud
    127.0.0.1    s3.localhost.localstack.cloud

--

    # Configure AWS from cli
    aws configure
    aws configure set default.region us-east-1
    aws configure set aws_access_key_id test
    aws configure set aws_secret_access_key test

    ## Manually configure AWS
    Add to ~/.aws/config
    endpoint_url = http://localhost:4566

    ## Add mock credentials
    Add to ~/.aws/credentials
    aws_access_key_id = test
    aws_secret_access_key = test

--

    # Download docker images needed by the Lambda function
        ## Issue 2: Do this beforehand, as Localstack gets stuck
        ## at the download-image stage unless the image is already available locally

    ## Pull java:8.al2
    docker pull public.ecr.aws/lambda/java:8.al2

    ## Pull nodejs (required for other nodejs Lambda functions)
    docker pull public.ecr.aws/lambda/nodejs:18

    ## Check images downloaded
    docker image ls

(II) Start Localstack

    # Start locally
    localstack start

    # Start as docker (add '-d' for daemon)
       ## Issue 3: Local directory's mount should be as per sample docker-compose

    docker-compose -f docker-compose-localstack.yaml up

    # Localstack up on URLs
    http://localhost:4566
    http://localhost.localstack.cloud:4566

    # Check Localstack Health
    curl http://localhost:4566/_localstack/info
    curl http://localhost:4566/_localstack/health

(III) AWS services on Localstack from CLI

  (a) S3
    # Create bucket named "test-buck"
    aws --endpoint-url=http://localhost:4566 s3 mb s3://test-buck

    # Copy item to bucket
    aws --endpoint-url=http://localhost:4566 s3 cp a1.txt s3://test-buck

    # List bucket
    aws --endpoint-url=http://localhost:4566 s3 ls s3://test-buck

--

  (b) SQS
    # Create queue named "test-q"
    aws --endpoint-url=http://localhost:4566 sqs create-queue --queue-name test-q

    # List queues
    aws --endpoint-url=http://localhost:4566 sqs list-queues

    # Get queue attribute
    aws --endpoint-url=http://localhost:4566 sqs get-queue-attributes --queue-url http://sqs.us-east-1.localhost.localstack.cloud:4566/000000000000/test-q --attribute-names All

--

  (c) Lambda
    aws --endpoint-url=http://localhost:4566 lambda list-functions

    # Create Java function
    aws --endpoint-url=http://localhost:4566 lambda create-function --function-name test-j-div --zip-file fileb://original-java-basic-1.0-SNAPSHOT.jar --handler example.HandlerDivide::handleRequest --runtime java8.al2 --role arn:aws:iam::000000000000:role/lambda-test

    # List functions
    aws --endpoint-url=http://localhost:4566 lambda list-functions

    # Invoke Java function
    aws --endpoint-url=http://localhost:4566 lambda invoke --function-name test-j-div --payload '[200,9]' outputJ.txt

    # Delete function
    aws --endpoint-url=http://localhost:4566 lambda delete-function --function-name test-j-div

(IV) AWS services on Localstack from Java-SDK

    # For S3 & Sqs - localstack-aws-sdk-examples, java sdk

    # For Lambda - localstack-lambda-java-sdk-v1
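
    For a flavour of what those examples look like, here is a minimal sketch (not the repo code) of an SQS round trip against Localstack with the AWS Java SDK v1; the endpoint, region & dummy credentials are the same assumed values used throughout this post.

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

    public class LocalstackSqsCheck {
        public static void main(String[] args) {
            // SQS client pointed at Localstack (assumed endpoint, region & dummy credentials)
            AmazonSQS sqs = AmazonSQSClientBuilder.standard()
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials("test", "test")))
                    .withEndpointConfiguration(
                            new EndpointConfiguration("http://localhost:4566", "us-east-1"))
                    .build();

            // Create the queue, send a message & read it back
            String queueUrl = sqs.createQueue("test-q").getQueueUrl();
            sqs.sendMessage(queueUrl, "hello localstack");
            sqs.receiveMessage(queueUrl).getMessages()
                    .forEach(m -> System.out.println(m.getBody()));
        }
    }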

 

Monday, August 12, 2024

To Mock a Cloud

Cloud hosting has been the norm for a while now. SaaS, PaaS, IaaS, serverless, AI, whatever the form may be, organizations (orgs) need to have a digital presence on the cloud.

Cloud vendors offer hundreds of features and services right off-the-shelf: 24x7 availability, fail-safe, load-balanced, auto-scaling, disaster-resilient distributed set-ups, edge compute, AI/ML clusters, LLMs, search, databases & data warehouses, among many others. They additionally provide a pay-as-you-go model, charging only for the services being used. Essentially everything that any org could ask for, today & in the future!

But it's not all rosy. The cloud bill (even though pay-as-you-go) does burn a hole in the pocket. While expenses for the live production (prod) environment are necessary, costs for the other internal environments (dev, test, etc.) could be largely reduced by replacing the real cloud with a mock cloud. This would additionally speed up dev and deployment times, and make bug fixes and devops much quicker & more streamlined.

As devs know, mocks, emulators, etc. are only as good as their implementation, i.e. how true they are to the real thing. It's a pain to find new/unknown bugs on prod only because it's an env very different from dev/test. Which dev worth their weight in salt (or gold!) hasn't seen this?

While using containers to mock up cloud services has been the traditional way of doing it, a couple of recent initiatives like Localstack, Moto, etc. seem promising. Though AWS-focussed for now, support for other clouds is likely soon. Various AWS services like S3, SNS, SQS, SES, Lambda, etc. are already supported at different levels of maturity. So go explore mocks for the cloud & happy coding!

Wednesday, April 17, 2013

Upload to Amazon S3 Bucket via Signed Url with Server Side Encryption

Continuing from my previous post on upload & download from an Amazon S3 bucket via signed URLs, here is how to enable Server Side Encryption (SSE) for the file being uploaded to S3.

Add an x-amz-server-side-encryption request parameter to the GeneratePresignedUrlRequest before getting the signed URL:
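
The original snippet isn't reproduced here, but the idea is roughly the following; a minimal sketch with the AWS Java SDK v1, where the bucket, key and expiry are placeholders:

    import java.net.URL;
    import java.util.Date;

    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

    public class SignedUrlWithSse {
        public static URL signedPutUrl(String bucket, String key) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Pre-signed PUT URL, valid for 15 minutes (placeholder expiry)
            GeneratePresignedUrlRequest request =
                    new GeneratePresignedUrlRequest(bucket, key, HttpMethod.PUT)
                            .withExpiration(new Date(System.currentTimeMillis() + 15 * 60 * 1000L));

            // Ask S3 to encrypt the uploaded object at rest (SSE-S3 / AES256)
            request.addRequestParameter("x-amz-server-side-encryption", "AES256");

            return s3.generatePresignedUrl(request);
        }
    }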


Wednesday, March 20, 2013

Uploading Large Files In Chunks To Amazon S3

A collection of best practices based on my experience building a scaled-out solution for a server-side file upload handler.

1. Authentication/ Authorization

2. Chunking

3. Stateless upload & Session

4. Shared memory for post file operations

5. Retries & Failover

6. Bulk operations

To be completed.. 
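
As a placeholder for the chunking part (item 2), below is a bare-bones sketch of S3's multipart upload flow with the AWS Java SDK v1 (initiate, upload parts, complete). It is only an illustration; it leaves out the auth, session, retry & failover concerns listed above.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
    import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
    import com.amazonaws.services.s3.model.PartETag;
    import com.amazonaws.services.s3.model.UploadPartRequest;

    public class ChunkedS3Upload {
        public static void upload(String bucket, String key, File file) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            long partSize = 5L * 1024 * 1024;   // 5 MB minimum part size (except the last part)

            // 1. Initiate the multipart upload & remember the uploadId
            String uploadId = s3.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

            // 2. Upload the file in chunks, collecting the part ETags
            List<PartETag> partETags = new ArrayList<>();
            long offset = 0;
            for (int partNumber = 1; offset < file.length(); partNumber++) {
                long size = Math.min(partSize, file.length() - offset);
                UploadPartRequest part = new UploadPartRequest()
                        .withBucketName(bucket).withKey(key)
                        .withUploadId(uploadId)
                        .withPartNumber(partNumber)
                        .withFile(file).withFileOffset(offset)
                        .withPartSize(size);
                partETags.add(s3.uploadPart(part).getPartETag());
                offset += size;
            }

            // 3. Complete the upload with the collected part ETags
            s3.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
        }
    }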

Wednesday, February 6, 2013

Mocking AWS ELB Behaviour Locally For Testing

Once hosted on Amazon, you make use of the AWS Elastic Load Balancer (ELB) for balancing load across your EC2s within or across Availability Zones (AZ). Since code gets developed and tested locally (outside of Amazon), at times you might want to test load-balancer scenarios before deploying to production. Here's one way to mock up the load balancer behaviour for local testing.

Use Apache (you could very well use something like Nginx instead) in a reverse-proxy, load-balancer set-up via mod_proxy & mod_proxy_balancer. Fairly simple for anyone with some experience configuring Apache. We used Apache as a load-balancer front-end to IIS locally, exactly the way ELB would load balance in front of production IIS.

Additionally, since ELB was also an SSL end point for our production servers, we set up Apache to be the SSL end point (via mod_ssl) on local. Apache was configured to listen on port 443 (using a self-signed certificate), and would forward all traffic from port 443 to backend IIS on port 80.

Once we had that set-up going, we were quickly able to reproduce an issue with application-generated Secure cookies not getting set properly across the client request/response. Once we had the fix locally (which was to set the flag on the cookies in the request, not the response), the same worked flawlessly on AWS as well.

Wednesday, October 10, 2012

Brewer's CAP Theorem


Brewer's CAP theorem talks about Consistency (C), Availability (A) and Partition tolerance (P) as the constraints that primarily govern the design of all distributed systems. There's a lot of literature available online explaining the theorem. The summary: given that network partitions (P) will happen, pick one of the other two, Consistency (C) or Availability (A), when designing your system, on a case-by-case basis (since you can't have all three)!

A partition could be caused by the failure of some component: hardware (routers, gateways, cables, physical boxes/nodes, disks, etc.) and/or software. When that happens:

- If you pick Consistency (C) => all systems, processing, etc. are blocked/held up until the failed component(s) recover.

This has been the default with traditional RDBMS (thanks to their being ACID compliant). For financial & banking applications this normally has to be the choice.

- On the other hand, if you pick Availability (A) => all systems, other than the currently partitioned/failed ones, continue to function as-is within their own partitions. Seems good? Well, not quite, because this obviously results in inconsistencies across the two (or more) partitioned sections.

Systems thus designed with Availability (A) as their selection (over C) must be able to live with inconsistencies across different partitions. Such systems also need some automated way to later get back to a consistent state (eventual consistency) once the partitioned/failed systems have recovered.

This is mostly the design choice with NoSQL stores, and also with services such as Amazon AWS, where eventual consistency within some reasonable time window (of a few seconds to a few minutes) is acceptable.