A book by Damien Benveniste of AIEdge. Though a work in progress, chapters 2-4, available for preview, are fantastic.
Looking forward to a paperback edition, which I certainly hope to own...
Mozilla pedigree, AI focus, Open-source, Dev oriented.
Blueprint Hub: Mozilla.ai's hub of open-source, templatized, customizable AI solutions for developers.
Lumigator: Platform for model evaluation and selection. Consists of a Python FastAPI backend for AI lifecycle management and for capturing workflow data useful for evaluation.
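Not Lumigator's actual API, just a minimal FastAPI sketch of that backend pattern (endpoint and schema made up for illustration):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EvalRecord(BaseModel):  # hypothetical workflow-data record
    model_name: str
    metric: str
    score: float

records = []  # in-memory store, illustration only

@app.post("/evaluations")
def capture(record: EvalRecord):
    records.append(record)  # capture workflow data for later evaluation
    return {"stored": len(records)}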
Streamlit is a web wrapper for Data Science projects in pure Python. It's a lightweight, simple, rapid prototyping web app framework for sharing scripts.
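A minimal sketch of such a script (file name and data made up; assumes streamlit and pandas are installed):

# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Quick data explorer")
df = pd.DataFrame({"x": list(range(10)), "y": [v * v for v in range(10)]})
st.dataframe(df)                  # interactive table
st.line_chart(df.set_index("x"))  # one-line chart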
Quick notes on the Chinchilla Scaling Law, its limits & beyond, for Deep Learning and LLMs.
Factors: model size (number of parameters), dataset size (number of training tokens), and the compute budget.
The intuitive way, one expects the loss to keep dropping as parameters, training tokens and compute are scaled up together.
Beyond common sense, though, theoretical foundations linking the factors aren't available right now. Perhaps the nature of the problem makes it hard (NP).
The next best thing, then, is to somehow work out the relationships/bounds empirically: by training and measuring existing Deep Learning models, LLMs, etc. on data sets spanning TBs/PBs, with up to trillions of parameters, on compute budgets that cumulatively span years.
Papers by Hestness & Narang, Kaplan et al. and Chinchilla are all attempts along this empirical route, as are more recent papers from Mosaic, DeepSeek, the MoE line of work, Llama-3 and Microsoft, among many others.
Key takeaway being that model parameters and training tokens should be scaled up in roughly equal proportion, which for Chinchilla works out to about 20 training tokens per parameter for compute-optimal training.
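As a back-of-the-envelope sketch of that takeaway (assuming the standard C ≈ 6·N·D FLOPs approximation and the ~20 tokens-per-parameter rule of thumb; the helper below is my illustration, not from any of the papers):

import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    # C ~= 6*N*D with D ~= 20*N  =>  N = sqrt(C / (6*20)), D = 20*N
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

n, d = chinchilla_optimal(5.8e23)  # roughly Chinchilla's compute budget
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7e10 params, ~1.4e12 tokens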
Diffusion
Non-equilibrium Thermodynamics
Gaussian Noise
Variational Inference
Loss Functions
Score Based Generative Model
Conditional (Guided) Generation
Latent Variable Generative Model
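For orientation, the standard forward-noising process these topics build around, in the usual DDPM notation (a sketch, not this post's own derivation):

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big)

where \alpha_t = 1-\beta_t and \bar\alpha_t = \prod_{s=1}^{t} \alpha_s.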
A way to categorize Spark API features:
The diagram is based on code within various Spark test suites.
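The diagram itself isn't reproduced here. As a rough illustration of such a categorization (my grouping for illustration, not necessarily the diagram's), the same word count can be expressed through the RDD, DataFrame and SQL layers of the API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-layers").getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["word"])

# RDD layer: functional transformations on raw records
rdd_counts = df.rdd.map(lambda r: (r["word"], 1)).reduceByKey(lambda x, y: x + y)

# DataFrame layer: declarative, optimized by Catalyst
df_counts = df.groupBy("word").count()

# SQL layer: plain SQL over a temp view
df.createOrReplaceTempView("words")
sql_counts = spark.sql("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")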
Continuing with the same PySpark (ver. 2.1.0, Python 3.5, etc.) setup explained in an earlier post. In order to connect to the mocked Kinesis stream on Localstack from PySpark, use the kinesis_wordcount_asl.py script located in Spark's external/ (connector/) folder.
(a) Update value of master in kinesis_wordcount_asl.py
Update the value of master (local[n], spark://localhost:7077, etc.) in the SparkContext in kinesis_wordcount_asl.py:
sc = SparkContext(appName="PythonStreamingKinesisWordCountAsl",master="local[2]")
(b) Add the compiled aSpark jars to the Spark Driver/Executor classpath
As explained in step (III) of an earlier post, to work with Localstack a few changes were made to KinesisReceiver.scala's onStart() to explicitly set the endpoint on the kinesis, dynamoDb & cloudWatch clients. Accordingly, the compiled aSpark jars with those modifications need to be added to the Spark Driver/Executor classpath (a Python sketch of the endpoint idea follows the exports below):
export aSPARK_PROJ_HOME="/Download/Location/aSpark"
export SPARK_CLASSPATH="${aSPARK_PROJ_HOME}/target/original-aSpark_1.0-2.1.0.jar:${aSPARK_PROJ_HOME}/target/scala-2.11/classes:${aSPARK_PROJ_HOME}/target/scala-2.11/jars/*"
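For flavor, the same endpoint-override idea sketched in Python with boto3 (illustrative only; the post's actual change lives in the Scala receiver code):

import boto3

endpoint = "http://localhost:4566"  # Localstack edge port
# each client needs the explicit endpoint so it talks to Localstack, not AWS
kinesis = boto3.client("kinesis", region_name="us-east-1", endpoint_url=endpoint)
dynamodb = boto3.client("dynamodb", region_name="us-east-1", endpoint_url=endpoint)
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1", endpoint_url=endpoint)
print(kinesis.list_streams()["StreamNames"])  # e.g. ['myFirstStream']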
(c) Ensure SPARK_HOME, PYSPARK_PYTHON & PYTHONPATH variables are exported.
(d) Run kinesis_wordcount_asl
python3.5 ${SPARK_HOME}/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py SampleKinesisApplication myFirstStream http://localhost:4566/ us-east-1
# With the job running, put a record into the stream from another terminal
aws --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
In this post we get a Spark Streaming application working with an AWS Kinesis stream, a mocked version of Kinesis running locally on Localstack. In earlier posts we have explained how to get Localstack running and various AWS services up on Localstack. The client connections to AWS services (Localstack) are made using the AWS CLI and AWS Java SDK v1.
Environment: This set-up continues on Ubuntu 20.04, with Java 8, Maven 3.6.x, Docker 24.0.x, Python 3.5, PySpark/Spark 2.1.0, Localstack 3.8.1 and AWS Java SDK v1 (ver. 1.12.778).
Once the Localstack installation is done, steps to follow are:
(I) Start Localstack
# Start locally
localstack start
That should get Localstack running on: http://localhost:4566
(II) Check Kinesis services from CLI on Localstack
# List Streams
aws --endpoint-url=http://localhost:4566 kinesis list-streams
# Create Stream
aws --endpoint-url=http://localhost:4566 kinesis create-stream --stream-name myFirstStream --shard-count 1
# List Streams
aws --endpoint-url=http://localhost:4566 kinesis list-streams
# describe-stream-summary
aws --endpoint-url=http://localhost:4566 kinesis describe-stream-summary --stream-name myFirstStream
# Put Record
aws --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
aws --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"
(III) Connect to Kinesis from Spark Streaming
# Build
mvn install -DskipTests=true -Dcheckstyle.skip
# Run JavaKinesisWordCountASL against Localstack, e.g. (assuming the standard Spark examples runner; adjust to your build):
bin/run-example streaming.JavaKinesisWordCountASL SampleKinesisApplication myFirstStream http://localhost:4566
(IV) Add Data to Localstack Kinesis & View Counts on Console
a) Put records from the CLI
aws --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata abcd"
aws --endpoint-url=http://localhost:4566 kinesis put-record --stream-name myFirstStream --partition-key 123 --data "testdata efgh"
b) Alternatively, put records from a Java Kinesis application
Download, build & run AmazonKinesisRecordProducerSample.java
c) Now check the output console of the JavaKinesisWordCountASL run in step (III) above. Counts of the words streamed from Localstack Kinesis will be displayed on the console.