Algorithms, Design, Code and more: Spark API Categorization

Monday, January 6, 2025

A way to categorize Spark API features:

Data generally flows across the category swim lanes: from creating a new Spark Context, to reading data via I/O, to Filter and Map/Transform, and finally to Reduce/Agg and other Actions.
Processing is lazy up to and including the Transformation stage.
Steps are executed only once an Action is invoked.
After an Action (Reduce, Collect, etc.) there can again be I/O, hence the reverse flow from Action.
Partition is a cross-cutting concern across all layers: I/O, Transformations, and Actions can each operate across all Partitions or only a few.
forEach on the Stream can sit at either the Transform or the Action level.
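
The lazy flow described above can be illustrated with a small sketch. Note that this is not Spark code: plain Python generators stand in for the I/O, Filter and Map stages, and functools.reduce stands in for the Action. All names here (read_numbers, keep_even, square, log) are hypothetical and chosen only for illustration.

```python
from functools import reduce

log = []  # records each step at the moment it actually runs


def read_numbers():
    # Stand-in for the I/O stage: yields records one at a time.
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


def keep_even(records):
    # Stand-in for a Filter transformation.
    for n in records:
        log.append(f"filter {n}")
        if n % 2 == 0:
            yield n


def square(records):
    # Stand-in for a Map transformation.
    for n in records:
        log.append(f"map {n}")
        yield n * n


# Chaining the stages only describes the pipeline; nothing has run yet.
pipeline = square(keep_even(read_numbers()))
steps_before_action = len(log)

# The "Action": reducing the pipeline pulls every record through all stages.
total = reduce(lambda a, b: a + b, pipeline, 0)
steps_after_action = len(log)

print(steps_before_action, steps_after_action, total)
```

The log stays empty until the reduce call consumes the pipeline, which mirrors the point that steps run only once an Action is invoked. In Spark itself the same shape would be a chain of transformations ending in an action, with the work additionally split across Partitions, which this single-process sketch does not model.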
