Monday, January 6, 2025

Spark API Categorization

A way to categorize Spark API features:

  • Flow of data is generally across the category swim lanes, from creation of a New Spark Context to reading data using I/O to Filter, Map/ Transform, Reduce/ Agg etc Action.
  • Lazy processing upto Transformation.
  • Steps only get executed once an Action is invoke.
  • Post Actions (Reduce, Collect, etc) there could again be I/O, thus the reverse flow from Action
  • Partition is a cross cutting concern across all layers. For I/O, Transformations, Actions could be across all or a few Partitions.
  • forEach on the Stream could be at either at Transform or Action levels.

The diagram is based on code within various Spark test suites

No comments:

Post a Comment