A way to categorize Spark API features:
- Data generally flows across the category swim lanes: from creating a new Spark Context, to reading data via I/O, through Filter and Map/Transform steps, and on to Reduce/Agg and other Actions.
- Processing is lazy up to and including the Transformation stage.
- Steps are executed only once an Action is invoked.
- After an Action (Reduce, Collect, etc.) there can be further I/O, hence the reverse flow from Action back to I/O.
- Partitioning is a cross-cutting concern across all layers: I/O, Transformations, and Actions can each operate on all Partitions or on only a few.
- forEach on a Stream can sit at either the Transform level or the Action level.
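
The lazy-until-Action behaviour described above can be sketched in plain Python. This is only a toy analogy, not Spark's API: `LazyDataset`, its methods, and the `double` helper are hypothetical names made up for illustration. Transformations merely record a step, and an Action walks every partition and runs the recorded steps.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a partitioned, lazily evaluated dataset (not a Spark RDD)."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions   # list of lists, one inner list per partition
        self.steps = tuple(steps)      # recorded transformations, not yet executed

    # Transformations: record the step and return a new dataset; nothing runs yet.
    def map(self, fn):
        return LazyDataset(self.partitions, self.steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self.partitions, self.steps + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step, in order, to a single partition.
        items = iter(part)
        for kind, fn in self.steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: these trigger execution of the recorded steps on all partitions.
    def collect(self):
        return [x for part in self.partitions for x in self._run_partition(part)]

    def reduce(self, fn):
        return reduce(fn, self.collect())


calls = []  # records every element that `double` actually processes


def double(x):
    calls.append(x)
    return x * 2


data = LazyDataset([[1, 2, 3], [4, 5, 6]])           # two partitions
pipeline = data.filter(lambda x: x % 2 == 0).map(double)
calls_before_action = list(calls)                     # [] : nothing has run so far

result = pipeline.collect()                           # Action: steps execute now
total = pipeline.reduce(lambda a, b: a + b)           # second Action re-runs the steps
```

In this sketch each Action re-runs the recorded steps from scratch, which loosely mirrors how Spark recomputes a lineage that has not been cached.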
The diagram is based on code within various Spark test suites.