nr: #1 added: 2016-12-27 08:12
How do stages execute in a Spark job?
Stages of a job can run in parallel if there are no dependencies among them.
In Spark, stages are split at boundaries. There are shuffle stages, which end where a shuffle-producing transformation such as reduceByKey splits the graph, and there are result stages, which yield a result without causing a shuffle, e.g. a map operation:
(Picture provided by Cloudera)
groupByKey is a shuffle stage; you can see the split in the pink boxes, which mark the stage boundaries.
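To make the boundary concrete, here is a minimal sketch (the object name, app name, and input path are hypothetical) that builds a similar lineage. toDebugString prints the RDD lineage with a new indentation level at each ShuffledRDD, i.e. at each stage boundary:

    import org.apache.spark.{SparkConf, SparkContext}

    object StageBoundaryDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("stage-boundaries").setMaster("local[*]"))

        val counts = sc.textFile("hdfs:///input/words.txt")  // hypothetical path
          .map(_.toLowerCase)       // narrow: stays in the same stage
          .filter(_.nonEmpty)       // narrow: pipelined with the map above
          .map(word => (word, 1))
          .groupByKey()             // wide: shuffle, a new stage starts here
          .map { case (word, ones) => (word, ones.size) }

        // Everything indented under the ShuffledRDD line in this output
        // belongs to the stage before the boundary.
        println(counts.toDebugString)
        sc.stop()
      }
    }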
Internally, a stage is further divided into tasks, one per partition of the data. E.g. in the picture above, the first row, which does textFile -> map -> filter, runs as one task per partition, with all three transformations pipelined together inside each task.
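For example (reusing the sc from the sketch above; the path and the partition count of 3 are assumptions), the number of tasks in the stage follows the number of partitions, not the number of transformations:

    // One task per partition: both narrow transformations run pipelined
    // inside each task, over that task's partition.
    val lines    = sc.textFile("hdfs:///input/data.txt", 3) // hypothetical path, 3 partitions
    val filtered = lines.map(_.trim).filter(_.nonEmpty)
    println(filtered.getNumPartitions) // 3 => this stage executes as 3 tasks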
When one transformation's output is another transformation's input, the stages must execute serially. But if stages are unrelated, e.g. the hadoopFile -> groupByKey -> map branch and the textFile -> map -> filter branch in the picture, they can run in parallel. Once a dependency is declared between them, execution continues serially from that stage on.
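A sketch of that situation (paths and field layout are assumptions, again reusing sc from above): the two lineages below have no dependency on each other, so the DAG scheduler is free to run their stages concurrently; the join is where a dependency is declared, and execution synchronizes there:

    // Branch A: an independent lineage ending in a shuffle stage.
    val clicks = sc.textFile("hdfs:///logs/clicks.txt")     // hypothetical path
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)

    // Branch B: another independent lineage; its stage can run in
    // parallel with branch A's stages.
    val names = sc.textFile("hdfs:///users/names.txt")      // hypothetical path
      .map { line =>
        val cols = line.split(",")
        (cols(0), cols(1))
      }

    // join declares a dependency on both branches; from this stage on,
    // execution proceeds serially.
    clicks.join(names).saveAsTextFile("hdfs:///out/report") // action: triggers the job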