In this article, I will try to explain how Spark works internally and what the components of execution are: jobs, tasks, and stages.
Jobs, Tasks, and Stages
https://dzone.com/articles/how-spark-internally-executes-a-program
Once you call an action on an RDD, the SparkContext, which runs inside
the driver, submits a job for that action.
The driver creates the DAG (directed acyclic graph) or execution plan (job)
for your program. Once the DAG is created, the driver divides this DAG into
a number of stages. These stages are then divided into smaller tasks
and all the tasks are given to the executors for execution.
But why did Spark divide this program into two stages? Why not more or fewer? It depends on shuffling. Whenever a transformation requires Spark to shuffle data across partitions (for example, reduceByKey or join), Spark starts a new stage at that point. Transformations that do not require a shuffle (for example, map or filter) are pipelined together into a single stage.
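The shuffle rule can be sketched with a small toy model in plain Python. This is not Spark's API or its actual scheduler: the function name `split_into_stages` and the `WIDE_TRANSFORMATIONS` set are made up for illustration, and the set lists only a few of the operations that shuffle data.

```python
# Toy model, NOT Spark code: cut a chain of transformations into stages.
# WIDE_TRANSFORMATIONS is an illustrative subset of shuffle-producing operations.
WIDE_TRANSFORMATIONS = {"reduceByKey", "groupByKey", "join", "sortByKey", "repartition"}


def split_into_stages(transformations):
    """Start a new stage whenever a transformation needs a shuffle."""
    stages = [[]]
    for name in transformations:
        # A shuffle boundary closes the current stage (if it has any work in it).
        if name in WIDE_TRANSFORMATIONS and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages


# A word-count style lineage: only reduceByKey needs a shuffle.
word_count = ["textFile", "flatMap", "map", "reduceByKey"]
for stage_id, stage in enumerate(split_into_stages(word_count)):
    print(f"stage {stage_id}: {' -> '.join(stage)}")
```

With one shuffle in the lineage the model yields two stages. A lineage made only of operations such as map and filter would stay in a single stage.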
But why did Spark create only two tasks for each stage? It depends on the number of partitions: each stage runs one task per partition.
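As a rough illustration of the one-task-per-partition rule, the sketch below lists the tasks for a job whose stages each work on a given number of partitions. It is a plain Python toy model, and `plan_tasks` is a hypothetical helper, not part of Spark.

```python
# Toy model, NOT Spark code: one task per partition in every stage.
def plan_tasks(partitions_per_stage):
    """Return one (stage_id, partition_id) pair for each task."""
    return [
        (stage_id, partition_id)
        for stage_id, num_partitions in enumerate(partitions_per_stage)
        for partition_id in range(num_partitions)
    ]


# Two stages with two partitions each give two tasks per stage.
tasks = plan_tasks([2, 2])
print(tasks)
print(len(tasks), "tasks in total")
```

Changing the partition count of a stage changes its task count in the same way, which is why repartitioning the data changes how many tasks a stage runs.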
DAG (job) --> split into a number of stages --> each stage split into tasks --> tasks are given to executors --> executors run on a machine.
One machine can have multiple executors with multiple cores.
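To see how executors and cores limit parallelism, here is a back-of-the-envelope sketch in plain Python. It assumes every task needs the same number of cores (Spark's `spark.task.cpus` setting defaults to 1) and that all tasks take about the same time. The helper names `task_slots` and `waves` are hypothetical.

```python
import math


# Toy estimate, NOT Spark code: how many tasks can run at the same time.
def task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Each executor can run (cores // cpus_per_task) tasks concurrently."""
    return num_executors * (cores_per_executor // cpus_per_task)


def waves(num_tasks, slots):
    """Number of rounds needed if every slot runs one task per round."""
    return math.ceil(num_tasks / slots)


slots = task_slots(num_executors=2, cores_per_executor=2)
print(slots, "tasks can run in parallel")
print(waves(num_tasks=10, slots=slots), "waves for a 10-task stage")
```

Under these assumptions, a stage with more tasks than slots runs in several waves, so the partition count and the executor layout together determine how long a stage takes.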