Spark Concepts:
HashAggregate
Stage
Resource allocation depends on the number of messages/files
we would like to process.
Even if you allocate 16 cores, a single 100 KB file
will run as only 1 task. Look at the task launch times to see
how many threads/tasks are actually running in parallel.
Always look at Tasks - each task runs as a thread inside an executor.
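To make the 16 cores vs. 1 task point concrete, here is a rough sketch in plain Python. It is a simplified model, not Spark's actual split logic: it assumes one splittable file, uses only the default value of spark.sql.files.maxPartitionBytes (128 MB), and ignores file open cost and per-core adjustments. The helper names estimate_tasks and busy_cores are made up for illustration.

```python
import math

# Default value of spark.sql.files.maxPartitionBytes (128 MB).
MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_tasks(file_size_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough number of read tasks for one splittable file (hypothetical helper)."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))


def busy_cores(file_size_bytes, allocated_cores):
    """Tasks that can run at once: capped by both the task count and the cores."""
    return min(estimate_tasks(file_size_bytes), allocated_cores)


small = 100 * 1024       # the 100 KB file from the note
large = 4 * 1024 ** 3    # a 4 GB file for comparison

print(estimate_tasks(small), busy_cores(small, 16))   # 1 1
print(estimate_tasks(large), busy_cores(large, 16))   # 32 16
```

In this model the 100 KB file keeps 15 of the 16 cores idle, which is what the task launch times in the UI would show.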
* in front of an operator in the physical plan marks an optimization (whole-stage code generation).
Exchange: a shuffle is a costly operation. Avoid it as much as possible.
HashPartitioning: matching records are brought to the same machine
so that each task can work independently.
It is costly because I/O is involved.
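The idea behind hash partitioning can be sketched in plain Python. This is only an illustration: zlib.crc32 stands in for the hash function Spark uses internally, and partition_for and shuffle are hypothetical helpers, not Spark APIs. The point is that the partition number depends only on the key, so records with the same key from both sides of a join end up in the same partition.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from the key alone (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def shuffle(records, num_partitions):
    """Group (key, value) records by target partition, like an Exchange would."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


orders = [("u1", "book"), ("u2", "pen"), ("u1", "lamp")]
users = [("u1", "Ana"), ("u2", "Raj")]

order_parts = shuffle(orders, 4)
user_parts = shuffle(users, 4)

# Each partition now holds every record for its keys from both datasets,
# so it can be joined without talking to any other partition.
for p in sorted(set(order_parts) | set(user_parts)):
    print(p, order_parts.get(p, []), user_parts.get(p, []))
```

Moving every record to its target partition is the I/O cost the note refers to.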
SortMergeJoin
RangeJoin
We can give hints.
sparksession.conf.set
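A minimal PySpark sketch of both knobs, assuming Spark 3.x with pyspark installed and a local session. The DataFrames, app name, and config values are placeholders chosen for illustration, not recommendations.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("join-hints-sketch")   # placeholder name
    .getOrCreate()
)

# Session-level settings via conf.set (values here are illustrative only).
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # disable auto broadcast

orders = spark.createDataFrame([(1, "book"), (2, "pen")], ["id", "item"])
users = spark.createDataFrame([(1, "Ana"), (2, "Raj")], ["id", "name"])

# Hint asking for a sort-merge join; explain() prints the physical plan.
orders.join(users.hint("merge"), "id").explain()

# Hint asking for a broadcast join, which avoids shuffling the large side.
orders.join(broadcast(users), "id").explain()

spark.stop()
```

The plans printed by explain() are where operators such as Exchange, HashAggregate, and SortMergeJoin show up, so comparing the two outputs is a quick way to check whether a hint was honored.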