
Tuesday, June 9, 2020

Spark Concepts:

HashAggregate

Stage
Resource allocation depends on the number of messages/files
we would like to process.

Even if you allocate 16 cores, a single 100 KB file
produces only 1 task. Look at the task launch times to see
how many threads/tasks are running in parallel.
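
To see why a 100 KB file ends up as a single task, here is a rough sketch in plain Python (no Spark needed). It assumes a simplified model of how Spark SQL sizes file splits, using the default values of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB). The function name estimate_read_tasks is made up for illustration, and the model only covers one splittable file, so treat the numbers as an estimate, not as what Spark will report in every case.

```python
import math

# Default values of spark.sql.files.maxPartitionBytes and
# spark.sql.files.openCostInBytes (assumed unchanged here).
MAX_PARTITION_BYTES = 128 * 1024 * 1024
OPEN_COST_BYTES = 4 * 1024 * 1024


def estimate_read_tasks(file_size_bytes, cores):
    """Roughly estimate how many read tasks one splittable file produces."""
    # The file is padded with an "open cost" before the work is divided by cores.
    bytes_per_core = (file_size_bytes + OPEN_COST_BYTES) // cores
    # Split size is capped by maxPartitionBytes and floored by the open cost.
    max_split_bytes = min(MAX_PARTITION_BYTES, max(OPEN_COST_BYTES, bytes_per_core))
    # One task per split; a tiny file still needs one task.
    return max(1, math.ceil(file_size_bytes / max_split_bytes))


small = estimate_read_tasks(100 * 1024, cores=16)  # 100 KB file, 16 cores
large = estimate_read_tasks(1024 ** 3, cores=16)   # 1 GiB file, 16 cores
print(small, large)
```

Under these assumptions the small file gives 1 task no matter how many cores are allocated, while the 1 GiB file is split into enough tasks to keep the cores busy.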

Always look at the Tasks - each task runs as a thread inside an executor.



A * before an operator in the physical plan is an optimization: it marks whole-stage code generation.


Exchange: a shuffle is a costly operation and should be avoided as much as possible.
HashPartitioning: matching records are brought to the same machine
so that each machine can work independently.
It's costly because I/O is involved.
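
A small sketch of the idea behind hash partitioning, in plain Python. It is not Spark's implementation: zlib.crc32 is only a deterministic stand-in for Spark's own hash function, and hash_partition is a hypothetical helper name. It shows that records with the same key always land in the same partition.

```python
import zlib
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Group (key, value) records into partitions by a hash of the key."""
    partitions = defaultdict(list)
    for key, value in records:
        # crc32 is a stand-in here; Spark uses its own hash function.
        index = zlib.crc32(str(key).encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return dict(partitions)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
partitions = hash_partition(records, num_partitions=4)
for index, rows in sorted(partitions.items()):
    print(index, rows)
```

In a real cluster each partition lives on some executor, so moving records to their target partition means writing and reading data over disk and network. That movement is the shuffle cost.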


SortMergeJoin
Range join.

We can give hints.

sparksession.conf.set