stages are marked by boundaries
There are 2 kind of transformations
Narrow transformations and wide transformations
Narrow transformations : works on data locality. No shuffle not required
map, filter
wide transformations : shuffle happens
reduceBykey
groupByKey
wide transformations takes more time.
==================
Difference between repartition and coalesce
getting number of partitons
rdd.getNumpartitions
To change the partition we can use
repartition or coalesce.
To increase the number of partitions repartition has to be used.
TO decrease the number of partitions repartition and coalesce can be used. Ideally go with coalesce for decreasing partitions.
coalesce : it will try to avoid shuffling.
1 GB file in HDFS can create - 8 partitions assuming block size of 128MB, assuming 4 data nodes
rdd.coalesce(4) - no shuffle happens
rdd.coalesce(2) - shuffle happens
It might so happen that we would have 1000s of blocks, and each block we might have very less records
then we coalesce and reduce the partitions.
======================================================
How many times each movie is watched.
Map side join : smaller data set was kept in each machine.
================== ====================================
Map side join is nothting but Broadcast join
Broadcast join :
smaller data been broadcasted and joined.
In the driver side broadcast it
// Create a broadcast variable of our ID -> movie name map
var nameDict = sc.broadcast(loadMovieNames)
broadcasted map can be accessed as below nameDict.value(x._2)
val sortedMoviesWithNames = sortedMovies.
map(x => (nameDict.value(x._2), x._1))
There are 2 kind of transformations
Narrow transformations and wide transformations
Narrow transformations : works on data locality. No shuffle not required
map, filter
wide transformations : shuffle happens
reduceBykey
groupByKey
wide transformations takes more time.
==================
Difference between repartition and coalesce
getting number of partitons
rdd.getNumpartitions
To change the partition we can use
repartition or coalesce.
To increase the number of partitions repartition has to be used.
TO decrease the number of partitions repartition and coalesce can be used. Ideally go with coalesce for decreasing partitions.
coalesce : it will try to avoid shuffling.
1 GB file in HDFS can create - 8 partitions assuming block size of 128MB, assuming 4 data nodes
rdd.coalesce(4) - no shuffle happens
rdd.coalesce(2) - shuffle happens
It might so happen that we would have 1000s of blocks, and each block we might have very less records
then we coalesce and reduce the partitions.
======================================================
How many times each movie is watched.
Map side join : smaller data set was kept in each machine.
================== ====================================
Map side join is nothting but Broadcast join
Broadcast join :
smaller data been broadcasted and joined.
In the driver side broadcast it
// Create a broadcast variable of our ID -> movie name map
var nameDict = sc.broadcast(loadMovieNames)
broadcasted map can be accessed as below nameDict.value(x._2)
val sortedMoviesWithNames = sortedMovies.
map(x => (nameDict.value(x._2), x._1))
No comments:
Post a Comment