Powered By Blogger

Saturday, November 23, 2019

Broadcast join

stages are marked by boundaries

There are 2 kind of transformations
Narrow transformations and wide transformations

Narrow transformations : works on data locality. No shuffle not required
map, filter


wide transformations : shuffle happens
reduceBykey
groupByKey

wide transformations takes more time.

==================
Difference between repartition and coalesce
getting number of partitons

rdd.getNumpartitions

To change the partition we can use

repartition or coalesce.

To increase the number of partitions repartition has to be used.
TO decrease the number of partitions repartition and coalesce can be used. Ideally go with coalesce for decreasing partitions.

coalesce : it will try to avoid shuffling.

1 GB file in HDFS can create - 8 partitions assuming block size of 128MB, assuming 4 data nodes

rdd.coalesce(4) - no shuffle happens
rdd.coalesce(2) - shuffle happens

It might so happen that we would have 1000s of blocks, and each block we might have very less records
then we coalesce and reduce the partitions.

======================================================
How many times each movie is watched.

Map side join : smaller data set was kept in each machine.
================== ====================================

Map side join is nothting but Broadcast join


Broadcast  join :
smaller data been broadcasted and joined.

In the driver side broadcast it
    // Create a broadcast variable of our ID -> movie name map
    var nameDict = sc.broadcast(loadMovieNames)


broadcasted map can be accessed as below nameDict.value(x._2)
 val sortedMoviesWithNames = sortedMovies.
      map(x => (nameDict.value(x._2), x._1))

No comments:

Post a Comment