// Split each line on ":" into a (key, value) pair, group by key,
// and print each key with the number of values it received.
originalRDD.map { line =>
  val parts = line.split(":")
  (parts(0), parts(1))
}.groupByKey().collect().foreach { case (key, values) =>
  println((key, values.size))
}
An error hit while running the job (a network connection between Spark processes timed out):
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Operation timed out: acbc32a57245.target.com/10.112.144.201:60678
Input file size: 320 MB.
Read locally, it is split at the default local block size of 32 MB, giving about 10 partitions (320 / 32 = 10).
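A quick way to confirm the split count (a minimal sketch; the file path is a placeholder):

// Read the file and check how many partitions Spark created.
val rdd = sc.textFile("/path/to/320mb-file.txt")
println(rdd.getNumPartitions)  // expect ~10 with the default 32 MB local block size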
groupByKey leads to:
1. a lot of shuffling, since every record crosses the network
2. only a few machines doing work, because parallelism is bounded by the number of distinct keys
3. a chance of memory spill, since all values for one key must be held together on one executor
What is the difference between groupByKey and reduceByKey?
With reduceByKey, local aggregation happens first, much like the local combiner in MapReduce:
each block is reduced locally, and only those partial results are shuffled to the reducers.
Don't use groupByKey; use reduceByKey instead (see the sketch below).
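A minimal side-by-side sketch (the input file and pair layout are assumptions, not from the original job):

val pairs = sc.textFile("input.txt").map(line => (line.split(":")(0), 1))

// groupByKey shuffles every (key, 1) record and counts after the shuffle.
val viaGroup = pairs.groupByKey().mapValues(_.size)

// reduceByKey sums within each partition first, so far less data is shuffled.
val viaReduce = pairs.reduceByKey(_ + _)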
What is the difference between reduce and reduceByKey?
reduce is an action: it returns a single value to the driver as a local variable.
reduceByKey is a transformation: it returns an RDD.
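For example (a small sketch with made-up data):

// action: returns a plain Int on the driver (55)
val total: Int = sc.parallelize(1 to 10).reduce(_ + _)

// transformation: returns another RDD, evaluated lazily
val sums = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).reduceByKey(_ + _)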
What is the difference between narrow and wide transformations?
narrow - no shuffling; each output partition depends on a single input partition (e.g. map, filter).
wide - requires a shuffle; an output partition depends on many input partitions (e.g. groupByKey, reduceByKey).
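Illustration (made-up data):

// narrow: map and filter run per partition, no data movement
val narrow = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)

// wide: reduceByKey must shuffle so that equal keys meet on one partition
val wide = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)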
What is a stage?
A stage is a set of tasks that can run without a shuffle; every wide transformation introduces a stage boundary.
Why is Spark lazy?
Laziness lets Spark look at the whole plan before running anything:
predicate pushdown: internally reorders the plan, pushing filters closer to the data source so less data flows through later steps.
pipelining: consecutive narrow transformations are combined into a single pass over the data (see the sketch below).
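A small sketch of the pipelining point (the file name is a placeholder):

// Nothing executes here: map and filter are only recorded in the lineage.
val errors = sc.textFile("logs.txt").map(_.toLowerCase).filter(_.contains("error"))

// The action triggers execution; map and filter are pipelined into one pass per partition.
val n = errors.count()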
Why are RDDs immutable?
To achieve resiliency: a lost partition can be rebuilt by replaying the lineage, which is only safe if the inputs never change.
countByValue vs reduceByKey:
If the operation is final and no further RDD operations are needed, use countByValue (an action returning a local Map).
If further RDD operations are needed, use reduceByKey (a transformation returning an RDD).
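For instance (made-up data):

// action: brings a Map[String, Long] back to the driver, the chain ends here
val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()

// transformation: stays distributed, so more RDD operations can follow
val frequent = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1)).reduceByKey(_ + _).filter(_._2 > 1)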
What is the difference between cache and persist?
cache always stores in memory only.
persist can use any storage level: memory, disk, or memory-and-disk.
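In code (cleanedRdd and bigRdd are illustrative names):

import org.apache.spark.storage.StorageLevel

cleanedRdd.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)

bigRdd.persist(StorageLevel.MEMORY_AND_DISK)  // spills to disk when memory is tight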
rdd1, rdd2, rdd3, rdd4 are a chain of transformations.
If, at the end of such a chain, all the data has been cleaned and we want to mark that point for reuse,
use rdd4.cache.
If it does not fit in memory, use persist with a disk-backed level instead.
rdd1, rdd2, rdd3, rdd4.cache, rdd5, rdd6, then action1 and action2:
action1: all transformations rdd1 through rdd6 run, and rdd4 is materialized into the cache along the way.
action2: only rdd4 (read from the cache) through rdd6 run; rdd1 to rdd3 are skipped.
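The same pattern as runnable code (the file path and the individual transformations are illustrative):

val rdd4 = sc.textFile("events.txt")   // rdd1
  .map(_.split(":"))                   // rdd2
  .filter(_.length >= 2)               // rdd3
  .map(a => (a(0), a(1)))              // rdd4
  .cache()                             // cleaned data, kept in memory

val rdd6 = rdd4.mapValues(_.length)    // rdd5
  .filter(_._2 > 0)                    // rdd6

rdd6.count()    // action1: runs rdd1..rdd6 and fills the cache
rdd6.collect()  // action2: starts from the cached rdd4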
======================================================================================================
cd /Users/z002qhl/Documents/Spark/spark-2.4.1-bin-hadoop2.7/bin
./spark-submit --class com.basan.day3.ErrorWarnCount /Users/z002qhl/Desktop/sparkDemo.jar