// Split each line on ":" into a (key, value) pair, group by key,
// and print each key with the number of values it received.
originalRDD.map { line =>
  val parts = line.split(":")
  (parts(0), parts(1))
}.groupByKey().collect().foreach { case (key, values) =>
  println((key, values.size))
}
An error hit while running the job (a network connection between Spark processes timed out):
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Operation timed out: acbc32a57245.target.com/10.112.144.201:60678
Input file size: 320 MB.
Read locally, it is split at the default local block size of 32 MB, giving about 10 partitions (320 / 32 = 10).
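A quick way to confirm the split count (a minimal sketch; the file path is a placeholder):

// Read the file and check how many partitions Spark created.
val rdd = sc.textFile("/path/to/320mb-file.txt")
println(rdd.getNumPartitions)  // expect ~10 with the default 32 MB local block size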
groupByKey leads to:
1. a lot of shuffling, since every record crosses the network
2. only a few machines doing work, because parallelism is bounded by the number of distinct keys
3. a chance of memory spill, since all values for one key must be held together on one executor
What is the difference between groupByKey and reduceByKey?
With reduceByKey, local aggregation happens first, much like the local combiner in MapReduce:
each block is reduced locally, and only those partial results are shuffled to the reducers.
Don't use groupByKey; use reduceByKey instead (see the sketch below).
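A minimal side-by-side sketch (the input file and pair layout are assumptions, not from the original job):

val pairs = sc.textFile("input.txt").map(line => (line.split(":")(0), 1))

// groupByKey shuffles every (key, 1) record and counts after the shuffle.
val viaGroup = pairs.groupByKey().mapValues(_.size)

// reduceByKey sums within each partition first, so far less data is shuffled.
val viaReduce = pairs.reduceByKey(_ + _)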
What is the difference between reduce and reduceByKey?
reduce is an action: it returns a single value to the driver as a local variable.
reduceByKey is a transformation: it returns an RDD.
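For example (a small sketch with made-up data):

// action: returns a plain Int on the driver (55)
val total: Int = sc.parallelize(1 to 10).reduce(_ + _)

// transformation: returns another RDD, evaluated lazily
val sums = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).reduceByKey(_ + _)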
What is the difference between narrow and wide transformations?
narrow - no shuffling; each output partition depends on a single input partition (e.g. map, filter).
wide - requires a shuffle; an output partition depends on many input partitions (e.g. groupByKey, reduceByKey).
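Illustration (made-up data):

// narrow: map and filter run per partition, no data movement
val narrow = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)

// wide: reduceByKey must shuffle so that equal keys meet on one partition
val wide = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)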
What is a stage?
A stage is a set of tasks that can run without a shuffle; every wide transformation introduces a stage boundary.
Why is Spark lazy?
Laziness lets Spark look at the whole plan before running anything:
predicate pushdown: internally reorders the plan, pushing filters closer to the data source so less data flows through later steps.
pipelining: consecutive narrow transformations are combined into a single pass over the data (see the sketch below).
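A small sketch of the pipelining point (the file name is a placeholder):

// Nothing executes here: map and filter are only recorded in the lineage.
val errors = sc.textFile("logs.txt").map(_.toLowerCase).filter(_.contains("error"))

// The action triggers execution; map and filter are pipelined into one pass per partition.
val n = errors.count()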
Why are RDDs immutable?
To achieve resiliency: a lost partition can be rebuilt by replaying the lineage, which is only safe if the inputs never change.
countByValue vs reduceByKey:
If the operation is final and no further RDD operations are needed, use countByValue (an action returning a local Map).
If further RDD operations are needed, use reduceByKey (a transformation returning an RDD).
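For instance (made-up data):

// action: brings a Map[String, Long] back to the driver, the chain ends here
val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()

// transformation: stays distributed, so more RDD operations can follow
val frequent = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1)).reduceByKey(_ + _).filter(_._2 > 1)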
What is the difference between cache and persist?
cache always stores in memory only.
persist can use any storage level: memory, disk, or memory-and-disk.
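In code (cleanedRdd and bigRdd are illustrative names):

import org.apache.spark.storage.StorageLevel

cleanedRdd.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)

bigRdd.persist(StorageLevel.MEMORY_AND_DISK)  // spills to disk when memory is tight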
rdd1, rdd2, rdd3, rdd4 are a chain of transformations.
If, at the end of such a chain, all the data has been cleaned and we want to mark that point for reuse,
use rdd4.cache.
If it does not fit in memory, use persist with a disk-backed level instead.
rdd1, rdd2, rdd3, rdd4.cache, rdd5, rdd6, then action1 and action2:
action1: all transformations rdd1 through rdd6 run, and rdd4 is materialized into the cache along the way.
action2: only rdd4 (read from the cache) through rdd6 run; rdd1 to rdd3 are skipped.
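The same pattern as runnable code (the file path and the individual transformations are illustrative):

val rdd4 = sc.textFile("events.txt")   // rdd1
  .map(_.split(":"))                   // rdd2
  .filter(_.length >= 2)               // rdd3
  .map(a => (a(0), a(1)))              // rdd4
  .cache()                             // cleaned data, kept in memory

val rdd6 = rdd4.mapValues(_.length)    // rdd5
  .filter(_._2 > 0)                    // rdd6

rdd6.count()    // action1: runs rdd1..rdd6 and fills the cache
rdd6.collect()  // action2: starts from the cached rdd4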
======================================================================================================
cd /Users/z002qhl/Documents/Spark/spark-2.4.1-bin-hadoop2.7/bin
./spark-submit --class com.basan.day3.ErrorWarnCount /Users/z002qhl/Desktop/sparkDemo.jar