output of all mappers will be moved to reducer machine. This data movement activity is
called as shuffling.
Mappers will run in parallel.
Reducer code runs on the output of each mapper output.
Number of mappers will be equal to number of blocks, but simultaneously there could be number of mappers
running equals to number of nodes.
ex : 1 Gb file, 2 machine spawns with 128MB block size, 8 mappers but there will be only 2 active mappers will
run in parallel.
Official MR paper is having example of word count example. Mapper and reducer can work only on key, value pairs
As Mapper can understand only key, value pair, we need something to convert file content to
key value pair, i,e RecordReader.
RecordReader : will convert each line to key value pair. It will add dummy key for the line.
RecordReader will work on block.
Assumbe below is the fileContent
Hello how are you
hi i am good
test MR
||
||
Record reader converts it to
||
||
(0,Hello how are you)
(1,Hello i am good)
(2,test MR)
||
||
This output will be fed to mapper.
||
||
(Hello,1)
(how,1)
(are,1)
(Hello,1)
(i,1)
(am,1)
(good,1)
(test,1)
(MR,1)
Moving data from mappers node to reducer node is called as shuffle process.
||
||
(Hello,1)
(how,1)
(are,1)
(Hello,1)
(i,1)
(am,1)
(good,1)
(test,1)
(MR,1)
||
||
sorts based on key
(are,1)
(am,1)
(good,1)
(Hello,{1,1})
(how,1)
(MR,1)
Shuffling and sorting will be done by MR framework.
This output will be fed to Reducer. Reducer will also run in the datanode.
Named node decides which data node can work as reducer, it can be one of the data node
or new node.
To optmise do most of the operations in mapper, reducer should be doing very small work.
Can we increase the number of reducers more than one?
yes we can do it. By default number of reducers will be 1. but can be increased.
Number of reducers can be 0 as well.
Shuffling and sorting will come to picture only if Reducer is there.
================================================================================
called as shuffling.
Mappers will run in parallel.
Reducer code runs on the output of each mapper output.
Number of mappers will be equal to number of blocks, but simultaneously there could be number of mappers
running equals to number of nodes.
ex : 1 Gb file, 2 machine spawns with 128MB block size, 8 mappers but there will be only 2 active mappers will
run in parallel.
Official MR paper is having example of word count example. Mapper and reducer can work only on key, value pairs
As Mapper can understand only key, value pair, we need something to convert file content to
key value pair, i,e RecordReader.
RecordReader : will convert each line to key value pair. It will add dummy key for the line.
RecordReader will work on block.
Assumbe below is the fileContent
Hello how are you
hi i am good
test MR
||
||
Record reader converts it to
||
||
(0,Hello how are you)
(1,Hello i am good)
(2,test MR)
||
||
This output will be fed to mapper.
||
||
(Hello,1)
(how,1)
(are,1)
(Hello,1)
(i,1)
(am,1)
(good,1)
(test,1)
(MR,1)
Moving data from mappers node to reducer node is called as shuffle process.
||
||
(Hello,1)
(how,1)
(are,1)
(Hello,1)
(i,1)
(am,1)
(good,1)
(test,1)
(MR,1)
||
||
sorts based on key
(are,1)
(am,1)
(good,1)
(Hello,{1,1})
(how,1)
(MR,1)
Shuffling and sorting will be done by MR framework.
This output will be fed to Reducer. Reducer will also run in the datanode.
Named node decides which data node can work as reducer, it can be one of the data node
or new node.
To optmise do most of the operations in mapper, reducer should be doing very small work.
Can we increase the number of reducers more than one?
yes we can do it. By default number of reducers will be 1. but can be increased.
Number of reducers can be 0 as well.
Shuffling and sorting will come to picture only if Reducer is there.
================================================================================
No comments:
Post a Comment