
Friday, November 15, 2019

MR with Cloudera

Download VirtualBox
Download the Cloudera QuickStart VM

Launch the OVF file so that the setup is done.

MR code

public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

}


LongWritable - input key to the Mapper (the byte offset of the line in the file)
first Text - input value (one line of text)

second Text - output key from the mapper
IntWritable - output value from the mapper

The map method splits each input line into words and emits a (word, 1) pair for each one:
public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

Context - processing results are written to the context object.
LongWritable key - input key
Text value - input value
context.write(new Text(word), new IntWritable(1))
new Text, new IntWritable - wrap the plain Java values in Hadoop's Writable types so the framework can serialize them.
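The mapper's per-line logic can be sketched in plain Java outside Hadoop, to see exactly what pairs get emitted (MapSimulation and its map helper are made-up names for illustration, not part of the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

public class MapSimulation {
    // Mirrors SampleMapper: one (word, 1) pair per space-separated token.
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Each occurrence of a word produces its own pair; counting
        // happens later, in the reducer.
        System.out.println(map("hello world hello"));
    }
}
```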


Reduce will be called after the shuffle and sort phase.
public class SampleReducer extends Reducer<Text, IntWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
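The shuffle/sort phase groups the mapper's pairs by key, and the reducer then sums each group. The combined effect can be sketched in plain Java (ReduceSimulation is a hypothetical name, not part of Hadoop):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSimulation {
    // Group words by key (TreeMap keeps keys sorted, like shuffle/sort)
    // and sum the counts per key, as SampleReducer does.
    static Map<String, Long> wordCount(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello", "world", "hello")));
    }
}
```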

Main code to invoke these MR classes:
public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // configure and submit the Job here
        return 0;
    }
}


job.setOutputKeyClass(Text.class); - mention the data type of the reducer's output key
job.setOutputValueClass(LongWritable.class); - data type of the reducer's output value


First we have to configure the reducer output.

If the mapper's output key or value type is the same as the reducer's, it need not be mentioned; only a type that differs must be configured explicitly:
    job.setMapOutputValueClass(IntWritable.class);

Mapper and reducer classes are configured as below:
    job.setMapperClass(SampleMapper.class);
    job.setReducerClass(SampleReducer.class);


Input can be a file or a directory in MR. However, the output will always be a folder.
    FileInputFormat.addInputPath(job, inputFilePath);
    FileOutputFormat.setOutputPath(job, outputFilePath);


As a safeguard, the framework does not allow an existing output folder to be overwritten; if it already exists, the job fails.
Main is the entry point, which acts as the template.
By changing the input and output format classes via job.setInputFormatClass(...) and job.setOutputFormatClass(...) we can have custom formats supported.
By default, TextInputFormat and TextOutputFormat are the input and output formats.
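Putting the configuration calls described here together, the run() method could look like the sketch below. This assumes the SampleMapper/SampleReducer classes from earlier; the "word count" job name is arbitrary, and the input/output paths are taken from the command line:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(Main.class);

        job.setMapperClass(SampleMapper.class);
        job.setReducerClass(SampleReducer.class);

        // Reducer output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Mapper output value (IntWritable) differs from the reducer's,
        // so it must be set explicitly
        job.setMapOutputValueClass(IntWritable.class);

        // Input may be a file or directory; output must be a
        // non-existent folder
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Main(), args));
    }
}
```

This is a configuration sketch only; it needs a Hadoop cluster (or the Cloudera VM) to actually run.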



_SUCCESS will be an empty marker file, written when the job completes successfully.


Text - is Hadoop's Writable string type (org.apache.hadoop.io.Text), not java.lang.String.

[cloudera@quickstart Desktop]$ rm -rf mapreduce_output/
[cloudera@quickstart Desktop]$ cd mapreduce_output/
[cloudera@quickstart mapreduce_output]$ ls
part-r-00000  part-r-00001  _SUCCESS
[cloudera@quickstart mapreduce_output]$ cat *
basan 1
repeat 2
you 2
are 1
hello 2
how 1
world 1
[cloudera@quickstart mapreduce_output]$
cat * can be used to see the contents of all files in the folder.

If we pass 0 as the number of reducers (job.setNumReduceTasks(0)), the output of the mappers becomes the final product.
