
Friday, November 15, 2019

MR with Cloudera

Download VirtualBox
Download the Cloudera QuickStart VM

Launch the OVF file so that the setup is done.

MR code

public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

}


LongWritable - input key to the Mapper (the byte offset of the line in the file)
first Text - input value (one line of text)

second Text - output key from the mapper
IntWritable - output value from the mapper

The map method splits each input line into words and emits a (word, 1) pair for each one:
public class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

Context - processing results are written to the context object.
LongWritable key - input key
Text value - input value
context.write(new Text(word), new IntWritable(1))
new Text, new IntWritable - wrap the plain Java values in Hadoop's Writable types so the framework can serialize them.
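The mapper's per-line logic can be sketched in plain Java outside Hadoop, to see exactly what pairs get emitted (MapSimulation and its map helper are made-up names for illustration, not part of the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

public class MapSimulation {
    // Mirrors SampleMapper: one (word, 1) pair per space-separated token.
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Each occurrence of a word produces its own pair; counting
        // happens later, in the reducer.
        System.out.println(map("hello world hello"));
    }
}
```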


Reduce will be called after the shuffle and sort phase.
public class SampleReducer extends Reducer<Text, IntWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
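The shuffle/sort phase groups the mapper's pairs by key, and the reducer then sums each group. The combined effect can be sketched in plain Java (ReduceSimulation is a hypothetical name, not part of Hadoop):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSimulation {
    // Group words by key (TreeMap keeps keys sorted, like shuffle/sort)
    // and sum the counts per key, as SampleReducer does.
    static Map<String, Long> wordCount(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello", "world", "hello")));
    }
}
```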

Main code to invoke these MR classes:
public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // configure and submit the Job here
        return 0;
    }
}


job.setOutputKeyClass(Text.class); - mention the data type of the reducer's output key
job.setOutputValueClass(LongWritable.class); - data type of the reducer's output value


First we have to configure the reducer output.

If the mapper's output key or value type is the same as the reducer's, it need not be mentioned; only a type that differs must be configured explicitly:
    job.setMapOutputValueClass(IntWritable.class);

Mapper and reducer classes are configured as below:
    job.setMapperClass(SampleMapper.class);
    job.setReducerClass(SampleReducer.class);


Input can be a file or a directory in MR. However, the output will always be a folder.
    FileInputFormat.addInputPath(job, inputFilePath);
    FileOutputFormat.setOutputPath(job, outputFilePath);


As a safeguard, the framework does not allow an existing output folder to be overwritten; if it already exists, the job fails.
Main is the entry point, which acts as the template.
By changing the input and output format classes via job.setInputFormatClass(...) and job.setOutputFormatClass(...) we can have custom formats supported.
By default, TextInputFormat and TextOutputFormat are the input and output formats.
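Putting the configuration calls described here together, the run() method could look like the sketch below. This assumes the SampleMapper/SampleReducer classes from earlier; the "word count" job name is arbitrary, and the input/output paths are taken from the command line:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "word count");
        job.setJarByClass(Main.class);

        job.setMapperClass(SampleMapper.class);
        job.setReducerClass(SampleReducer.class);

        // Reducer output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        // Mapper output value (IntWritable) differs from the reducer's,
        // so it must be set explicitly
        job.setMapOutputValueClass(IntWritable.class);

        // Input may be a file or directory; output must be a
        // non-existent folder
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Main(), args));
    }
}
```

This is a configuration sketch only; it needs a Hadoop cluster (or the Cloudera VM) to actually run.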



_SUCCESS will be an empty marker file, written when the job completes successfully.


Text - is Hadoop's Writable string type (org.apache.hadoop.io.Text), not java.lang.String.

[cloudera@quickstart Desktop]$ rm -rf mapreduce_output/
[cloudera@quickstart Desktop]$ cd mapreduce_output/
[cloudera@quickstart mapreduce_output]$ ls
part-r-00000  part-r-00001  _SUCCESS
[cloudera@quickstart mapreduce_output]$ cat *
basan 1
repeat 2
you 2
are 1
hello 2
how 1
world 1
[cloudera@quickstart mapreduce_output]$
cat * can be used to see the contents of all files in the folder.

If we pass 0 as the number of reducers (job.setNumReduceTasks(0)), the output of the mappers becomes the final product.
