MapReduce Fundamentals
MapReduce is Hadoop's built-in programming model for processing large datasets in parallel across a cluster. A job runs in two user-defined phases, Map and Reduce, with a framework-managed Shuffle & Sort step between them.
The MapReduce Model
Input Data (HDFS blocks)
|
v
Mapper (runs on each block locally) — emits (key, value) pairs
|
v
Shuffle & Sort (framework groups values by key)
|
v
Reducer (aggregates values per key)
|
v
Output Data (written back to HDFS)
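To make the flow concrete, here is a hand trace of the word count computation from the next section over a hypothetical two-line input:

Input:    "apache hadoop"          "hadoop runs"
Map:      (apache,1) (hadoop,1)    (hadoop,1) (runs,1)
Shuffle:  apache -> [1]   hadoop -> [1,1]   runs -> [1]
Reduce:   (apache,1)  (hadoop,2)  (runs,1)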
Classic Example: Word Count
Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // With the default TextInputFormat, key is the line's byte offset
    // and value is one line of input text.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one); // emit (word, 1)
    }
  }
}
Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // values holds every count emitted for this word across all mappers.
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum)); // emit (word, total)
  }
}
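Neither class runs by itself: a small driver configures the job and submits it. Below is a minimal driver sketch following the standard Hadoop wiring; the class name WordCount and the use of command-line arguments for the paths are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class for this example.
public class WordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);         // tells Hadoop which jar to ship
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);          // reducer output key type
    job.setOutputValueClass(IntWritable.class); // reducer output value type
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1); // block until the job finishes
  }
}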
Running the Job
# Create an input directory in HDFS and upload the local text files
hdfs dfs -mkdir -p /input
hdfs dfs -put *.txt /input/

# Run the bundled word count example. The /output directory must not
# already exist; MapReduce refuses to overwrite an existing output path.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /input /output

# Print the reducer output (one part-r-* file per reducer)
hdfs dfs -cat /output/part-r-00000
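Each line in a part-r-* file is a tab-separated key/value pair. For the hypothetical two-line input traced earlier, the output would read:

apache	1
hadoop	2
runs	1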
Key Concepts
| Concept | Description |
|---|---|
| InputSplit | A logical chunk of input assigned to one Mapper |
| Combiner | Optional mini-Reducer that runs after Map to reduce shuffle data |
| Partitioner | Determines which Reducer receives each key (see the sketch below) |
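By default Hadoop uses HashPartitioner, which routes each key by its hashCode() modulo the number of reducers. Below is a minimal sketch of a custom partitioner; the class name AlphabetPartitioner and the a-m split are hypothetical.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words starting with a-m go to reducer 0,
// everything else to reducer 1. Assumes job.setNumReduceTasks(2).
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions < 2) return 0; // single reducer: nothing to decide
    char first = Character.toLowerCase(key.toString().charAt(0));
    return (first >= 'a' && first <= 'm') ? 0 : 1;
  }
}

Register it in the driver with job.setPartitionerClass(AlphabetPartitioner.class) alongside job.setNumReduceTasks(2).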
Combiner Optimization
Because integer addition is associative and commutative, the reducer can double as a combiner: it pre-sums counts on each mapper's local output before the shuffle, shrinking the data sent over the network. The framework may run a combiner zero, one, or several times per map task, so it must never change the final result.
job.setCombinerClass(IntSumReducer.class);
Monitoring Jobs
mapred job -list              # list jobs that are currently running
mapred job -status <job_id>   # show completion percentage and counters
mapred job -kill <job_id>     # terminate a running job
Next Steps
See YARN & Resource Management to understand how Hadoop schedules and manages jobs.