+++
title = "A practical example with Hadoop"
date = 2020-03-30T01:00:00+00:00
updated = 2020-04-18T13:25:43+00:00
+++

In our [previous Hadoop post](/blog/mdad/introduction-to-hadoop-and-its-mapreduce/), we learnt what it is, how it originated, and how it works, from a theoretical standpoint. Here we will instead focus on a more practical example with Hadoop.

This post will reproduce the example in Chapter 2 of the book [Hadoop: The Definitive Guide, Fourth Edition](http://www.hadoopbook.com/) ([pdf](http://grut-computing.com/HadoopBook.pdf), [code](http://www.hadoopbook.com/code.html)), that is, finding the maximum global temperature for a given year.

## Installation

Before we can run any piece of software, we must first download it to our computer. Head over to [Apache Hadoop's releases](http://hadoop.apache.org/releases.html) and download the [latest binary version](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz) at the time of writing (3.2.1).

We will be using the [Linux Mint](https://linuxmint.com/) distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution, such as [Ubuntu](https://ubuntu.com/).

Once the archive download is complete, extract it with any tool of your choice (graphical or from the terminal) and run the extracted `hadoop` binary to verify that it works. Make sure you have a version of Java installed, such as [OpenJDK](https://openjdk.java.net/).

Here are the three steps in the command line:

```sh
wget https://apache.brunneis.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
hadoop-3.2.1/bin/hadoop version
```

We will be using the two example data files that they provide in [their GitHub repository](https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all), although the full dataset is offered by the [National Climatic Data Center](https://www.ncdc.noaa.gov/) (NCDC).

We will also decompress and concatenate both files into a single text file to make them easier to work with. As a single command pipeline:

```sh
curl https://raw.githubusercontent.com/tomwhite/hadoop-book/master/input/ncdc/all/190{1,2}.gz | gunzip > 190x
```

This should create a `190x` text file in the current directory, which will be our input data.

## Processing data

To take advantage of Hadoop, we have to design our code to fit the MapReduce model. Both the map and the reduce phase take key-value pairs as input and produce key-value pairs as output, and each phase runs a programmer-defined function.

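As a toy sketch of the model in plain Java (no Hadoop involved, and the sample pairs below are made up for illustration), the map phase would emit (year, temperature) pairs, the shuffle would group them by year, and the reduce phase would keep the maximum per group:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceSketch {
    public static void main(String[] args) {
        // "Map" output: one (year, temperature) pair per input record.
        List<String[]> pairs = List.of(
                new String[] {"1901", "317"},
                new String[] {"1901", "42"},
                new String[] {"1902", "244"});

        // "Shuffle" groups pairs by key; "reduce" folds each group with max.
        Map<String, Integer> maxPerYear = pairs.stream()
                .collect(Collectors.toMap(
                        p -> p[0],
                        p -> Integer.parseInt(p[1]),
                        Math::max));

        System.out.println(maxPerYear); // per-year maxima (order may vary)
    }
}
```

Hadoop does the same thing, except that the "groups" can be spread across many machines.
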
We will use Java because it's a dependency that we already have anyway, so we might as well use it here too.

Our map function needs to extract the year and the air temperature, which prepares the data for later use (finding the maximum temperature for each year). We will also drop bad records here (those where the temperature is missing, suspect, or erroneous).

Copy or reproduce the following code in a file called `MaxTempMapper.java`, using any text editor of your choice:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int TEMP_MISSING = 9999;
    private static final String GOOD_QUALITY_RE = "[01459]";

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Extract the fixed-width fields we care about.
        String line = value.toString();
        String year = line.substring(15, 19);
        String temp = line.substring(87, 92).replaceAll("^\\+", "");
        String quality = line.substring(92, 93);

        // Only emit (year, temperature) pairs for good records.
        int airTemperature = Integer.parseInt(temp);
        if (airTemperature != TEMP_MISSING && quality.matches(GOOD_QUALITY_RE)) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
```
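
To sanity-check the substring offsets without running Hadoop at all, we can exercise the same extraction logic on a synthetic record (this line is fabricated for illustration; real NCDC records are longer):

```java
public class MaxTempMapperCheck {
    public static void main(String[] args) {
        // Fabricate a minimal 93-character record: a run of zeros with the
        // year at columns 15-18, the signed temperature (tenths of a degree
        // Celsius) at columns 87-91, and the quality code at column 92.
        StringBuilder record = new StringBuilder("0".repeat(93));
        record.replace(15, 19, "1901");
        record.replace(87, 92, "+0317");
        record.replace(92, 93, "1");
        String line = record.toString();

        // Same extraction logic as MaxTempMapper.map.
        String year = line.substring(15, 19);
        String temp = line.substring(87, 92).replaceAll("^\\+", "");
        String quality = line.substring(92, 93);

        System.out.println(year + " -> " + Integer.parseInt(temp)
                + " (quality " + quality + ")"); // prints 1901 -> 317 (quality 1)
    }
}
```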

Now, let's create the `MaxTempReducer.java` file. Its job is to reduce the data from multiple values into just one. We do that by keeping the maximum out of all the values we receive:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        Iterator<IntWritable> iter = values.iterator();
        if (iter.hasNext()) {
            // Seed the maximum with the first value, then fold in the rest.
            int maxValue = iter.next().get();
            while (iter.hasNext()) {
                maxValue = Math.max(maxValue, iter.next().get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }
}
```

Except for some Java weirdness (…why can't we just iterate over an `Iterator`? Or why can't we just manually call `next()` on an `Iterable`?), our code is correct. There can't be a maximum if there are no elements, and we want to avoid dummy values such as `Integer.MIN_VALUE`.
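
As a standalone illustration of that guard (plain Java, not Hadoop's API), the same pattern maps naturally onto `Optional`: no elements, no maximum:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

public class MaxOfIterable {
    // The first element seeds the running maximum; an empty input yields an
    // empty Optional instead of a sentinel like Integer.MIN_VALUE.
    static Optional<Integer> max(Iterable<Integer> values) {
        Iterator<Integer> iter = values.iterator();
        if (!iter.hasNext()) {
            return Optional.empty();
        }
        int maxValue = iter.next();
        while (iter.hasNext()) {
            maxValue = Math.max(maxValue, iter.next());
        }
        return Optional.of(maxValue);
    }

    public static void main(String[] args) {
        System.out.println(max(List.of(317, 42, 244))); // Optional[317]
        System.out.println(max(List.of()));             // Optional.empty
    }
}
```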

We can also take a moment to appreciate how absolutely tiny this code is, and it's Java! Hadoop's API is really awesome and lets us write such concise code to achieve what we need.

Lastly, let's write the `main` method, or else we won't be able to run our program. In our new file `MaxTemp.java`:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemp {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java MaxTemp <input path> <output path>");
            System.exit(-1);
        }

        Job job = Job.getInstance();

        job.setJobName("Max temperature");
        job.setJarByClass(MaxTemp.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}
```

And compile by including the required `.jar` dependencies in Java's classpath with the `-cp` switch:

```sh
javac -cp "hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*" *.java
```

At last, we can run it (also specifying the dependencies in the classpath; this one's a mouthful):

```sh
java -cp ".:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*" MaxTemp 190x results
```

Hooray! We should have a new `results/` folder with the following files:

```
$ ls results
part-r-00000  _SUCCESS
$ cat results/part-r-00000
1901	317
1902	244
```

It worked! This example was obviously tiny, but hopefully it was enough to demonstrate how to get the basics running on real-world data.