git rekt — gemini-redirect.git (299dad40a464132c708bff0aa139c6ddebc55409): blog/mdad/a-practical-example-with-hadoop/index.html

  1<!DOCTYPE html>
  2<html>
  3<head>
  4<meta charset="utf-8" />
  5<meta name="viewport" content="width=device-width, initial-scale=1" />
  6<title>A practical example with Hadoop</title>
  7<link rel="stylesheet" href="../css/style.css">
  8</head>
  9<body>
 10<main>
 11<p>In our <a href="/blog/mdad/introduction-to-hadoop-and-its-mapreduce/">previous Hadoop post</a>, we learnt what it is, how it originated, and how it works, from a theoretical standpoint. Here we will instead focus on a more practical example with Hadoop.</p>
 12<div class="date-created-modified">Created 2020-03-30<br>
 13Modified 2020-04-18</div>
 14<p>This post will reproduce the example in Chapter 2 of the book <a href="http://www.hadoopbook.com/">Hadoop: The Definitive Guide, Fourth Edition</a> (<a href="http://grut-computing.com/HadoopBook.pdf">pdf</a>, <a href="http://www.hadoopbook.com/code.html">code</a>), that is, finding the maximum temperature recorded worldwide in a given year.</p>
 15<h2 class="title" id="installation"><a class="anchor" href="#installation">¶</a>Installation</h2>
 16<p>Before we can run Hadoop, we need to download it. Head over to <a href="http://hadoop.apache.org/releases.html">Apache Hadoop’s releases</a> and download the <a href="https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz">latest binary version</a> at the time of writing (3.2.1).</p>
 17<p>We will be using the <a href="https://linuxmint.com/">Linux Mint</a> distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution such as <a href="https://ubuntu.com/">Ubuntu</a>.</p>
 18<p>Once the download is complete, extract the archive with any tool of your choice (graphical or using the terminal) and run the <code>hadoop</code> binary inside it to check that it works. Make sure you have a version of Java installed, such as <a href="https://openjdk.java.net/">OpenJDK</a>.</p>
 19<p>Here are all three steps in the command line:</p>
 20<pre><code>wget https://apache.brunneis.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
 21tar xf hadoop-3.2.1.tar.gz
 22hadoop-3.2.1/bin/hadoop version
 23</code></pre>
 24<p>We will be using the two example data files that the book provides in <a href="https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all">its GitHub repository</a>, although the full dataset is offered by the <a href="https://www.ncdc.noaa.gov/">National Climatic Data Center</a> (NCDC).</p>
 25<p>We will also unzip and concatenate both files into a single text file, to make it easier to work with. As a single command pipeline:</p>
 26<pre><code>curl https://raw.githubusercontent.com/tomwhite/hadoop-book/master/input/ncdc/all/190{1,2}.gz | gunzip &gt; 190x
 27</code></pre>
 28<p>This should create a <code>190x</code> text file in the current directory, which will be our input data.</p>
 29<h2 id="processing_data"><a class="anchor" href="#processing_data">¶</a>Processing data</h2>
 30<p>To take advantage of Hadoop, we have to design our code to work in the MapReduce model. Both the map and the reduce phases take key-value pairs as input and produce key-value pairs as output, and each one runs a programmer-defined function.</p>
 31<p>We will use Java, because it’s a dependency that we already have anyway, so might as well.</p>
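<p>To make the model concrete before bringing Hadoop in, here is a minimal sketch of the three steps (map, group by key, reduce) in plain Java. The class name <code>MapReduceSketch</code>, its method names and the <code>"year temperature"</code> record format are assumptions made up for this illustration; none of it is Hadoop's API, and the readings are illustrative rather than taken from the NCDC files.</p>

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map: one input record ("year temperature") becomes one key-value pair.
    public static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(" ");
        return new SimpleEntry<>(fields[0], Integer.parseInt(fields[1]));
    }

    // Shuffle: group every value emitted by the map phase under its key.
    // In Hadoop the framework does this step for us.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: all the values of one key collapse into a single value (the maximum).
    // A key only exists if at least one value was emitted for it, so get(0) is safe here.
    public static int reduce(List<Integer> values) {
        int max = values.get(0);
        for (int value : values) {
            max = Math.max(max, value);
        }
        return max;
    }

    // Runs the three steps in order over an in-memory list of records.
    public static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative readings only.
        List<String> records = Arrays.asList("1950 0", "1950 22", "1949 111", "1950 -11", "1949 78");
        System.out.println(run(records)); // prints {1949=111, 1950=22}
    }
}
```

<p>Hadoop's value is that the map and reduce steps can run on many machines at once; the sketch only shows the shape of the data flowing between them.</p>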
 32<p>Our map function needs to extract the year and air temperature, which will prepare the data for later use (finding the maximum temperature for each year). We will also drop bad records here (if the temperature is missing, suspect or erroneous).</p>
 33<p>Copy or reproduce the following code in a file called <code>MaxTempMapper.java</code>, using any text editor of your choice:</p>
 34<pre><code>import java.io.IOException;
 35
 36import org.apache.hadoop.io.IntWritable;
 37import org.apache.hadoop.io.LongWritable;
 38import org.apache.hadoop.io.Text;
 39import org.apache.hadoop.mapreduce.Mapper;
 40
 41public class MaxTempMapper extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
 42    private static final int TEMP_MISSING = 9999;
 43    private static final String GOOD_QUALITY_RE = &quot;[01459]&quot;;
 44
 45    @Override
 46    public void map(LongWritable key, Text value, Context context)
 47            throws IOException, InterruptedException {
 48        String line = value.toString();
 49        String year = line.substring(15, 19);
 50        String temp = line.substring(87, 92).replaceAll(&quot;^\\+&quot;, &quot;&quot;);
 51        String quality = line.substring(92, 93);
 52
 53        int airTemperature = Integer.parseInt(temp);
 54        if (airTemperature != TEMP_MISSING &amp;&amp; quality.matches(GOOD_QUALITY_RE)) {
 55            context.write(new Text(year), new IntWritable(airTemperature));
 56        }
 57    }
 58}
 59</code></pre>
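<p>The fixed offsets in the mapper are easy to get wrong, so it can help to try the parsing on its own, without Hadoop. The sketch below mirrors the mapper's logic in plain Java. The helper names (<code>parseYear</code>, <code>parseTemperature</code>, <code>fakeRecord</code>) are hypothetical, and the records are synthetic: zero padding with only the three fields the mapper reads filled in, not real NCDC lines.</p>

```java
public class RecordParsingSketch {
    private static final int TEMP_MISSING = 9999;
    private static final String GOOD_QUALITY_RE = "[01459]";

    // Same offsets as the mapper: the year lives in columns 15..18.
    public static String parseYear(String line) {
        return line.substring(15, 19);
    }

    // Returns the temperature, or null when the record should be dropped
    // (missing reading, or a quality code outside 0, 1, 4, 5, 9).
    public static Integer parseTemperature(String line) {
        // Strip a leading '+', as the mapper does, before parsing the number.
        String temp = line.substring(87, 92).replaceAll("^\\+", "");
        String quality = line.substring(92, 93);
        int airTemperature = Integer.parseInt(temp);
        if (airTemperature != TEMP_MISSING && quality.matches(GOOD_QUALITY_RE)) {
            return airTemperature;
        }
        return null;
    }

    // Builds a synthetic 93-character record (assumption: everything outside
    // the three fields read above is irrelevant here and filled with zeros).
    public static String fakeRecord(String year, String signedTemp, char quality) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, year);       // 4 characters
        sb.replace(87, 92, signedTemp); // sign plus 4 digits
        sb.setCharAt(92, quality);
        return sb.toString();
    }

    public static void main(String[] args) {
        String good = fakeRecord("1901", "-0078", '1');
        String missing = fakeRecord("1901", "+9999", '1');
        String suspect = fakeRecord("1902", "+0244", '2');

        System.out.println(parseYear(good) + " " + parseTemperature(good)); // prints 1901 -78
        System.out.println(parseTemperature(missing)); // prints null
        System.out.println(parseTemperature(suspect)); // prints null
    }
}
```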
 60<p>Now, let’s create the <code>MaxTempReducer.java</code> file. Its job is to reduce the data from multiple values into just one. We do that by keeping the maximum out of all the values we receive:</p>
 61<pre><code>import java.io.IOException;
 62import java.util.Iterator;
 63
 64import org.apache.hadoop.io.IntWritable;
 65import org.apache.hadoop.io.Text;
 66import org.apache.hadoop.mapreduce.Reducer;
 67
 68public class MaxTempReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
 69    @Override
 70    public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)
 71            throws IOException, InterruptedException {
 72        Iterator&lt;IntWritable&gt; iter = values.iterator();
 73        if (iter.hasNext()) {
 74            int maxValue = iter.next().get();
 75            while (iter.hasNext()) {
 76                maxValue = Math.max(maxValue, iter.next().get());
 77            }
 78            context.write(key, new IntWritable(maxValue));
 79        }
 80    }
 81}
 82</code></pre>
 83<p>Except for some Java weirdness (…why can’t we just iterate over an <code>Iterator</code>? Or why can’t we just manually call <code>next()</code> on an <code>Iterable</code>?), our code is correct. We seed <code>maxValue</code> with the first element and write nothing when there is none: there can’t be a maximum if there are no elements, and we want to avoid dummy values such as <code>Integer.MIN_VALUE</code>.</p>
 84<p>We can also take a moment to appreciate how absolutely tiny this code is, and it’s Java! Hadoop’s API is really awesome and lets us write such concise code to achieve what we need.</p>
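<p>On the <code>Iterator</code> question from the reducer: Java's for-each loop accepts an <code>Iterable</code> (it calls <code>iterator()</code> behind the scenes), so the same maximum can be computed without handling the <code>Iterator</code> explicitly. The sketch below is one possible alternative in plain Java, not the book's code. It uses <code>Integer</code> instead of <code>IntWritable</code> to stay self-contained, and <code>OptionalInt</code> to represent "no maximum yet" instead of a dummy value; the class name <code>MaxSketch</code> is made up for the example.</p>

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.OptionalInt;

public class MaxSketch {
    // Returns the largest value, or an empty OptionalInt when there are no values.
    public static OptionalInt max(Iterable<Integer> values) {
        OptionalInt max = OptionalInt.empty();
        // for-each works on any Iterable; no explicit Iterator needed.
        for (int value : values) {
            max = max.isPresent()
                    ? OptionalInt.of(Math.max(max.getAsInt(), value))
                    : OptionalInt.of(value);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(max(Arrays.asList(317, 244, -78)).getAsInt());      // prints 317
        System.out.println(max(Collections.<Integer>emptyList()).isPresent()); // prints false
    }
}
```

<p>Whether this reads better than the explicit <code>Iterator</code> is a matter of taste; both avoid seeding the maximum with <code>Integer.MIN_VALUE</code>.</p>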
 85<p>Last, let’s write the <code>main</code> method, or else we won’t be able to run it. In our new file <code>MaxTemp.java</code>:</p>
 86<pre><code>import org.apache.hadoop.fs.Path;
 87import org.apache.hadoop.io.IntWritable;
 88import org.apache.hadoop.io.Text;
 89import org.apache.hadoop.mapreduce.Job;
 90import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 91import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 92
 93public class MaxTemp {
 94    public static void main(String[] args) throws Exception {
 95        if (args.length != 2) {
 96            System.err.println(&quot;usage: java MaxTemp &lt;input path&gt; &lt;output path&gt;&quot;);
 97            System.exit(-1);
 98        }
 99
100        Job job = Job.getInstance();
101
102        job.setJobName(&quot;Max temperature&quot;);
103        job.setJarByClass(MaxTemp.class);
104        job.setMapperClass(MaxTempMapper.class);
105        job.setReducerClass(MaxTempReducer.class);
106        job.setOutputKeyClass(Text.class);
107        job.setOutputValueClass(IntWritable.class);
108
109        FileInputFormat.addInputPath(job, new Path(args[0]));
110        FileOutputFormat.setOutputPath(job, new Path(args[1]));
111
112        boolean result = job.waitForCompletion(true);
113
114        System.exit(result ? 0 : 1);
115    }
116}
117</code></pre>
118<p>Now compile the three files, including the required <code>.jar</code> dependencies in Java’s classpath with the <code>-cp</code> switch:</p>
119<pre><code>javac -cp &quot;hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*&quot; *.java
120</code></pre>
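121<p>At last, we can run it (also specifying the dependencies in the classpath, this one’s a mouthful). The output directory (<code>results</code> here) must not exist yet, or Hadoop will refuse to run the job:</p>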
122<pre><code>java -cp &quot;.:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*&quot; MaxTemp 190x results
123</code></pre>
124<p>Hooray! We should have a new <code>results/</code> folder containing the following files:</p>
125<pre><code>$ ls results
126part-r-00000  _SUCCESS
127$ cat results/part-r-00000 
1281901	317
1291902	244
130</code></pre>
131<p>It worked! This example was obviously tiny, but hopefully it was enough to demonstrate how to get the basics running on real-world data.</p>
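<p>One detail about reading the output: according to the book, the temperatures in these records are stored in tenths of a degree Celsius, so <code>317</code> stands for 31.7 degrees. A small sketch of that conversion, assuming the tab-separated <code>year&lt;TAB&gt;value</code> layout shown in <code>part-r-00000</code> (the class and method names are hypothetical):</p>

```java
import java.util.Locale;

public class ResultsSketch {
    // Turns one "year<TAB>temperature" output line into a readable string.
    // Assumes the temperature is an integer number of tenths of a degree Celsius.
    public static String describe(String line) {
        String[] fields = line.split("\t");
        double celsius = Integer.parseInt(fields[1]) / 10.0;
        // Locale.ROOT keeps the decimal separator a dot regardless of system settings.
        return String.format(Locale.ROOT, "%s: %.1f C", fields[0], celsius);
    }

    public static void main(String[] args) {
        System.out.println(describe("1901\t317")); // prints 1901: 31.7 C
        System.out.println(describe("1902\t244")); // prints 1902: 24.4 C
    }
}
```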
132</main>
133</body>
134</html>
135