all repos — gemini-redirect @ 4e76ffb372acbefd1eed19ac74caf7f564084240

content/ribw/a-practical-example-with-hadoop/post.md

```meta
title: A practical example with Hadoop
published: 2020-04-01T02:00:00+00:00
updated: 2020-04-03T08:43:41+00:00
```

In our [previous Hadoop post](/blog/ribw/introduction-to-hadoop-and-its-mapreduce/), we learnt what it is, how it originated, and how it works, from a theoretical standpoint. Here we will instead focus on a more practical example with Hadoop.
  8
This post will showcase my own implementation of a word counter for any plain text document that you want to analyze.
 10
## Installation
 12
Before we can run any piece of software, we must first download it to our computers. Head over to [Apache Hadoop’s releases](http://hadoop.apache.org/releases.html) and download the [latest binary version](https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz), which at the time of writing is 3.2.1.
 14
We will be using the [Linux Mint](https://linuxmint.com/) distribution because I love its simplicity, although the process shown here should work just fine on any similar Linux distribution such as [Ubuntu](https://ubuntu.com/).
 16
Once the archive download is complete, extract it with any tool of your choice (graphical, or in the terminal) and run the extracted binary to check that it works. Make sure you have a version of Java installed, such as [OpenJDK](https://openjdk.java.net/), since Hadoop needs it.
 18
Here are all three steps on the command line:
 20
```
wget https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
hadoop-3.2.1/bin/hadoop version
```
 26
## Processing data
 28
To take advantage of Hadoop, we have to design our code to work in the MapReduce model. Both the map and reduce phases work on key-value pairs as input and output, and both run a programmer-defined function.
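If the model still feels abstract, here is the whole flow condensed into plain Java (no Hadoop involved; the class and method names are mine, just for illustration): map emits a `(word, 1)` pair per word, a shuffle step groups the pairs by key, and reduce sums each group:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the MapReduce data flow for word counting.
// Hadoop performs the grouping ("shuffle") for us; here we do it by hand.
public class MapReduceSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each line becomes a list of (word, 1) pairs.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W")) {
                pairs.add(Map.entry(word, 1));
            }
        }
        // Shuffle: group the pairs by key (Hadoop does this between the phases).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce phase: sum the values of each group.
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) sum += value;
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hello Hadoop", "hello world")));
    }
}
```

The real thing below follows exactly this shape, except Hadoop distributes the map and reduce calls across workers and handles the shuffle for us.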
 30
We will use Java, because it’s a dependency that we already have anyway, so might as well.
 32
Our map function needs to split each of the lines we receive as input into words, and we will also convert them to lowercase, thus preparing the data for later use (counting words). There won’t be bad records, so we don’t have to worry about that.
 34
Copy or reproduce the following code in a file called `WordCountMapper.java`, using any text editor of your choice:
 36
```
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit a (word, 1) pair for every word in the line, lowercased.
        for (String word : value.toString().split("\\W")) {
            context.write(new Text(word.toLowerCase()), new IntWritable(1));
        }
    }
}
```
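One detail worth calling out: `split("\\W")` cuts on every non-word character, so two separators in a row (say, a comma followed by a space) leave an empty token behind, and the mapper happily counts it. A quick plain-Java check (the class name is my own, just for illustration):

```java
import java.util.Arrays;

// Demonstrates how the mapper tokenizes: split("\\W") cuts on every
// non-word character, so consecutive separators produce empty tokens.
public class SplitDemo {
    public static void main(String[] args) {
        String[] words = "Hello, world".toLowerCase().split("\\W");
        System.out.println(Arrays.toString(words));  // [hello, , world]
    }
}
```

That stray empty string is why the final results further down include a count for the empty word. Skipping words with `word.isEmpty()` inside the mapper’s loop would clean it up.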
 55
Now, let’s create the `WordCountReducer.java` file. Its job is to reduce the data from multiple values into just one. We do that by summing all the values (our word count so far):
 57
```
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum up all the 1s the mappers emitted for this word.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
```
 78
Let’s just take a moment to appreciate how absolutely tiny this code is, and it’s Java! Hadoop’s API is really awesome and lets us write such concise code to achieve what we need.
 80
Lastly, let’s write the `main` method, or else we won’t be able to run the job. In a new file, `WordCount.java`:
 82
```
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java WordCount <input path> <output path>");
            System.exit(-1);
        }

        Job job = Job.getInstance();

        job.setJobName("Word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}
```
116
And compile by including the required `.jar` dependencies in Java’s classpath with the `-cp` switch:
118
```
javac -cp "hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/mapreduce/*" *.java
```
122
At last, we can run it, again specifying the dependencies in the classpath (this one’s a mouthful). Let’s run it on the same `WordCount.java` source file we wrote:
124
```
java -cp ".:hadoop-3.2.1/share/hadoop/common/*:hadoop-3.2.1/share/hadoop/common/lib/*:hadoop-3.2.1/share/hadoop/mapreduce/*:hadoop-3.2.1/share/hadoop/mapreduce/lib/*:hadoop-3.2.1/share/hadoop/yarn/*:hadoop-3.2.1/share/hadoop/yarn/lib/*:hadoop-3.2.1/share/hadoop/hdfs/*:hadoop-3.2.1/share/hadoop/hdfs/lib/*" WordCount WordCount.java results
```
128
Hooray! We should now have a new `results/` folder containing the following files:
130
```
$ ls results
part-r-00000  _SUCCESS
$ cat results/part-r-00000
	154
0	2
1	3
2	1
addinputpath	1
apache	6
args	4
boolean	1
class	6
count	1
err	1
exception	1
-snip- (output cut for clarity)
```
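Since the output is simply one `word<TAB>count` pair per line, it is also easy to post-process. Here is a small sketch (class and method names are my own, just for illustration) that sorts the entries by count, highest first; reading the actual file into a string is left to `Files.readString`:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Post-processes Hadoop's word count output: each line of part-r-00000
// is "word<TAB>count". Returns the entries sorted by count, descending.
public class TopWords {
    public static List<Map.Entry<String, Integer>> topWords(String output) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>();
        for (String line : output.split("\n")) {
            String[] parts = line.split("\t");
            if (parts.length != 2) continue;  // skip anything that isn't a word-count pair
            entries.add(Map.entry(parts[0], Integer.parseInt(parts[1].trim())));
        }
        // Sort by count, highest first.
        entries.sort(Map.Entry.comparingByValue(Comparator.reverseOrder()));
        return entries;
    }

    public static void main(String[] args) {
        String sample = "apache\t6\nargs\t4\nboolean\t1";
        System.out.println(topWords(sample));
    }
}
```

Note that the empty word from the mapper (the `154` line above) shows up here too, since `"\t154".split("\t")` still yields two parts: an empty key and its count.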
149
It worked! Now, this example was obviously tiny, but hopefully it’s enough to demonstrate how to get the basics running before moving on to real-world data.