Develop Java MapReduce programs for Apache Hadoop on HDInsight

2024-06-14

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.

Prerequisites

Java Developer Kit (JDK) version 8.
Apache Maven properly installed according to Apache. Maven is a project build system for Java projects.

Configure development environment

The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify accordingly for your environment.

From a command prompt, enter the commands below to create a working environment:

IF NOT EXIST C:\HDI MKDIR C:\HDI
cd C:\HDI

Create a Maven project

Enter the following command to create a Maven project named wordcountjava:
```
mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
```
This command creates a directory with the name specified by the artifactID parameter (wordcountjava in this example.) This directory contains the following items:
- pom.xml - The Project Object Model (POM) that contains information and configuration details used to build the project.
- src\main\java\org\apache\hadoop\examples: Contains your application code.
- src\test\java\org\apache\hadoop\examples: Contains tests for your application.
Remove the generated example code. Delete the generated test and application files AppTest.java, and App.java by entering the commands below:
```
cd wordcountjava
DEL src\main\java\org\apache\hadoop\examples\App.java
DEL src\test\java\org\apache\hadoop\examples\AppTest.java
```

Update the Project Object Model

For a full reference of the pom.xml file, see https://maven.apache.org/pom.html. Open pom.xml by entering the command below:

notepad pom.xml

Add dependencies

In pom.xml, add the following text in the <dependencies> section:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-examples</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>

This defines required libraries (listed within <artifactId>) with a specific version (listed within <version>). At compile time, these dependencies are downloaded from the default Maven repository. You can use the Maven repository search to view more.

The <scope>provided</scope> tells Maven that these dependencies should not be packaged with the application, as they are provided by the HDInsight cluster at run-time.

Important

The version used should match the version of Hadoop present on your cluster. For more information on versions, see the HDInsight component versioning document.

Build configuration

Maven plug-ins allow you to customize the build stages of the project. This section is used to add plug-ins, resources, and other build configuration options.

Add the following code to the pom.xml file, and then save and close the file. This text must be inside the <project>...</project> tags in the file, for example, between </dependencies> and </project>.

<build>
    <plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.3</version>
        <configuration>
        <transformers>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
            </transformer>
        </transformers>
        </configuration>
        <executions>
        <execution>
            <phase>package</phase>
                <goals>
                <goal>shade</goal>
                </goals>
        </execution>
        </executions>
        </plugin>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.1</version>
        <configuration>
        <source>1.8</source>
        <target>1.8</target>
        </configuration>
    </plugin>
    </plugins>
</build>

This section configures the Apache Maven Compiler Plugin and Apache Maven Shade Plugin. The compiler plug-in is used to compile the topology. The shade plug-in is used to prevent license duplication in the JAR package that is built by Maven. This plugin is used to prevent a "duplicate license files" error at run time on the HDInsight cluster. Using maven-shade-plugin with the ApacheLicenseResourceTransformer implementation prevents the error.

The maven-shade-plugin also produces an uber jar that contains all the dependencies required by the application.

Save the pom.xml file.

Create the MapReduce application

Enter the command below to create and open a new file WordCount.java. Select Yes at the prompt to create a new file.
```
notepad src\main\java\org\apache\hadoop\examples\WordCount.java
```

Then copy and paste the Java code below into the new file. Then close the file.

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                        Context context
                        ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
        sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice the package name is org.apache.hadoop.examples and the class name is WordCount. You use these names when you submit the MapReduce job.

Build and package the application

From the wordcountjava directory, use the following command to build a JAR file that contains the application:

mvn clean package

This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and package the application.

Once the command finishes, the wordcountjava/target directory contains a file named wordcountjava-1.0-SNAPSHOT.jar.

Note

The wordcountjava-1.0-SNAPSHOT.jar file is an uberjar, which contains not only the WordCount job, but also dependencies that the job requires at runtime.

Upload the JAR and run jobs (SSH)

The following steps use scp to copy the JAR to the primary head node of your Apache HBase on HDInsight cluster. The ssh command is then used to connect to the cluster and run the example directly on the head node.

Upload the jar to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:
```
scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:
```
Connect to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:
```
ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
```
From the SSH session, use the following command to run the MapReduce application:
```
yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
```
This command starts the WordCount MapReduce application. The input file is /example/data/gutenberg/davinci.txt, and the output directory is /example/data/wordcountout. Both the input file and output are stored to the default storage for the cluster.
Once the job completes, use the following command to view the results:
```
hdfs dfs -cat /example/data/wordcountout/*
```
You should receive a list of words and counts, with values similar to the following text:
```
zeal    1
zelus   1
zenith  2
```

Next steps

In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.