Develop Java MapReduce programs for Apache Hadoop on HDInsight
Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.
Prerequisites
Apache Maven, properly installed according to Apache's installation instructions. Maven is a project build system for Java projects.
Configure development environment
The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify accordingly for your environment.
From a command prompt, enter the commands below to create a working environment:
IF NOT EXIST C:\HDI MKDIR C:\HDI
cd C:\HDI
Create a Maven project
Enter the following command to create a Maven project named wordcountjava:
mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
This command creates a directory with the name specified by the artifactId parameter (wordcountjava in this example). This directory contains the following items:
- pom.xml: The Project Object Model (POM) that contains information and configuration details used to build the project.
- src\main\java\org\apache\hadoop\examples: Contains your application code.
- src\test\java\org\apache\hadoop\examples: Contains tests for your application.
Remove the generated example code. Delete the generated test and application files AppTest.java and App.java by entering the commands below:

cd wordcountjava
DEL src\main\java\org\apache\hadoop\examples\App.java
DEL src\test\java\org\apache\hadoop\examples\AppTest.java
Update the Project Object Model
For a full reference of the pom.xml file, see https://maven.apache.org/pom.html. Open pom.xml by entering the command below:
notepad pom.xml
Add dependencies
In pom.xml, add the following text in the <dependencies> section:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-examples</artifactId>
<version>2.7.3</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>2.7.3</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
<scope>provided</scope>
</dependency>
This defines required libraries (listed within <artifactId>) with a specific version (listed within <version>). At compile time, these dependencies are downloaded from the default Maven repository. You can use the Maven repository search to view more.
The <scope>provided</scope> element tells Maven that these dependencies should not be packaged with the application, because they are provided by the HDInsight cluster at run time.
Important
The version used should match the version of Hadoop present on your cluster. For more information on versions, see the HDInsight component versioning document.
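If you want to confirm which Hadoop version a cluster actually provides, one option is a small utility class like the sketch below. It uses Hadoop's org.apache.hadoop.util.VersionInfo API; the class name is illustrative (not part of this article's project), and the sketch assumes the Hadoop libraries are on the classpath, for example when run on the cluster with yarn jar.

package org.apache.hadoop.examples;

import org.apache.hadoop.util.VersionInfo;

// Illustrative helper: prints the Hadoop version visible on the classpath
// so it can be compared against the <version> values in pom.xml.
public class PrintHadoopVersion {
    public static void main(String[] args) {
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
    }
}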
Build configuration
Maven plug-ins allow you to customize the build stages of the project. This section is used to add plug-ins, resources, and other build configuration options.
Add the following code to the pom.xml file, and then save and close the file. This text must be inside the <project>...</project> tags in the file, for example, between </dependencies> and </project>.
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
</transformer>
</transformers>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
This section configures the Apache Maven Compiler Plugin and Apache Maven Shade Plugin. The compiler plug-in is used to compile the application. The shade plug-in is used to prevent license duplication in the JAR package that Maven builds; without it, duplicate license files cause a "duplicate license files" error at run time on the HDInsight cluster. Using maven-shade-plugin with the ApacheLicenseResourceTransformer implementation prevents the error.
The maven-shade-plugin also produces an uber jar that contains all the dependencies required by the application.
Save the pom.xml file.
Create the MapReduce application
Enter the command below to create and open a new file WordCount.java. Select Yes at the prompt to create a new file.

notepad src\main\java\org\apache\hadoop\examples\WordCount.java
Copy and paste the Java code below into the new file, and then close the file.
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: splits each input line into tokens and emits a (word, 1) pair
    // for each token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
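Note that IntSumReducer doubles as the job's combiner (job.setCombinerClass). Because integer addition is associative and commutative, partial sums computed on the map side produce the same final counts while reducing the amount of data shuffled to the reducers.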
Notice the package name is org.apache.hadoop.examples and the class name is WordCount. You use these names when you submit the MapReduce job.
Build and package the application
From the wordcountjava
directory, use the following command to build a JAR file that contains the application:
mvn clean package
This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.
Once the command finishes, the wordcountjava/target directory contains a file named wordcountjava-1.0-SNAPSHOT.jar.
Note
The wordcountjava-1.0-SNAPSHOT.jar file is an uberjar, which contains not only the WordCount job, but also the dependencies that the job requires at runtime.
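To verify what the shade plug-in packaged, you can list the entries in the jar. The following is a minimal sketch using the standard java.util.jar API; the class name is illustrative (not part of this article's project), and the path assumes you run it from the wordcountjava directory after the build.

import java.util.jar.JarFile;

// Illustrative helper: lists the first 20 .class entries in the shaded jar
// to show that application classes and dependency classes were packaged
// together.
public class ListJarEntries {
    public static void main(String[] args) throws Exception {
        try (JarFile jar = new JarFile("target/wordcountjava-1.0-SNAPSHOT.jar")) {
            jar.stream()
                .map(entry -> entry.getName())
                .filter(name -> name.endsWith(".class"))
                .limit(20)
                .forEach(System.out::println);
        }
    }
}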
Upload the JAR and run jobs (SSH)
The following steps use scp to copy the JAR to the primary head node of your Apache Hadoop on HDInsight cluster. The ssh command is then used to connect to the cluster and run the example directly on the head node.
Upload the jar to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:

scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:
Connect to the cluster. Replace CLUSTERNAME with your HDInsight cluster name and then enter the following command:

ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
From the SSH session, use the following command to run the MapReduce application:
yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
This command starts the WordCount MapReduce application. The input file is /example/data/gutenberg/davinci.txt, and the output directory is /example/data/wordcountout. Both the input file and the output are stored in the default storage for the cluster.

Once the job completes, use the following command to view the results:
hdfs dfs -cat /example/data/wordcountout/*
You should receive a list of words and counts, with values similar to the following text:
zeal    1
zelus   1
zenith  2
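If you prefer to read the output programmatically instead of with hdfs dfs -cat, the following is a minimal sketch using the Hadoop FileSystem API. The class name is illustrative (not part of this article's project); the sketch assumes it runs on the cluster, where the Hadoop configuration points at the default storage, and that the output directory matches the one used above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper: reads the part-r-NNNNN files that MapReduce wrote
// to the output directory and prints each word/count line.
public class CatWordCountOutput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/example/data/wordcountout"))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue; // skip _SUCCESS and other non-data files
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}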
Next steps
In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.