Commit 0fcdcb1: Merge pull request #101441 from dagiro/freshness179 (parents 432147b, e031ce3)

1 file changed: 147 additions, 164 deletions
---
title: Create Java MapReduce for Apache Hadoop - Azure HDInsight
description: Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Hadoop on Azure HDInsight.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 01/16/2020
---

# Develop Java MapReduce programs for Apache Hadoop on HDInsight

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.

## Prerequisites

* [Java Developer Kit (JDK) version 8](https://aka.ms/azure-jdks).

* [Apache Maven](https://maven.apache.org/download.cgi), [installed](https://maven.apache.org/install.html) according to the Apache instructions. Maven is a project build system for Java projects.

## Configure development environment

The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify accordingly for your environment.

From a command prompt, enter the commands below to create a working environment:

```cmd
IF NOT EXIST C:\HDI MKDIR C:\HDI
cd C:\HDI
```

## Create a Maven project

1. Enter the following command to create a Maven project named **wordcountjava**:

    ```bash
    mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
    ```

    This command creates a directory with the name specified by the `artifactId` parameter (**wordcountjava** in this example). This directory contains the following items:

    * `pom.xml` - The [Project Object Model (POM)](https://maven.apache.org/guides/introduction/introduction-to-the-pom.html) that contains information and configuration details used to build the project.
    * `src\main\java\org\apache\hadoop\examples`: Contains your application code.
    * `src\test\java\org\apache\hadoop\examples`: Contains tests for your application.

1. Remove the generated example code. Delete the generated test and application files `AppTest.java` and `App.java` by entering the commands below:

    ```cmd
    cd wordcountjava
    DEL src\main\java\org\apache\hadoop\examples\App.java
    DEL src\test\java\org\apache\hadoop\examples\AppTest.java
    ```

## Update the Project Object Model

For a full reference of the pom.xml file, see https://maven.apache.org/pom.html. Open `pom.xml` by entering the command below:

```cmd
notepad pom.xml
```

### Add dependencies

In `pom.xml`, add the following text in the `<dependencies>` section:

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-examples</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
```

This defines required libraries (listed within &lt;artifactId\>) with a specific version (listed within &lt;version\>). At compile time, these dependencies are downloaded from the default Maven repository. You can use the [Maven repository search](https://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-mapreduce-examples%7C2.5.1%7Cjar) to view more.

The `<scope>provided</scope>` tells Maven that these dependencies should not be packaged with the application, as they are provided by the HDInsight cluster at run-time.

> [!IMPORTANT]
> The version used should match the version of Hadoop present on your cluster. For more information on versions, see the [HDInsight component versioning](../hdinsight-component-versioning.md) document.

### Build configuration

Maven plug-ins allow you to customize the build stages of the project. This section is used to add plug-ins, resources, and other build configuration options.

Add the following code to the `pom.xml` file, and then save and close the file. This text must be inside the `<project>...</project>` tags in the file, for example, between `</dependencies>` and `</project>`.

```xml
<build>
    <plugins>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.3</version>
        <configuration>
            <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                </transformer>
            </transformers>
        </configuration>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.1</version>
        <configuration>
            <source>1.8</source>
            <target>1.8</target>
        </configuration>
    </plugin>
    </plugins>
</build>
```

This section configures the Apache Maven Compiler Plugin and Apache Maven Shade Plugin. The compiler plug-in is used to compile the project's Java source. The shade plug-in prevents license duplication in the JAR package that Maven builds, which would otherwise cause a "duplicate license files" error at run time on the HDInsight cluster. Using maven-shade-plugin with the `ApacheLicenseResourceTransformer` implementation prevents the error.

The maven-shade-plugin also produces an uber jar that contains all the dependencies required by the application.

Save the `pom.xml` file.

## Create the MapReduce application

1. Enter the command below to create and open a new file `WordCount.java`. Select **Yes** at the prompt to create a new file.

    ```cmd
    notepad src\main\java\org\apache\hadoop\examples\WordCount.java
    ```

2. Copy and paste the Java code below into the new file, and then close the file.

    ```java
    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    ```

    Notice the package name is `org.apache.hadoop.examples` and the class name is `WordCount`. You use these names when you submit the MapReduce job.
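The map-then-reduce data flow that WordCount performs can be sketched without a cluster. The following is a minimal, plain-JDK illustration (the class `WordCountSketch` and its method names are my own, not part of the article's project): the "map" phase tokenizes each line and emits a `(word, 1)` pair per token, and the "reduce" phase sums the counts per word, just as `TokenizerMapper` and `IntSumReducer` do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative sketch of the WordCount data flow, without Hadoop.
public class WordCountSketch {

    // "Map" phase: tokenize a line and emit a count of 1 per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" phase: sum the emitted counts, grouped by word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : new String[] { "zeal and zenith", "zenith" }) {
            emitted.addAll(map(line));
        }
        System.out.println(reduce(emitted)); // prints {and=1, zeal=1, zenith=2}
    }
}
```

On a real cluster, the grouping between the two phases is done by Hadoop's shuffle; here a single list stands in for it.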
## Build and package the application

From the `wordcountjava` directory, use the following command to build a JAR file that contains the application:

```cmd
mvn clean package
```

This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.

Once the command finishes, the `wordcountjava/target` directory contains a file named `wordcountjava-1.0-SNAPSHOT.jar`.

> [!NOTE]
> The `wordcountjava-1.0-SNAPSHOT.jar` file is an uberjar, which contains not only the WordCount job, but also dependencies that the job requires at runtime.
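If you want to confirm that the uberjar really bundles the dependency classes, you can list its entries. The following plain-JDK sketch does that (the class name `JarLister` and the hard-coded path are illustrative, not from the article; `jar tf <file>` gives the same listing from the command line):

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Lists the entries inside a jar so you can verify which classes were bundled.
public class JarLister {

    static List<String> entries(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> e = jar.entries();
            while (e.hasMoreElements()) {
                names.add(e.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Point this at the jar produced by `mvn clean package`.
        String jarPath = args.length > 0 ? args[0] : "target/wordcountjava-1.0-SNAPSHOT.jar";
        if (!new File(jarPath).exists()) {
            System.out.println("Jar not found: " + jarPath);
            return;
        }
        for (String name : entries(jarPath)) {
            System.out.println(name);
        }
    }
}
```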
## Upload the JAR and run jobs (SSH)

The following steps use `scp` to copy the JAR to the primary head node of your Apache Hadoop on HDInsight cluster. The `ssh` command is then used to connect to the cluster and run the example directly on the head node.

1. Upload the jar to the cluster. Replace `CLUSTERNAME` with your HDInsight cluster name and then enter the following command:

    ```cmd
    scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:
    ```

1. Connect to the cluster. Replace `CLUSTERNAME` with your HDInsight cluster name and then enter the following command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

1. From the SSH session, use the following command to run the MapReduce application:

    ```bash
    yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
    ```

    This command starts the WordCount MapReduce application. The input file is `/example/data/gutenberg/davinci.txt`, and the output directory is `/example/data/wordcountout`. Both the input file and output are stored to the default storage for the cluster.

1. Once the job completes, use the following command to view the results:

    ```bash
    hdfs dfs -cat /example/data/wordcountout/*
    ```

    You should receive a list of words and counts, with values similar to the following text:

    ```output
    zeal    1
    zelus   1
    zenith  2
    ```

## Next steps

In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.

* [Use Apache Hive with HDInsight](hdinsight-use-hive.md)
* [Use MapReduce with HDInsight](hdinsight-use-mapreduce.md)
* [Java Developer Center](https://azure.microsoft.com/develop/java/)
