---
title: Create Java MapReduce for Apache Hadoop - Azure HDInsight
description: Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Hadoop on Azure HDInsight.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 01/16/2020
---

# Develop Java MapReduce programs for Apache Hadoop on HDInsight

Learn how to use Apache Maven to create a Java-based MapReduce application, then run it with Apache Hadoop on Azure HDInsight.

## Prerequisites

* [Java Developer Kit (JDK) version 8](https://aka.ms/azure-jdks).
* [Apache Maven](https://maven.apache.org/download.cgi) properly [installed](https://maven.apache.org/install.html) according to Apache. Maven is a project build system for Java projects.
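
Both tools should be reachable from the command prompt used in the steps below. As a quick check, you can run the following commands; each prints version information if the installation succeeded (exact output varies by install):

```cmd
java -version
mvn --version
```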

## Configure development environment

The environment used for this article was a computer running Windows 10. The commands were executed in a command prompt, and the various files were edited with Notepad. Modify accordingly for your environment.

From a command prompt, enter the commands below to create a working directory:

```cmd
IF NOT EXIST C:\HDI MKDIR C:\HDI
cd C:\HDI
```

## Create a Maven project

1. Enter the following command to create a Maven project named **wordcountjava**:

    ```bash
    mvn archetype:generate -DgroupId=org.apache.hadoop.examples -DartifactId=wordcountjava -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
    ```

    This command creates a directory with the name specified by the `artifactId` parameter (**wordcountjava** in this example). This directory contains the following items:

    * `pom.xml` - The [Project Object Model (POM)](https://maven.apache.org/guides/introduction/introduction-to-the-pom.html) that contains information and configuration details used to build the project.
    * `src\main\java\org\apache\hadoop\examples` - Contains your application code.
    * `src\test\java\org\apache\hadoop\examples` - Contains tests for your application.

1. Remove the generated example code. Delete the generated test and application files `AppTest.java` and `App.java` by entering the commands below:

    ```cmd
    cd wordcountjava
    DEL src\main\java\org\apache\hadoop\examples\App.java
    DEL src\test\java\org\apache\hadoop\examples\AppTest.java
    ```

## Update the Project Object Model

For a full reference of the `pom.xml` file, see https://maven.apache.org/pom.html. Open `pom.xml` by entering the command below:

```cmd
notepad pom.xml
```

### Add dependencies

In `pom.xml`, add the following text in the `<dependencies>` section:

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-examples</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
    <scope>provided</scope>
</dependency>
```

This code defines required libraries (listed within `<artifactId>`) with a specific version (listed within `<version>`). At compile time, these dependencies are downloaded from the default Maven repository. You can use the [Maven repository search](https://search.maven.org/#artifactdetails%7Corg.apache.hadoop%7Chadoop-mapreduce-examples%7C2.5.1%7Cjar) to view more.

`<scope>provided</scope>` tells Maven that these dependencies should not be packaged with the application, because they are provided by the HDInsight cluster at run time.

> [!IMPORTANT]
> The version used should match the version of Hadoop present on your cluster. For more information on versions, see the [HDInsight component versioning](../hdinsight-component-versioning.md) document.
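
If you're unsure which Hadoop version your cluster provides, you can check it from an SSH session on the cluster head node (connection steps appear later in this article) with the standard `hadoop version` command:

```bash
hadoop version
```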

### Build configuration

Maven plug-ins allow you to customize the build stages of the project. This section is used to add plug-ins, resources, and other build configuration options.

Add the following code to the `pom.xml` file, and then save and close the file. This text must be inside the `<project>...</project>` tags in the file, for example, between `</dependencies>` and `</project>`.

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ApacheLicenseResourceTransformer">
                    </transformer>
                </transformers>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
    </plugins>
</build>
```

This section configures the Apache Maven Compiler Plugin and Apache Maven Shade Plugin. The compiler plug-in is used to compile the application. The shade plug-in prevents license duplication in the JAR package that Maven builds; duplicate license files can cause a "duplicate license files" error at run time on the HDInsight cluster. Using maven-shade-plugin with the `ApacheLicenseResourceTransformer` implementation prevents the error.

The maven-shade-plugin also produces an uber jar that contains all the dependencies required by the application.

Save the `pom.xml` file.
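
As an optional sanity check, you can ask Maven to print the merged (effective) POM using the standard Maven Help plug-in; if an edit broke the XML, the error surfaces here instead of midway through a build:

```cmd
mvn help:effective-pom
```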

## Create the MapReduce application

1. Enter the command below to create and open a new file `WordCount.java`. Select **Yes** at the prompt to create a new file.

    ```cmd
    notepad src\main\java\org\apache\hadoop\examples\WordCount.java
    ```

2. Copy and paste the Java code below into the new file, and then close the file.

    ```java
    package org.apache.hadoop.examples;

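    // NOTE: The body of this class was elided from this excerpt. The code below
    // is a sketch of the standard Apache Hadoop WordCount example (mapper,
    // reducer, and driver), which this article's sample follows; compare it
    // against the full article before relying on it.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in each input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: wires the mapper, combiner, and reducer into a job.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);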
        }
    }
    ```

    Notice the package name is `org.apache.hadoop.examples` and the class name is `WordCount`. You use these names when you submit the MapReduce job.

## Build and package the application

From the `wordcountjava` directory, use the following command to build a JAR file that contains the application:

```cmd
mvn clean package
```

This command cleans any previous build artifacts, downloads any dependencies that have not already been installed, and then builds and packages the application.

Once the command finishes, the `wordcountjava/target` directory contains a file named `wordcountjava-1.0-SNAPSHOT.jar`.

> [!NOTE]
> The `wordcountjava-1.0-SNAPSHOT.jar` file is an uberjar, which contains not only the WordCount job, but also dependencies that the job requires at runtime.
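
As an optional check, you can list the jar's contents with the JDK's `jar` tool to confirm that the `WordCount` class was packaged (the `findstr` filter is just a convenience):

```cmd
jar tf target\wordcountjava-1.0-SNAPSHOT.jar | findstr WordCount
```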

## Upload the JAR and run jobs (SSH)

The following steps use `scp` to copy the JAR to the primary head node of your Apache Hadoop on HDInsight cluster. The `ssh` command is then used to connect to the cluster and run the example directly on the head node.

1. Upload the jar to the cluster. Replace `CLUSTERNAME` with your HDInsight cluster name (and `sshuser` with your SSH user account, if different) and then enter the following command:

    ```cmd
    scp target/wordcountjava-1.0-SNAPSHOT.jar sshuser@CLUSTERNAME-ssh.azurehdinsight.net:
    ```

1. Connect to the cluster. Replace `CLUSTERNAME` with your HDInsight cluster name and then enter the following command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

1. From the SSH session, use the following command to run the MapReduce application:

    ```bash
    yarn jar wordcountjava-1.0-SNAPSHOT.jar org.apache.hadoop.examples.WordCount /example/data/gutenberg/davinci.txt /example/data/wordcountout
    ```

    This command starts the WordCount MapReduce application. The input file is `/example/data/gutenberg/davinci.txt`, and the output directory is `/example/data/wordcountout`. Both the input file and output are stored to the default storage for the cluster.
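
    While the job runs, YARN reports progress to the console. Optionally, from a second SSH session you can list the running application with the standard YARN CLI:

    ```bash
    yarn application -list
    ```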

1. Once the job completes, use the following command to view the results:

    ```bash
    hdfs dfs -cat /example/data/wordcountout/*
    ```

    You should receive a list of words and counts, with values similar to the following text:

    ```output
    zeal    1
    zelus   1
    zenith  2
    ```
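
    MapReduce doesn't overwrite an existing output directory. If you want to rerun the job with the same output path, delete the previous output first:

    ```bash
    hdfs dfs -rm -r /example/data/wordcountout
    ```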

## Next steps

In this document, you have learned how to develop a Java MapReduce job. See the following documents for other ways to work with HDInsight.

* [Use Apache Hive with HDInsight](hdinsight-use-hive.md)
* [Use MapReduce with HDInsight](hdinsight-use-mapreduce.md)
* [Java Developer Center](https://azure.microsoft.com/develop/java/)