Commit 76acc86

Merge pull request #92561 from DennisLee-DennisLee/v-dele-1558066-008
1558066: Updated 3 HDInsight articles.
2 parents bce9b99 + ffed79d commit 76acc86

14 files changed: +287 −262 lines

articles/hdinsight/hadoop/apache-hadoop-dotnet-csharp-mapreduce-streaming.md

Lines changed: 105 additions & 78 deletions
@@ -6,7 +6,7 @@ ms.reviewer: jasonh
 ms.custom: hdinsightactive
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 02/15/2019
+ms.date: 10/17/2019
 ms.author: hrasheed
 ---

@@ -15,41 +15,43 @@ ms.author: hrasheed
 Learn how to use C# to create a MapReduce solution on HDInsight.

 > [!IMPORTANT]
-> Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see [HDInsight component versioning](../hdinsight-component-versioning.md).
+> Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see [Apache Hadoop components on HDInsight](../hdinsight-component-versioning.md).

 Apache Hadoop streaming is a utility that allows you to run MapReduce jobs using a script or executable. In this example, .NET is used to implement the mapper and reducer for a word count solution.

 ## .NET on HDInsight

-__Linux-based HDInsight__ clusters use [Mono (https://mono-project.com)](https://mono-project.com) to run .NET applications. Mono version 4.2.1 is included with HDInsight version 3.6. For more information on the version of Mono included with HDInsight, see [HDInsight component versions](../hdinsight-component-versioning.md).
+*Linux-based HDInsight* clusters use [Mono (https://mono-project.com)](https://mono-project.com) to run .NET applications. Mono version 4.2.1 is included with HDInsight version 3.6. For more information on the version of Mono included with HDInsight, see [Apache Hadoop components available with different HDInsight versions](../hdinsight-component-versioning.md#apache-hadoop-components-available-with-different-hdinsight-versions).

 For more information on Mono compatibility with .NET Framework versions, see [Mono compatibility](https://www.mono-project.com/docs/about-mono/compatibility/).

 ## How Hadoop streaming works

 The basic process used for streaming in this document is as follows:

-1. Hadoop passes data to the mapper (mapper.exe in this example) on STDIN.
+1. Hadoop passes data to the mapper (*mapper.exe* in this example) on STDIN.
 2. The mapper processes the data, and emits tab-delimited key/value pairs to STDOUT.
-3. The output is read by Hadoop, and then passed to the reducer (reducer.exe in this example) on STDIN.
+3. The output is read by Hadoop, and then passed to the reducer (*reducer.exe* in this example) on STDIN.
 4. The reducer reads the tab-delimited key/value pairs, processes the data, and then emits the result as tab-delimited key/value pairs on STDOUT.
 5. The output is read by Hadoop and written to the output directory.

 For more information on streaming, see [Hadoop Streaming](https://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html).
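The five-step data flow above can be emulated locally with standard Unix tools before any cluster is involved. In this illustrative sketch (not part of the article), `tr` stands in for the mapper, `sort` for Hadoop's shuffle/sort step, and `uniq -c` for the reducer:

```shell
# Emulate the streaming pipeline: mapper -> shuffle/sort -> reducer.
# tr splits the input into one word (key) per line, sort groups equal keys,
# and uniq -c aggregates a count per key, mirroring the word count job.
echo "the quick brown fox jumps over the lazy dog" | tr ' ' '\n' | sort | uniq -c
```

Because each stage only reads lines on STDIN and writes lines on STDOUT, any stage can be swapped for an executable such as *mapper.exe* or *reducer.exe* without changing the pipeline shape.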

 ## Prerequisites

-* A familiarity with writing and building C# code that targets .NET Framework 4.5. The steps in this document use Visual Studio 2017.
+* Visual Studio.
+
+* A familiarity with writing and building C# code that targets .NET Framework 4.5.

 * A way to upload .exe files to the cluster. The steps in this document use the Data Lake Tools for Visual Studio to upload the files to primary storage for the cluster.

-* Azure PowerShell or an SSH client.
+* Azure PowerShell or a Secure Shell (SSH) client.

 * A Hadoop on HDInsight cluster. For more information on creating a cluster, see [Create an HDInsight cluster](../hdinsight-hadoop-provision-linux-clusters.md).

 ## Create the mapper

-In Visual Studio, create a new __Console application__ named __mapper__. Use the following code for the application:
+In Visual Studio, create a new .NET Framework console application named *mapper*. Use the following code for the application:

 ```csharp
 using System;
@@ -82,11 +84,11 @@ namespace mapper
 }
 ```

-After creating the application, build it to produce the `/bin/Debug/mapper.exe` file in the project directory.
+After you create the application, build it to produce the */bin/Debug/mapper.exe* file in the project directory.
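The hunk above elides the body of the mapper listing. For orientation, a minimal word-count mapper consistent with the streaming contract described earlier might look like the following (an illustrative sketch, not the article's exact listing):

```csharp
using System;

namespace mapper
{
    class Program
    {
        public static void Main(string[] args)
        {
            string line;
            // Hadoop streaming delivers input records on STDIN, one line at a time.
            while ((line = Console.ReadLine()) != null)
            {
                // Emit a tab-delimited <word, 1> pair on STDOUT for each word.
                foreach (var word in line.Split(
                    new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
                {
                    Console.WriteLine("{0}\t1", word.ToLower());
                }
            }
        }
    }
}
```

The only contract the mapper must honor is line-oriented STDIN in and tab-delimited key/value pairs on STDOUT; everything between those two boundaries is ordinary C#.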

 ## Create the reducer

-In Visual Studio, create a new __Console application__ named __reducer__. Use the following code for the application:
+In Visual Studio, create a new .NET Framework console application named *reducer*. Use the following code for the application:

 ```csharp
 using System;
@@ -135,106 +137,131 @@ namespace reducer
 }
 ```

-After creating the application, build it to produce the `/bin/Debug/reducer.exe` file in the project directory.
+After you create the application, build it to produce the */bin/Debug/reducer.exe* file in the project directory.
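As with the mapper, the reducer body is elided in this hunk. A minimal reducer that matches the sorted, tab-delimited contract might look like this (an illustrative sketch, not the article's exact listing):

```csharp
using System;

namespace reducer
{
    class Program
    {
        public static void Main(string[] args)
        {
            string line, lastWord = null;
            int count = 0;
            // Hadoop sorts the mapper output by key before it reaches the reducer,
            // so all pairs for a given word arrive as one contiguous run on STDIN.
            while ((line = Console.ReadLine()) != null)
            {
                var parts = line.Split('\t');
                if (parts[0] != lastWord)
                {
                    // A new key begins: flush the total for the previous key.
                    if (lastWord != null)
                        Console.WriteLine("{0}\t{1}", lastWord, count);
                    lastWord = parts[0];
                    count = 0;
                }
                count += int.Parse(parts[1]);
            }
            // Flush the final key.
            if (lastWord != null)
                Console.WriteLine("{0}\t{1}", lastWord, count);
        }
    }
}
```

The run-length accumulation works only because the shuffle/sort stage guarantees that equal keys are adjacent; the reducer never needs to hold the whole dataset in memory.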

 ## Upload to storage

-1. In Visual Studio, open **Server Explorer**.
+Next, you need to upload the *mapper* and *reducer* applications to HDInsight storage.
+
+1. In Visual Studio, choose **View** > **Server Explorer**.

 2. Expand **Azure**, and then expand **HDInsight**.

-3. If prompted, enter your Azure subscription credentials, and then click **Sign In**.
+3. If prompted, enter your Azure subscription credentials, and then select **Sign In**.

-4. Expand the HDInsight cluster that you wish to deploy this application to. An entry with the text __(Default Storage Account)__ is listed.
+4. Expand the HDInsight cluster that you wish to deploy this application to. An entry with the text **(Default Storage Account)** is listed.

-    ![Server Explorer showing the storage account for the cluster](./media/apache-hadoop-dotnet-csharp-mapreduce-streaming/hdinsight-storage-account.png)
+    ![Storage account, HDInsight cluster, Server Explorer, Visual Studio](./media/apache-hadoop-dotnet-csharp-mapreduce-streaming/hdinsight-storage-account.png)

-    * If this entry can be expanded, you are using an __Azure Storage Account__ as default storage for the cluster. To view the files on the default storage for the cluster, expand the entry and then double-click the __(Default Container)__.
+    * If the **(Default Storage Account)** entry can be expanded, you're using an **Azure Storage Account** as default storage for the cluster. To view the files on the default storage for the cluster, expand the entry and then double-click **(Default Container)**.

-    * If this entry cannot be expanded, you are using __Azure Data Lake Storage__ as the default storage for the cluster. To view the files on the default storage for the cluster, double-click the __(Default Storage Account)__ entry.
+    * If the **(Default Storage Account)** entry can't be expanded, you're using **Azure Data Lake Storage** as the default storage for the cluster. To view the files on the default storage for the cluster, double-click the **(Default Storage Account)** entry.

 5. To upload the .exe files, use one of the following methods:

-    * If using an __Azure Storage Account__, click the upload icon, and then browse to the **bin\debug** folder for the **mapper** project. Finally, select the **mapper.exe** file and click **Ok**.
-
-        ![HDInsight upload icon for mapper](./media/apache-hadoop-dotnet-csharp-mapreduce-streaming/hdinsight-upload-icon.png)
-
-    * If using __Azure Data Lake Storage__, right-click an empty area in the file listing, and then select __Upload__. Finally, select the **mapper.exe** file and click **Open**.
-
-    Once the __mapper.exe__ upload has finished, repeat the upload process for the __reducer.exe__ file.
+    * If you're using an **Azure Storage Account**, select the **Upload Blob** icon.

-## Run a job: Using an SSH session
-
-1. Use SSH to connect to the HDInsight cluster. For more information, see [Use SSH with HDInsight](../hdinsight-hadoop-linux-use-ssh-unix.md).
-
-2. Use one of the following commands to start the MapReduce job:
+        ![HDInsight upload icon for mapper, Visual Studio](./media/apache-hadoop-dotnet-csharp-mapreduce-streaming/hdinsight-upload-icon.png)

-    * If using __Data Lake Storage Gen2__ as default storage:
+        In the **Upload New File** dialog box, under **File name**, select **Browse**. In the **Upload Blob** dialog box, go to the *bin\debug* folder for the *mapper* project, and then choose the *mapper.exe* file. Finally, select **Open** and then **OK** to complete the upload.

-        ```bash
-        yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files abfs:///mapper.exe,abfs:///reducer.exe -mapper mapper.exe -reducer reducer.exe -input /example/data/gutenberg/davinci.txt -output /example/wordcountout
-        ```
+    * For **Azure Data Lake Storage**, right-click an empty area in the file listing, and then select **Upload**. Finally, select the *mapper.exe* file and then select **Open**.

-    * If using __Data Lake Storage Gen1__ as default storage:
+    Once the *mapper.exe* upload has finished, repeat the upload process for the *reducer.exe* file.

-        ```bash
-        yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files adl:///mapper.exe,adl:///reducer.exe -mapper mapper.exe -reducer reducer.exe -input /example/data/gutenberg/davinci.txt -output /example/wordcountout
-        ```
-
-    * If using __Azure Storage__ as default storage:
-
-        ```bash
-        yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -files wasb:///mapper.exe,wasb:///reducer.exe -mapper mapper.exe -reducer reducer.exe -input /example/data/gutenberg/davinci.txt -output /example/wordcountout
-        ```
-
-    The following list describes what each parameter does:
-
-    * `hadoop-streaming.jar`: The jar file that contains the streaming MapReduce functionality.
-    * `-files`: Adds the `mapper.exe` and `reducer.exe` files to this job. The `abfs:///`,`adl:///` or `wasb:///` before each file is the path to the root of default storage for the cluster.
-    * `-mapper`: Specifies which file implements the mapper.
-    * `-reducer`: Specifies which file implements the reducer.
-    * `-input`: The input data.
-    * `-output`: The output directory.
+## Run a job: Using an SSH session

-3. Once the MapReduce job completes, use the following to view the results:
+The following procedure describes how to run a MapReduce job using an SSH session:

-    ```bash
-    hdfs dfs -text /example/wordcountout/part-00000
-    ```
+1. Use SSH to connect to the HDInsight cluster. (For example, run the command `ssh sshuser@<clustername>-ssh.azurehdinsight.net`.) For more information, see [Use SSH with HDInsight](../hdinsight-hadoop-linux-use-ssh-unix.md).

-    The following text is an example of the data returned by this command:
+2. Use one of the following commands to start the MapReduce job:

-        you 1128
-        young 38
-        younger 1
-        youngest 1
-        your 338
-        yours 4
-        yourself 34
-        yourselves 3
-        youth 17
+    * If the default storage is **Azure Storage**:
+
+        ```bash
+        yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
+            -files wasb:///mapper.exe,wasb:///reducer.exe \
+            -mapper mapper.exe \
+            -reducer reducer.exe \
+            -input /example/data/gutenberg/davinci.txt \
+            -output /example/wordcountout
+        ```
+
+    * If the default storage is **Data Lake Storage Gen1**:
+
+        ```bash
+        yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
+            -files adl:///mapper.exe,adl:///reducer.exe \
+            -mapper mapper.exe \
+            -reducer reducer.exe \
+            -input /example/data/gutenberg/davinci.txt \
+            -output /example/wordcountout
+        ```
+
+    * If the default storage is **Data Lake Storage Gen2**:
+
+        ```bash
+        yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
+            -files abfs:///mapper.exe,abfs:///reducer.exe \
+            -mapper mapper.exe \
+            -reducer reducer.exe \
+            -input /example/data/gutenberg/davinci.txt \
+            -output /example/wordcountout
+        ```
+
+    The following list describes what each parameter and option represents:
+
+    * *hadoop-streaming.jar*: Specifies the jar file that contains the streaming MapReduce functionality.
+    * `-files`: Specifies the *mapper.exe* and *reducer.exe* files for this job. The `wasb:///`, `adl:///`, or `abfs:///` protocol declaration before each file is the path to the root of default storage for the cluster.
+    * `-mapper`: Specifies the file that implements the mapper.
+    * `-reducer`: Specifies the file that implements the reducer.
+    * `-input`: Specifies the input data.
+    * `-output`: Specifies the output directory.
+
+3. Once the MapReduce job completes, use the following command to view the results:
+
+    ```bash
+    hdfs dfs -text /example/wordcountout/part-00000
+    ```
+
+    The following text is an example of the data returned by this command:
+
+    ```output
+    you 1128
+    young 38
+    younger 1
+    youngest 1
+    your 338
+    yours 4
+    yourself 34
+    yourselves 3
+    youth 17
+    ```

 ## Run a job: Using PowerShell

 Use the following PowerShell script to run a MapReduce job and download the results.

 [!code-powershell[main](../../../powershell_scripts/hdinsight/use-csharp-mapreduce/use-csharp-mapreduce.ps1?range=5-87)]

-This script prompts you for the cluster login account name and password, along with the HDInsight cluster name. Once the job completes, the output is downloaded to a file named `output.txt`. The following text is an example of the data in the `output.txt` file:
-
-    you 1128
-    young 38
-    younger 1
-    youngest 1
-    your 338
-    yours 4
-    yourself 34
-    yourselves 3
-    youth 17
+This script prompts you for the cluster login account name and password, along with the HDInsight cluster name. Once the job completes, the output is downloaded to a file named *output.txt*. The following text is an example of the data in the *output.txt* file:
+
+```output
+you 1128
+young 38
+younger 1
+youngest 1
+your 338
+yours 4
+yourself 34
+yourselves 3
+youth 17
+```

 ## Next steps

-For more information on using MapReduce with HDInsight, see [Use MapReduce with HDInsight](hdinsight-use-mapreduce.md).
+For more information on using MapReduce with HDInsight, see [Use MapReduce in Apache Hadoop on HDInsight](hdinsight-use-mapreduce.md).

 For information on using C# with Hive and Pig, see [Use a C# user-defined function with Apache Hive and Apache Pig](apache-hadoop-hive-pig-udf-dotnet-csharp.md).