Learn how to use C# to create a MapReduce solution on HDInsight.
> [!IMPORTANT]
> Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see [Apache Hadoop components on HDInsight](../hdinsight-component-versioning.md).
Apache Hadoop streaming is a utility that allows you to run MapReduce jobs using a script or executable. In this example, .NET is used to implement the mapper and reducer for a word count solution.
## .NET on HDInsight
*Linux-based HDInsight* clusters use [Mono](https://mono-project.com) to run .NET applications. Mono version 4.2.1 is included with HDInsight version 3.6. For more information on the version of Mono included with HDInsight, see [Apache Hadoop components available with different HDInsight versions](../hdinsight-component-versioning.md#apache-hadoop-components-available-with-different-hdinsight-versions).
For more information on Mono compatibility with .NET Framework versions, see [Mono compatibility](https://www.mono-project.com/docs/about-mono/compatibility/).
## How Hadoop streaming works
The basic process used for streaming in this document is as follows:
1. Hadoop passes data to the mapper (*mapper.exe* in this example) on STDIN.
2. The mapper processes the data, and emits tab-delimited key/value pairs to STDOUT.
3. The output is read by Hadoop, and then passed to the reducer (*reducer.exe* in this example) on STDIN.
4. The reducer reads the tab-delimited key/value pairs, processes the data, and then emits the result as tab-delimited key/value pairs on STDOUT.
5. The output is read by Hadoop and written to the output directory.
For more information on streaming, see [Hadoop Streaming](https://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/HadoopStreaming.html).
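You can simulate this data flow locally at a shell prompt. In the following sketch, two small `awk` programs stand in for *mapper.exe* and *reducer.exe* (they're illustrative stand-ins, not part of the HDInsight sample), and `sort` plays the role of Hadoop's shuffle between the map and reduce stages:

```shell
# A word-count pipeline that mimics the streaming contract:
# the mapper emits "word<TAB>1" pairs on STDOUT, Hadoop sorts by key,
# and the reducer sums the counts for each word.
printf 'the quick fox\nthe lazy dog\n' |
  awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }' |          # stand-in mapper
  sort |                                                              # Hadoop shuffle/sort
  awk -F'\t' '{ c[$1] += $2 } END { for (w in c) printf "%s\t%d\n", w, c[w] }' |  # stand-in reducer
  sort
```

The final `sort` only makes the output order deterministic; in a real job, Hadoop writes the reducer output to the output directory instead.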
## Prerequisites
* Visual Studio.
* A familiarity with writing and building C# code that targets .NET Framework 4.5.
* A way to upload .exe files to the cluster. The steps in this document use the Data Lake Tools for Visual Studio to upload the files to primary storage for the cluster.
* Azure PowerShell or a Secure Shell (SSH) client.
* A Hadoop on HDInsight cluster. For more information on creating a cluster, see [Create an HDInsight cluster](../hdinsight-hadoop-provision-linux-clusters.md).
## Create the mapper
In Visual Studio, create a new .NET Framework console application named *mapper*. Use the following code for the application:
```csharp
using System;
using System.Text.RegularExpressions;

namespace mapper
{
    class Program
    {
        static void Main(string[] args)
        {
            string line;
            // Hadoop passes data to the mapper on STDIN, one line at a time.
            while ((line = Console.ReadLine()) != null)
            {
                // Strip punctuation and digits so only words remain.
                var onlyText = Regex.Replace(line, @"\.|;|:|,|[0-9]|'", "");
                // Split the line into individual words.
                var words = Regex.Matches(onlyText, @"[\w]+");
                foreach (var word in words)
                {
                    // Emit a tab-delimited key/value pair:
                    // the word and a count of 1.
                    Console.WriteLine("{0}\t1", word);
                }
            }
        }
    }
}
```
After you create the application, build it to produce the */bin/Debug/mapper.exe* file in the project directory.
## Create the reducer
In Visual Studio, create a new .NET Framework console application named *reducer*. Use the following code for the application:
```csharp
using System;
using System.Collections.Generic;

namespace reducer
{
    class Program
    {
        static void Main(string[] args)
        {
            // Dictionary that holds a running count for each word.
            var counts = new Dictionary<string, int>();

            string line;
            // Hadoop passes sorted, tab-delimited key/value pairs on STDIN.
            while ((line = Console.ReadLine()) != null)
            {
                var parts = line.Split('\t');
                string word = parts[0];
                int count = Convert.ToInt32(parts[1]);

                if (counts.ContainsKey(word))
                {
                    counts[word] += count;
                }
                else
                {
                    counts.Add(word, count);
                }
            }

            // Emit the totals as tab-delimited key/value pairs on STDOUT.
            foreach (var pair in counts)
            {
                Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
            }
        }
    }
}
```
After you create the application, build it to produce the */bin/Debug/reducer.exe* file in the project directory.
## Upload to storage
Next, you need to upload the *mapper* and *reducer* applications to HDInsight storage.
1. In Visual Studio, choose **View** > **Server Explorer**.
2. Expand **Azure**, and then expand **HDInsight**.
3. If prompted, enter your Azure subscription credentials, and then select **Sign In**.
4. Expand the HDInsight cluster that you wish to deploy this application to. An entry with the text **(Default Storage Account)** is listed.

* If the **(Default Storage Account)** entry can be expanded, you're using an **Azure Storage Account** as default storage for the cluster. To view the files on the default storage for the cluster, expand the entry and then double-click **(Default Container)**.
* If the **(Default Storage Account)** entry can't be expanded, you're using **Azure Data Lake Storage** as the default storage for the cluster. To view the files on the default storage for the cluster, double-click the **(Default Storage Account)** entry.
5. To upload the .exe files, use one of the following methods:
* If you're using an **Azure Storage Account**, select the **Upload Blob** icon.

In the **Upload New File** dialog box, under **File name**, select **Browse**. In the **Upload Blob** dialog box, go to the *bin\debug* folder for the *mapper* project, and then choose the *mapper.exe* file. Finally, select **Open** and then **OK** to complete the upload.
* For **Azure Data Lake Storage**, right-click an empty area in the file listing, and then select **Upload**. Finally, select the *mapper.exe* file and then select **Open**.
Once the *mapper.exe* upload has finished, repeat the upload process for the *reducer.exe* file.
## Run a job: Using an SSH session
The following procedure describes how to run a MapReduce job using an SSH session:
1. Use SSH to connect to the HDInsight cluster. (For example, run the command `ssh sshuser@<clustername>-ssh.azurehdinsight.net`.) For more information, see [Use SSH with HDInsight](../hdinsight-hadoop-linux-use-ssh-unix.md).
2. Use one of the following commands to start the MapReduce job:
* If the default storage is **Azure Storage**:
```bash
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -files wasb:///mapper.exe,wasb:///reducer.exe \
    -mapper mapper.exe \
    -reducer reducer.exe \
    -input /example/data/gutenberg/davinci.txt \
    -output /example/wordcountout
```
* If the default storage is **Data Lake Storage Gen1**:
```bash
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -files adl:///mapper.exe,adl:///reducer.exe \
    -mapper mapper.exe \
    -reducer reducer.exe \
    -input /example/data/gutenberg/davinci.txt \
    -output /example/wordcountout
```
* If the default storage is **Data Lake Storage Gen2**:
```bash
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -files abfs:///mapper.exe,abfs:///reducer.exe \
    -mapper mapper.exe \
    -reducer reducer.exe \
    -input /example/data/gutenberg/davinci.txt \
    -output /example/wordcountout
```
The following list describes what each parameter and option represents:
* `hadoop-streaming.jar`: Specifies the jar file that contains the streaming MapReduce functionality.
* `-files`: Specifies the *mapper.exe* and *reducer.exe* files for this job. The `wasb:///`, `adl:///`, or `abfs:///` protocol declaration before each file is the path to the root of default storage for the cluster.
* `-mapper`: Specifies the file that implements the mapper.
* `-reducer`: Specifies the file that implements the reducer.
* `-input`: Specifies the input data.
* `-output`: Specifies the output directory.
3. Once the MapReduce job completes, use the following command to view the results:
```bash
hdfs dfs -text /example/wordcountout/part-00000
```
The following text is an example of the data returned by this command:
```output
you 1128
young 38
younger 1
youngest 1
your 338
yours 4
yourself 34
yourselves 3
youth 17
```
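Because each line of the output is a tab-delimited word/count pair, you can post-process it with standard shell tools. The following sketch sorts a small sample of the results by count, descending; *sample-output.txt* is a hypothetical stand-in for the real *part-00000* output:

```shell
# Create a small sample in the same word<TAB>count format as the job output.
printf 'you\t1128\nyoung\t38\nyounger\t1\nyour\t338\n' > sample-output.txt

# Sort numerically on the second (count) field, highest first.
sort -t "$(printf '\t')" -k2,2nr sample-output.txt
```

Against the real job output, you could pipe `hdfs dfs -text /example/wordcountout/part-00000` into the same `sort` command instead of using a local file.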
## Run a job: Using PowerShell
Use the following PowerShell script to run a MapReduce job and download the results.
This script prompts you for the cluster login account name and password, along with the HDInsight cluster name. Once the job completes, the output is downloaded to a file named *output.txt*. The following text is an example of the data in the *output.txt* file:
```output
you 1128
young 38
younger 1
youngest 1
your 338
yours 4
yourself 34
yourselves 3
youth 17
```
## Next steps
For more information on using MapReduce with HDInsight, see [Use MapReduce in Apache Hadoop on HDInsight](hdinsight-use-mapreduce.md).
For information on using C# with Hive and Pig, see [Use a C# user-defined function with Apache Hive and Apache Pig](apache-hadoop-hive-pig-udf-dotnet-csharp.md).