Commit f66cae4

Merge pull request #100107 from dagiro/freshness163
freshness163
2 parents deaef49 + ebed377 commit f66cae4

1 file changed: +36 -30 lines changed

articles/hdinsight/hadoop/apache-hadoop-mahout-linux-mac.md

Lines changed: 36 additions & 30 deletions
Original file line number | Diff line number | Diff line change
@@ -5,9 +5,9 @@ author: hrasheed-msft
55
ms.author: hrasheed
66
ms.reviewer: jasonh
77
ms.service: hdinsight
8-
ms.custom: hdinsightactive
98
ms.topic: conceptual
10-
ms.date: 04/24/2019
9+
ms.custom: hdinsightactive
10+
ms.date: 01/03/2020
1111
---
1212

1313
# Generate movie recommendations using Apache Mahout with Apache Hadoop in HDInsight (SSH)
@@ -20,15 +20,13 @@ Mahout is a [machine learning](https://en.wikipedia.org/wiki/Machine_learning) l
2020

2121
## Prerequisites
2222

23-
* An Apache Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](./apache-hadoop-linux-tutorial-get-started.md).
24-
25-
* An SSH client. For more information, see [Connect to HDInsight (Apache Hadoop) using SSH](../hdinsight-hadoop-linux-use-ssh-unix.md).
23+
An Apache Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](./apache-hadoop-linux-tutorial-get-started.md).
2624

2725
## Apache Mahout versioning
2826

2927
For more information about the version of Mahout in HDInsight, see [HDInsight versions and Apache Hadoop components](../hdinsight-component-versioning.md).
3028

31-
## <a name="recommendations"></a>Understanding recommendations
29+
## Understanding recommendations
3230

3331
One of the functions that is provided by Mahout is a recommendation engine. This engine accepts data in the format of `userID`, `itemId`, and `prefValue` (the preference for the item). Mahout can then perform co-occurrence analysis to determine: *users who have a preference for an item also have a preference for these other items*. Mahout then determines users with like-item preferences, which can be used to make recommendations.
3432

@@ -38,15 +36,15 @@ The following workflow is a simplified example that uses movie data:
3836

3937
* **Co-occurrence**: Bob and Alice also liked *The Phantom Menace*, *Attack of the Clones*, and *Revenge of the Sith*. Mahout determines that users who liked the previous three movies also like these three movies.
4038

41-
* **Similarity recommendation**: Because Joe liked the first three movies, Mahout looks at movies that others with similar preferences liked, but Joe has not watched (liked/rated). In this case, Mahout recommends *The Phantom Menace*, *Attack of the Clones*, and *Revenge of the Sith*.
39+
* **Similarity recommendation**: Because Joe liked the first three movies, Mahout looks at movies that others with similar preferences liked, but Joe hasn't watched (liked/rated). In this case, Mahout recommends *The Phantom Menace*, *Attack of the Clones*, and *Revenge of the Sith*.
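
To make the co-occurrence step above more concrete, here is a toy sketch, not Mahout's actual implementation. It assumes whitespace-separated input lines of `userID itemID prefValue` with each user rating an item at most once, and it counts how many users expressed a preference for each ordered pair of items. The function name `cooccurrence_counts` is hypothetical.

```bash
# Toy illustration only: count, for every ordered pair of items, how many
# users have a preference for both. Input: "userID itemID prefValue" lines.
cooccurrence_counts() {
  awk '
    # Collect the items each user has a preference for.
    { items[$1] = items[$1] " " $2 }
    END {
      for (u in items) {
        n = split(items[u], a, " ")
        # Every pair of distinct items from the same user co-occurs once.
        for (i = 1; i <= n; i++)
          for (j = 1; j <= n; j++)
            if (a[i] != a[j]) count[a[i] " " a[j]]++
      }
      # Emit "itemA itemB numberOfUsers".
      for (p in count) print p, count[p]
    }' | LC_ALL=C sort
}

# Two users; both rated items 10 and 20, only user 2 rated item 30.
printf '1 10 5\n1 20 4\n2 10 3\n2 20 5\n2 30 2\n' | cooccurrence_counts
```

In this sketch the pair `10 20` gets a count of 2 because both users rated both items. A recommender could then suggest item 30 to user 1, since it co-occurs with items that user already rated.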
4240

4341
### Understanding the data
4442

4543
Conveniently, [GroupLens Research](https://grouplens.org/datasets/movielens/) provides rating data for movies in a format that is compatible with Mahout. This data is available on your cluster's default storage at `/HdiSamples/HdiSamples/MahoutMovieData`.
4644

4745
There are two files, `moviedb.txt` and `user-ratings.txt`. The `user-ratings.txt` file is used during analysis. The `moviedb.txt` is used to provide user-friendly text information when viewing the results.
4846

49-
The data contained in user-ratings.txt has a structure of `userID`, `movieID`, `userRating`, and `timestamp`, which indicates how highly each user rated a movie. Here is an example of the data:
47+
The data contained in `user-ratings.txt` has a structure of `userID`, `movieID`, `userRating`, and `timestamp`, which indicates how highly each user rated a movie. Here is an example of the data:
5048

5149
196 242 3 881250949
5250
186 302 3 891717742
@@ -56,11 +54,17 @@ The data contained in user-ratings.txt has a structure of `userID`, `movieID`, `
5654

5755
## Run the analysis
5856

59-
From an SSH connection to the cluster, use the following command to run the recommendation job:
57+
1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:
6058

61-
```bash
62-
mahout recommenditembased -s SIMILARITY_COOCCURRENCE -i /HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt -o /example/data/mahoutout --tempDir /temp/mahouttemp
63-
```
59+
```cmd
60+
61+
```
62+
63+
1. Use the following command to run the recommendation job:
64+
65+
```bash
66+
mahout recommenditembased -s SIMILARITY_COOCCURRENCE -i /HdiSamples/HdiSamples/MahoutMovieData/user-ratings.txt -o /example/data/mahoutout --tempDir /temp/mahouttemp
67+
```
6468
6569
> [!NOTE]
6670
> The job may take several minutes to complete, and may run multiple MapReduce jobs.
@@ -75,10 +79,12 @@ mahout recommenditembased -s SIMILARITY_COOCCURRENCE -i /HdiSamples/HdiSamples/M
7579
7680
The output appears as follows:
7781
78-
1 [234:5.0,347:5.0,237:5.0,47:5.0,282:5.0,275:5.0,88:5.0,515:5.0,514:5.0,121:5.0]
79-
2 [282:5.0,210:5.0,237:5.0,234:5.0,347:5.0,121:5.0,258:5.0,515:5.0,462:5.0,79:5.0]
80-
3 [284:5.0,285:4.828125,508:4.7543354,845:4.75,319:4.705128,124:4.7045455,150:4.6938777,311:4.6769233,248:4.65625,272:4.649266]
81-
4 [690:5.0,12:5.0,234:5.0,275:5.0,121:5.0,255:5.0,237:5.0,895:5.0,282:5.0,117:5.0]
82+
```output
83+
1 [234:5.0,347:5.0,237:5.0,47:5.0,282:5.0,275:5.0,88:5.0,515:5.0,514:5.0,121:5.0]
84+
2 [282:5.0,210:5.0,237:5.0,234:5.0,347:5.0,121:5.0,258:5.0,515:5.0,462:5.0,79:5.0]
85+
3 [284:5.0,285:4.828125,508:4.7543354,845:4.75,319:4.705128,124:4.7045455,150:4.6938777,311:4.6769233,248:4.65625,272:4.649266]
86+
4 [690:5.0,12:5.0,234:5.0,275:5.0,121:5.0,255:5.0,237:5.0,895:5.0,282:5.0,117:5.0]
87+
```
8288
8389
The first column is the `userID`. The values contained in '[' and ']' are `movieId`:`recommendationScore`.
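
The line format described above can be unpacked into one row per recommendation with a short `awk` filter. This is a minimal sketch, assuming each line holds a `userID`, whitespace, and a bracketed comma-separated list of `movieId:recommendationScore` pairs. The helper name `parse_recommendations` is hypothetical.

```bash
# Sketch: turn "userID [movieId:score,movieId:score,...]" lines into
# "userID movieId score" rows, one recommendation per row.
parse_recommendations() {
  awk '{
    user = $1
    list = $2
    gsub(/\[|\]/, "", list)          # drop the surrounding brackets
    n = split(list, pairs, ",")      # one entry per recommended movie
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, ":")       # kv[1] = movieId, kv[2] = score
      print user, kv[1], kv[2]
    }
  }'
}

# Example with a shortened line in the same shape as the output above.
printf '4\t[690:5.0,12:5.0]\n' | parse_recommendations
# prints:
# 4 690 5.0
# 4 12 5.0
```

To run it on real job output, pipe the output file's contents into the function, for example from `hdfs dfs -text` on a file under `/example/data/mahoutout`. The exact file name depends on the job and is not assumed here.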
8490
@@ -169,19 +175,21 @@ mahout recommenditembased -s SIMILARITY_COOCCURRENCE -i /HdiSamples/HdiSamples/M
169175

170176
The output from this command is similar to the following text:
171177

172-
Seven Years in Tibet (1997), score=5.0
173-
Indiana Jones and the Last Crusade (1989), score=5.0
174-
Jaws (1975), score=5.0
175-
Sense and Sensibility (1995), score=5.0
176-
Independence Day (ID4) (1996), score=5.0
177-
My Best Friend's Wedding (1997), score=5.0
178-
Jerry Maguire (1996), score=5.0
179-
Scream 2 (1997), score=5.0
180-
Time to Kill, A (1996), score=5.0
178+
```output
179+
Seven Years in Tibet (1997), score=5.0
180+
Indiana Jones and the Last Crusade (1989), score=5.0
181+
Jaws (1975), score=5.0
182+
Sense and Sensibility (1995), score=5.0
183+
Independence Day (ID4) (1996), score=5.0
184+
My Best Friend's Wedding (1997), score=5.0
185+
Jerry Maguire (1996), score=5.0
186+
Scream 2 (1997), score=5.0
187+
Time to Kill, A (1996), score=5.0
188+
```
181189
182190
## Delete temporary data
183191
184-
Mahout jobs do not remove temporary data that is created while processing the job. The `--tempDir` parameter is specified in the example job to isolate the temporary files into a specific path for easy deletion. To remove the temp files, use the following command:
192+
Mahout jobs don't remove temporary data that is created while processing the job. The `--tempDir` parameter is specified in the example job to isolate the temporary files into a specific path for easy deletion. To remove the temp files, use the following command:
185193
186194
```bash
187195
hdfs dfs -rm -f -r /temp/mahouttemp
@@ -192,11 +200,9 @@ hdfs dfs -rm -f -r /temp/mahouttemp
192200
>
193201
> `hdfs dfs -rm -f -r /example/data/mahoutout`
194202
195-
196203
## Next steps
197204
198-
Now that you have learned how to use Mahout, discover other ways of working with data on HDInsight:
205+
Now that you've learned how to use Mahout, discover other ways of working with data on HDInsight:
199206
200207
* [Apache Hive with HDInsight](hdinsight-use-hive.md)
201-
* [Apache Pig with HDInsight](hdinsight-use-pig.md)
202-
* [MapReduce with HDInsight](hdinsight-use-mapreduce.md)
208+
* [MapReduce with HDInsight](hdinsight-use-mapreduce.md)
