articles/hdinsight/hadoop/apache-hadoop-mahout-linux-mac.md
Lines changed: 36 additions & 30 deletions
Original file line number
Diff line number
Diff line change
@@ -5,9 +5,9 @@ author: hrasheed-msft
5
5
ms.author: hrasheed
6
6
ms.reviewer: jasonh
7
7
ms.service: hdinsight
8
-
ms.custom: hdinsightactive
9
8
ms.topic: conceptual
10
-
ms.date: 04/24/2019
9
+
ms.custom: hdinsightactive
10
+
ms.date: 01/03/2020
11
11
---
12
12
13
13
# Generate movie recommendations using Apache Mahout with Apache Hadoop in HDInsight (SSH)
@@ -20,15 +20,13 @@ Mahout is a [machine learning](https://en.wikipedia.org/wiki/Machine_learning) l
20
20
21
21
## Prerequisites
22
22
23
-
* An Apache Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](./apache-hadoop-linux-tutorial-get-started.md).
24
-
25
-
* An SSH client. For more information, see [Connect to HDInsight (Apache Hadoop) using SSH](../hdinsight-hadoop-linux-use-ssh-unix.md).
23
+
An Apache Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](./apache-hadoop-linux-tutorial-get-started.md).
26
24
27
25
## Apache Mahout versioning
28
26
29
27
For more information about the version of Mahout in HDInsight, see [HDInsight versions and Apache Hadoop components](../hdinsight-component-versioning.md).
One of the functions that is provided by Mahout is a recommendation engine. This engine accepts data in the format of `userID`, `itemId`, and `prefValue` (the preference for the item). Mahout can then perform co-occurrence analysis to determine: *users who have a preference for an item also have a preference for these other items*. Mahout then determines users with like-item preferences, which can be used to make recommendations.
34
32
@@ -38,15 +36,15 @@ The following workflow is a simplified example that uses movie data:
38
36
39
37
* **Co-occurrence**: Bob and Alice also liked *The Phantom Menace*, *Attack of the Clones*, and *Revenge of the Sith*. Mahout determines that users who liked the previous three movies also like these three movies.
40
38
41
-
* **Similarity recommendation**: Because Joe liked the first three movies, Mahout looks at movies that others with similar preferences liked, but Joe has not watched (liked/rated). In this case, Mahout recommends *The Phantom Menace*, *Attack of the Clones*, and *Revenge of the Sith*.
39
+
* **Similarity recommendation**: Because Joe liked the first three movies, Mahout looks at movies that others with similar preferences liked, but Joe hasn't watched (liked/rated). In this case, Mahout recommends *The Phantom Menace*, *Attack of the Clones*, and *Revenge of the Sith*.
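
The co-occurrence and similarity steps above can be illustrated with a small in-memory sketch. This is not Mahout's implementation, only a toy version of the same idea. The names `likes`, `cooccurrence`, and `recommend` are hypothetical, and `Movie A`, `Movie B`, and `Movie C` are placeholders for the first three movies, which this excerpt doesn't name.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical data: "Movie A/B/C" stand in for the first three movies,
# which are not named in this excerpt.
FIRST_THREE = {"Movie A", "Movie B", "Movie C"}
NEXT_THREE = {"The Phantom Menace", "Attack of the Clones", "Revenge of the Sith"}

likes = {
    "Joe": set(FIRST_THREE),
    "Alice": FIRST_THREE | NEXT_THREE,
    "Bob": FIRST_THREE | NEXT_THREE,
}


def cooccurrence(likes):
    """Count how many users liked each ordered pair of distinct movies."""
    counts = defaultdict(lambda: defaultdict(int))
    for movies in likes.values():
        for a, b in permutations(movies, 2):
            counts[a][b] += 1
    return counts


def recommend(user, likes, top_n=3):
    """Rank movies the user hasn't liked by co-occurrence with movies they did like."""
    counts = cooccurrence(likes)
    scores = defaultdict(int)
    for liked in likes[user]:
        for other, n in counts[liked].items():
            if other not in likes[user]:
                scores[other] += n
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda movie: (-scores[movie], movie))[:top_n]


joe_recommendations = recommend("Joe", likes)
print(joe_recommendations)
```

For Joe, the three movies that Alice and Bob liked and Joe hasn't rated all receive the same score, so the tie-break alone decides their order.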
42
40
43
41
### Understanding the data
44
42
45
43
Conveniently, [GroupLens Research](https://grouplens.org/datasets/movielens/) provides rating data for movies in a format that is compatible with Mahout. This data is available on your cluster's default storage at `/HdiSamples/HdiSamples/MahoutMovieData`.
46
44
47
45
There are two files, `moviedb.txt` and `user-ratings.txt`. The `user-ratings.txt` file is used during analysis. The `moviedb.txt` is used to provide user-friendly text information when viewing the results.
48
46
49
-
The data contained in user-ratings.txt has a structure of `userID`, `movieID`, `userRating`, and `timestamp`, which indicates how highly each user rated a movie. Here is an example of the data:
47
+
The data contained in `user-ratings.txt` has a structure of `userID`, `movieID`, `userRating`, and `timestamp`, which indicates how highly each user rated a movie. Here is an example of the data:
50
48
51
49
196 242 3 881250949
52
50
186 302 3 891717742
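
As a rough illustration of the `userID`, `movieID`, `userRating`, `timestamp` layout, the sketch below parses the two sample rows shown above. It assumes the fields are whitespace-delimited and that all four are integers. The `Rating` type and `parse_rating` helper are hypothetical names, not part of Mahout or HDInsight.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int  # assumed to be seconds since the Unix epoch


def parse_rating(line: str) -> Rating:
    """Split one whitespace-delimited row into its four integer fields."""
    user_id, movie_id, rating, timestamp = line.split()
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# The two sample rows from the excerpt above.
sample_rows = [
    "196 242 3 881250949",
    "186 302 3 891717742",
]

ratings = [parse_rating(row) for row in sample_rows]
for r in ratings:
    print(f"user {r.user_id} rated movie {r.movie_id}: {r.rating}")
```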
@@ -56,11 +54,17 @@ The data contained in user-ratings.txt has a structure of `userID`, `movieID`, `
56
54
57
55
## Run the analysis
58
56
59
-
From an SSH connection to the cluster, use the following command to run the recommendation job:
57
+
1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:
The output from this command is similar to the following text:
171
177
172
-
Seven Years in Tibet (1997), score=5.0
173
-
Indiana Jones and the Last Crusade (1989), score=5.0
174
-
Jaws (1975), score=5.0
175
-
Sense and Sensibility (1995), score=5.0
176
-
Independence Day (ID4) (1996), score=5.0
177
-
My Best Friend's Wedding (1997), score=5.0
178
-
Jerry Maguire (1996), score=5.0
179
-
Scream 2 (1997), score=5.0
180
-
Time to Kill, A (1996), score=5.0
178
+
```output
179
+
Seven Years in Tibet (1997), score=5.0
180
+
Indiana Jones and the Last Crusade (1989), score=5.0
181
+
Jaws (1975), score=5.0
182
+
Sense and Sensibility (1995), score=5.0
183
+
Independence Day (ID4) (1996), score=5.0
184
+
My Best Friend's Wedding (1997), score=5.0
185
+
Jerry Maguire (1996), score=5.0
186
+
Scream 2 (1997), score=5.0
187
+
Time to Kill, A (1996), score=5.0
188
+
```
181
189
182
190
## Delete temporary data
183
191
184
-
Mahout jobs do not remove temporary data that is created while processing the job. The `--tempDir` parameter is specified in the example job to isolate the temporary files into a specific path for easy deletion. To remove the temp files, use the following command:
192
+
Mahout jobs don't remove temporary data that is created while processing the job. The `--tempDir` parameter is specified in the example job to isolate the temporary files into a specific path for easy deletion. To remove the temp files, use the following command: