articles/hdinsight/spark/apache-spark-machine-learning-mllib-ipython.md (43 additions, 39 deletions)

---
title: Machine learning example with Spark MLlib on HDInsight - Azure
description: Learn how to use Spark MLlib to create a machine learning app that analyzes a dataset using classification through logistic regression.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 04/16/2020
---

# Use Apache Spark MLlib to build a machine learning application and analyze a dataset

Learn how to use Apache Spark [MLlib](https://spark.apache.org/mllib/) to create a machine learning application that does simple predictive analysis on an open dataset. From Spark's built-in machine learning libraries, this example uses *classification* through logistic regression.

MLlib is a core Spark library that provides many utilities useful for machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Topic modeling
* Singular value decomposition (SVD) and principal component analysis (PCA)
* Hypothesis testing and calculating sample statistics

## Understand classification and logistic regression

*Classification*, a popular machine learning task, is the process of sorting input data into categories. It's the job of a classification algorithm to figure out how to assign "labels" to input data that you provide. For example, you could think of a machine learning algorithm that accepts stock information as input and divides the stocks into two categories: stocks that you should sell and stocks that you should keep.
Logistic regression is the algorithm that you use for classification. Spark's logistic regression API is useful for *binary classification*, or classifying input data into one of two groups. For more information about logistic regressions, see [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression).

In summary, the process of logistic regression produces a *logistic function* that you can use to predict the probability that an input vector belongs in one group or the other.
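
As a quick illustration (a sketch, not part of the original walkthrough), the logistic function maps any real-valued score to a probability between 0 and 1:

```PySpark
# Illustrative sketch only: the logistic (sigmoid) function behind logistic regression.
from math import exp

def logistic(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

print(logistic(0.0))   # 0.5 - exactly on the decision boundary
print(logistic(2.5))   # ~0.92 - strong evidence for one group
print(logistic(-2.5))  # ~0.08 - strong evidence for the other group
```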
## Predictive analysis example on food inspection data

In this example, you use Spark to do some predictive analysis on food inspection data (**Food_Inspections1.csv**) that was acquired through the [City of Chicago data portal](https://data.cityofchicago.org/). This dataset contains information about food establishment inspections that were conducted in Chicago, including information about each establishment, the violations found (if any), and the results of the inspection. The CSV data file is already available in the storage account associated with the cluster at **/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv**.
In the steps below, you develop a model to see what it takes to pass or fail a food inspection.

```PySpark
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
```

Because of the PySpark kernel, you don't need to create any contexts explicitly. The Spark and Hive contexts are automatically created when you run the first code cell.
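
For example, you can use the preconfigured contexts directly in the first cell; this quick check is illustrative only:

```PySpark
# Illustrative only: these objects already exist in the PySpark kernel.
print(sc.version)   # the SparkContext created by the kernel
print(sqlContext)   # the SQL/Hive context created by the kernel
```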
## Construct the input dataframe

Because the raw data is in CSV format, you can use the Spark context to pull the file into memory as unstructured text, and then use Python's CSV library to parse each line of the data.
1. Run the following lines to create a Resilient Distributed Dataset (RDD) by importing and parsing the input data.
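
    A minimal sketch of what this step can look like, assuming the preconfigured `sc` context and the sample path noted earlier; the helper name `csv_parse`, the `wasbs:///` URI, and a Python 3 kernel are assumptions:

    ```PySpark
    # Sketch only: read the CSV file as plain text, then parse each line with Python's csv module.
    import csv
    from io import StringIO

    def csv_parse(line):
        # csv.reader correctly handles quoted fields that contain commas
        return next(csv.reader(StringIO(line)))

    inspections = sc.textFile('wasbs:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv') \
                    .map(csv_parse)
    inspections.take(1)
    ```

    The tail of the sample output shows the establishment's location coordinates: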

    ```
    '(41.97583445690982, -87.7107455232781)']]
    ```

    The output gives you an idea of the schema of the input file. It includes the name of every establishment, the type of establishment, the address, the date of the inspection, and the location, among other things.

3. Run the following code to create a dataframe (*df*) and a temporary table (*CountResults*) with a few columns that are useful for the predictive analysis. `sqlContext` is used to perform transformations on structured data.
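
    A hedged sketch of this step, reusing the illustrative `inspections` RDD from the earlier snippet; the schema, the chosen columns, and the CSV column positions are assumptions rather than the article's exact code:

    ```PySpark
    # Sketch only: keep a few useful columns and register them as a temporary table.
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    schema = StructType([
        StructField("id", IntegerType(), False),
        StructField("name", StringType(), False),
        StructField("results", StringType(), False),
        StructField("violations", StringType(), True)])

    # The column positions (0, 1, 12, 13) are illustrative guesses at the CSV layout.
    df = sqlContext.createDataFrame(
        inspections.map(lambda l: (int(l[0]), l[1], l[12], l[13])), schema)
    df.registerTempTable('CountResults')
    ```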

3. You can also use [Matplotlib](https://en.wikipedia.org/wiki/Matplotlib), a library used to construct visualizations of data, to create a plot. Because the plot must be created from the locally persisted **countResultsdf** dataframe, the code snippet must begin with the `%%local` magic, which ensures that the code runs locally on the Jupyter server. A fuller plotting sketch follows the snippet below.

    ```PySpark
    %%local
    %matplotlib inline
    import matplotlib.pyplot as plt
    plt.axis('equal')
    ```
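
    A fuller, hedged sketch of this kind of local plot; the column names `results` and `cnt` on **countResultsdf** are assumptions:

    ```PySpark
    %%local
    %matplotlib inline
    import matplotlib.pyplot as plt

    # Sketch only: countResultsdf is the locally persisted pandas dataframe.
    # The column names below are assumptions.
    labels = countResultsdf['results']
    sizes = countResultsdf['cnt']
    plt.pie(sizes, labels=labels, autopct='%1.1f%%')
    plt.axis('equal')
    plt.show()
    ```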
To predict a food inspection outcome, you need to develop a model based on the violations. Because logistic regression is a binary classification method, it makes sense to group the result data into two categories: **Fail** and **Pass**:
- Pass
- Business not located
- Out of Business

    Data with the other results ("Business Not Located" or "Out of Business") aren't useful, and they make up a small percentage of the results anyway.

4. Run the following code to convert the existing dataframe (`df`) into a new dataframe where each inspection is represented as a label-violations pair. In this case, a label of `0.0` represents a failure, a label of `1.0` represents a success, and a label of `-1.0` represents some results besides those two.

    ```PySpark
    def labelForResults(s):
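        # Hedged completion of this helper, following the labels described in the step above.
        # The exact result strings (such as 'Pass w/ Conditions') are assumptions.
        if s == 'Fail':
            return 0.0
        elif s == 'Pass w/ Conditions' or s == 'Pass':
            return 1.0
        else:
            return -1.0
    ```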
## Create a logistic regression model from the input dataframe

The final task is to convert the labeled data into a format that can be analyzed by logistic regression. The input to a logistic regression algorithm needs to be a set of *label-feature vector pairs*, where the "feature vector" is a vector of numbers that represents the input point. So, you need to convert the "violations" column, which is semi-structured and contains many comments in free text, into an array of real numbers that a machine could easily understand.

One standard machine learning approach for processing natural language is to assign each distinct word an "index", and then pass a vector to the machine learning algorithm such that each index's value contains the relative frequency of that word in the text string.

MLlib provides an easy way to do this operation. First, "tokenize" each violations string to get the individual words in each string. Then, use a `HashingTF` to convert each set of tokens into a feature vector that can then be passed to the logistic regression algorithm to construct a model. You conduct all of these steps in sequence using a "pipeline".
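
A hedged sketch of such a pipeline, assuming the labeled dataframe is named `labeledData` and its text column is named `violations`; the column names and hyperparameters shown here are illustrative:

```PySpark
# Sketch only: tokenize the violations text, hash the tokens into a feature vector,
# and fit a logistic regression model, chained together as a pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="violations", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(labeledData)
```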
## Evaluate the model using another dataset

You can use the model you created earlier to *predict* what the results of new inspections will be, based on the violations that were observed. You trained this model on the dataset **Food_Inspections1.csv**. You can use a second dataset, **Food_Inspections2.csv**, to *evaluate* the strength of this model on the new data. This second data set (**Food_Inspections2.csv**) is in the default storage container associated with the cluster.
1. Run the following code to create a new dataframe, **predictionsDf**, that contains the prediction generated by the model. The snippet also creates a temporary table called **Predictions** based on the dataframe.
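
    A hedged sketch of this step, reusing the illustrative `csv_parse` helper and `schema` from the earlier snippets; the file path and column positions are assumptions:

    ```PySpark
    # Sketch only: parse the second inspections file, apply the trained model to it,
    # and register the predictions as a temporary table.
    testData = sc.textFile('wasbs:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections2.csv') \
                 .map(csv_parse) \
                 .map(lambda l: (int(l[0]), l[1], l[12], l[13]))
    testDf = sqlContext.createDataFrame(testData, schema)

    predictionsDf = model.transform(testDf)
    predictionsDf.registerTempTable('Predictions')
    ```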

    ```PySpark
    predictionsDf.columns
    ```

    You should see an output like the following text:

    ```
    ['id',
    ```

    ```PySpark
    predictionsDf.take(1)
    ```

    There's a prediction for the first entry in the test data set.

1. The `model.transform()` method applies the same transformation to any new data with the same schema, and arrives at a prediction of how to classify the data. You can do some simple statistics to get a sense of how accurate the predictions were:

    ```PySpark
    numSuccesses = predictionsDf.where("""(prediction = 0 AND results = 'Fail') OR
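                                          (prediction = 1 AND (results = 'Pass' OR
                                                               results = 'Pass w/ Conditions'))""").count()
    # Hedged completion of this cell: the second branch of the filter and the print
    # statements below are reconstructed to match the sample output shown next;
    # treat them as a sketch.
    numInspections = predictionsDf.count()

    print("There were", numInspections, "inspections and there were", numSuccesses, "successful predictions")
    print("This is a", str((float(numSuccesses) / float(numInspections)) * 100) + "%", "success rate")
    ```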

    ```
    There were 9315 inspections and there were 8087 successful predictions
    This is a 86.8169618894% success rate
    ```

    Using logistic regression with Spark gives you an accurate model of the relationship between violations descriptions in English and whether a given business would pass or fail a food inspection.
## Create a visual representation of the prediction
You can now construct a final visualization to help you reason about the results of this test.
1. You start by extracting the different predictions and results from the **Predictions** temporary table created earlier. The following queries separate the output as *true_positive*, *false_positive*, *true_negative*, and *false_negative*. In the queries below, you turn off visualization by using `-q` and also save the output (by using `-o`) as dataframes that can then be used with the `%%local` magic.
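
    A hedged sketch of one of these queries; the output dataframe name and the exact predicate are illustrative:

    ```PySpark
    %%sql -q -o true_positive
    SELECT count(*) AS cnt FROM Predictions WHERE prediction = 0 AND results = 'Fail'
    ```

    The other three dataframes come from similar queries that flip the `prediction` and `results` values for each combination.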
In this chart, a "positive" result refers to a failed food inspection, while a negative result refers to a passed inspection.
## Shut down the notebook

After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.
## Next steps
* [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)
### Scenarios

* [Apache Spark with BI: Interactive data analysis using Spark in HDInsight with BI tools](apache-spark-use-bi-tools.md)
* [Apache Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data](apache-spark-ipython-notebook-machine-learning.md)
* [Website log analysis using Apache Spark in HDInsight](apache-spark-custom-library-website-log-analysis.md)
### Create and run applications
* [Create a standalone application using Scala](apache-spark-create-standalone-application.md)
* [Run jobs remotely on an Apache Spark cluster using Apache Livy](apache-spark-livy-rest-interface.md)
### Tools and extensions
* [Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications](apache-spark-intellij-tool-plugin.md)
* [Use HDInsight Tools Plugin for IntelliJ IDEA to debug Apache Spark applications remotely](apache-spark-intellij-tool-plugin-debug-jobs-remotely.md)
* [Use Apache Zeppelin notebooks with an Apache Spark cluster on HDInsight](apache-spark-zeppelin-notebook.md)
* [Install Jupyter on your computer and connect to an HDInsight Spark cluster](apache-spark-jupyter-notebook-install-locally.md)
### Manage resources
* [Manage resources for the Apache Spark cluster in Azure HDInsight](apache-spark-resource-manager.md)
* [Track and debug jobs running on an Apache Spark cluster in HDInsight](apache-spark-job-debugging.md)