
Commit b28ac87

Merge pull request #111646 from dagiro/freshness62 (freshness62)
2 parents d24ae7f + 49cba6c

File tree: 1 file changed (+43 −39 lines)

articles/hdinsight/spark/apache-spark-machine-learning-mllib-ipython.md

Lines changed: 43 additions & 39 deletions
@@ -1,39 +1,39 @@
---
title: Machine learning example with Spark MLlib on HDInsight - Azure
description: Learn how to use Spark MLlib to create a machine learning app that analyzes a dataset using classification through logistic regression.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 04/16/2020
---

# Use Apache Spark MLlib to build a machine learning application and analyze a dataset

Learn how to use Apache Spark [MLlib](https://spark.apache.org/mllib/) to create a machine learning application that does predictive analysis on an open dataset. From Spark's built-in machine learning libraries, this example uses *classification* through logistic regression.

MLlib is a core Spark library that provides many utilities useful for machine learning tasks, such as:

* Classification
* Regression
* Clustering
* Topic modeling
* Singular value decomposition (SVD) and principal component analysis (PCA)
* Hypothesis testing and calculating sample statistics

## Understand classification and logistic regression

*Classification*, a popular machine learning task, is the process of sorting input data into categories. It's the job of a classification algorithm to figure out how to assign "labels" to input data that you provide. For example, you could think of a machine learning algorithm that accepts stock information as input and divides the stock into two categories: stocks that you should sell and stocks that you should keep.

Logistic regression is the algorithm that you use for classification. Spark's logistic regression API is useful for *binary classification*, or classifying input data into one of two groups. For more information about logistic regression, see [Wikipedia](https://en.wikipedia.org/wiki/Logistic_regression).

In summary, the process of logistic regression produces a *logistic function* that you can use to predict the probability that an input vector belongs in one group or the other.
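
To make the idea concrete, here's a minimal sketch of the logistic (sigmoid) function that such a model evaluates. It isn't part of the original walkthrough; the weights and intercept of a real model come from training:

```PySpark
from math import exp

def logistic(z):
    """Map a raw score z = w.x + b to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

print(logistic(0.0))   # 0.5: a score of zero means either group is equally likely
print(logistic(2.0))   # ~0.88: a strongly positive score favors the "1" group
```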

## Predictive analysis example on food inspection data

In this example, you use Spark to do some predictive analysis on food inspection data (**Food_Inspections1.csv**) acquired through the [City of Chicago data portal](https://data.cityofchicago.org/). This dataset contains information about food establishment inspections that were conducted in Chicago, including information about each establishment, the violations found (if any), and the results of the inspection. The CSV data file is already available in the storage account associated with the cluster at **/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv**.

In the steps below, you develop a model to see what it takes to pass or fail a food inspection.

@@ -51,11 +51,12 @@ In the steps below, you develop a model to see what it takes to pass or fail a f

```PySpark
# (tail of the notebook's import cell)
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
```

Because of the PySpark kernel, you don't need to create any contexts explicitly. The Spark and Hive contexts are automatically created when you run the first code cell.
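
For comparison, here's a hedged sketch of what you would otherwise write yourself, for example in a standalone PySpark script (not needed in the Jupyter PySpark kernel, which pre-creates `spark` and `sc`; the app name is illustrative):

```PySpark
from pyspark.sql import SparkSession

# Create (or reuse) a session; spark.sparkContext plays the role of `sc`.
spark = SparkSession.builder.appName("food-inspections").getOrCreate()
sc = spark.sparkContext
```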

## Construct the input dataframe

Use the Spark context to pull the raw CSV data into memory as unstructured text. Then use Python's CSV library to parse each line of the data.

1. Run the following lines to create a Resilient Distributed Dataset (RDD) by importing and parsing the input data.

@@ -67,7 +68,7 @@ Because the raw data is in a CSV format, you can use the Spark context to pull t

```PySpark
    # (tail of the csvParse helper, which parses one CSV line into a list of fields)
    value = csv.reader(sio).next()
    sio.close()
    return value

inspections = sc.textFile('/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv')\
                .map(csvParse)
```

@@ -100,22 +101,22 @@ Because the raw data is in a CSV format, you can use the Spark context to pull t

```
'(41.97583445690982, -87.7107455232781)']]
```

The output gives you an idea of the schema of the input file. It includes the name of every establishment, the type of establishment, the address, the date of the inspections, and the location, among other things.

3. Run the following code to create a dataframe (*df*) and a temporary table (*CountResults*) with a few columns that are useful for the predictive analysis. `sqlContext` is used to do transformations on structured data.

```PySpark
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("results", StringType(), False),
    StructField("violations", StringType(), True)])

df = spark.createDataFrame(inspections.map(lambda l: (int(l[0]), l[1], l[12], l[13])), schema)
df.registerTempTable('CountResults')
```

The four columns of interest in the dataframe are **id**, **name**, **results**, and **violations**.
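
To make that concrete, here's a hedged example (not one of the article's own snippets) of peeking at just these columns:

```PySpark
# Show a handful of rows for the four columns the model will use.
df.select("id", "name", "results", "violations").show(5, truncate=False)
```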

4. Run the following code to get a small sample of the data:

@@ -174,8 +175,7 @@ Let's start to get a sense of what the dataset contains.

![SQL query output](./media/apache-spark-machine-learning-mllib-ipython/spark-machine-learning-query-output.png "SQL query output")

3. You can also use [Matplotlib](https://en.wikipedia.org/wiki/Matplotlib), a library used to construct visualizations of data, to create a plot. Because the plot must be created from the locally persisted **countResultsdf** dataframe, the code snippet must begin with the `%%local` magic. This action ensures that the code is run locally on the Jupyter server.

```PySpark
%%local
# (the plotting code continues outside this diff hunk)
```

@@ -189,10 +189,6 @@ Let's start to get a sense of what the dataset contains.

```PySpark
# (tail of the %%local plotting cell)
plt.axis('equal')
```

To predict a food inspection outcome, you need to develop a model based on the violations. Because logistic regression is a binary classification method, it makes sense to group the result data into two categories: **Fail** and **Pass**:

- Pass

@@ -204,9 +200,9 @@ Let's start to get a sense of what the dataset contains.
- Business not located
- Out of Business

Data with the other results ("Business Not Located" or "Out of Business") aren't useful, and they make up a small percentage of the results anyway.

4. Run the following code to convert the existing dataframe (`df`) into a new dataframe where each inspection is represented as a label-violations pair. In this case, a label of `0.0` represents a failure, a label of `1.0` represents a success, and a label of `-1.0` represents some results besides those two.

```PySpark
def labelForResults(s):
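    # The function body falls outside this diff hunk. A hedged sketch based on the
    # labeling described above (0.0 = fail, 1.0 = pass, -1.0 = anything else); the
    # exact result strings grouped under each label are an assumption:
    if s == 'Fail':
        return 0.0
    elif s == 'Pass' or s == 'Pass w/ Conditions':
        return 1.0
    else:
        return -1.0
```
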
@@ -234,11 +230,11 @@ Let's start to get a sense of what the dataset contains.

## Create a logistic regression model from the input dataframe

The final task is to convert the labeled data into a format that can be analyzed by logistic regression. The input to a logistic regression algorithm needs a set of *label-feature vector pairs*, where the "feature vector" is a vector of numbers that represents the input point. So, you need to convert the "violations" column, which is semi-structured and contains many comments in free text, to an array of real numbers that a machine could easily understand.

One standard machine learning approach for processing natural language is to assign each distinct word an "index", and then pass a vector to the machine learning algorithm such that each index's value contains the relative frequency of that word in the text string.

MLlib provides an easy way to do this operation. First, "tokenize" each violations string to get the individual words in each string. Then, use a `HashingTF` to convert each set of tokens into a feature vector that can then be passed to the logistic regression algorithm to construct a model. You conduct all of these steps in sequence using a "pipeline".

```PySpark
tokenizer = Tokenizer(inputCol="violations", outputCol="words")
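# The rest of this cell falls outside the diff hunk. A hedged sketch of how the
# remaining stages might look, assuming the corresponding pyspark.ml imports from
# the notebook's import cell (stage names and parameter values are assumptions;
# `model = pipeline.fit(labeledData)` appears as context in the next hunk):
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(labeledData)
```
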
@@ -251,7 +247,7 @@ model = pipeline.fit(labeledData)

## Evaluate the model using another dataset

You can use the model you created earlier to *predict* what the results of new inspections will be, based on the violations that were observed. You trained this model on the dataset **Food_Inspections1.csv**. You can use a second dataset, **Food_Inspections2.csv**, to *evaluate* the strength of this model on the new data. This second data set (**Food_Inspections2.csv**) is in the default storage container associated with the cluster.

1. Run the following code to create a new dataframe, **predictionsDf**, that contains the prediction generated by the model. The snippet also creates a temporary table called **Predictions** based on the dataframe.
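
   The snippet itself falls outside this diff hunk. As a hedged sketch (the names `testData` and `testDf`, and the construction details, are assumptions), it might look roughly like this:

```PySpark
# Parse the second dataset with the same csvParse helper and schema as before.
testData = sc.textFile('/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections2.csv')\
             .map(csvParse)\
             .map(lambda l: (int(l[0]), l[1], l[12], l[13]))
testDf = spark.createDataFrame(testData, schema)

predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')
```
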
@@ -265,7 +261,7 @@ You can use the model you created earlier to *predict* what the results of new i

```PySpark
predictionsDf.columns
```

You should see an output like the following text:

```
['id',
 ...
```

@@ -285,8 +281,9 @@ You can use the model you created earlier to *predict* what the results of new i

```PySpark
predictionsDf.take(1)
```

There's a prediction for the first entry in the test data set.

1. The `model.transform()` method applies the same transformation to any new data with the same schema, and arrives at a prediction of how to classify the data. You can do some simple statistics to get a sense of how accurate the predictions were:

```PySpark
numSuccesses = predictionsDf.where("""(prediction = 0 AND results = 'Fail') OR
```

@@ -298,16 +295,17 @@ You can use the model you created earlier to *predict* what the results of new i

```PySpark
print "This is a", str((float(numSuccesses) / float(numInspections)) * 100) + "%", "success rate"
```

The output looks like the following text:

```
There were 9315 inspections and there were 8087 successful predictions
This is a 86.8169618894% success rate
```

Using logistic regression with Spark gives you a model of the relationship between violations descriptions in English and whether a given business would pass or fail a food inspection.

## Create a visual representation of the prediction

You can now construct a final visualization to help you reason about the results of this test.

1. You start by extracting the different predictions and results from the **Predictions** temporary table created earlier. The following queries separate the output as *true_positive*, *false_positive*, *true_negative*, and *false_negative*. In the queries below, you turn off visualization by using `-q` and also save the output (by using `-o`) as dataframes that can be then used with the `%%local` magic.
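
   The queries themselves fall outside this diff hunk. As a hedged sketch of the pattern, one such cell might look like the following (the SQL is illustrative; `prediction` and `results` are the columns created earlier, and a "positive" here means a failed inspection):

```PySpark
%%sql -q -o true_positive
SELECT count(*) AS cnt FROM Predictions WHERE prediction = 0 AND results = 'Fail'
```
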
@@ -353,21 +351,26 @@ You can now construct a final visualization to help you reason about the results

In this chart, a "positive" result refers to the failed food inspection, while a negative result refers to a passed inspection.

## Shut down the notebook

After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.

## Next steps

* [Overview: Apache Spark on Azure HDInsight](apache-spark-overview.md)

### Scenarios

* [Apache Spark with BI: Interactive data analysis using Spark in HDInsight with BI tools](apache-spark-use-bi-tools.md)
* [Apache Spark with Machine Learning: Use Spark in HDInsight for analyzing building temperature using HVAC data](apache-spark-ipython-notebook-machine-learning.md)
* [Website log analysis using Apache Spark in HDInsight](apache-spark-custom-library-website-log-analysis.md)

### Create and run applications

* [Create a standalone application using Scala](apache-spark-create-standalone-application.md)
* [Run jobs remotely on an Apache Spark cluster using Apache Livy](apache-spark-livy-rest-interface.md)

### Tools and extensions

* [Use HDInsight Tools Plugin for IntelliJ IDEA to create and submit Spark Scala applications](apache-spark-intellij-tool-plugin.md)
* [Use HDInsight Tools Plugin for IntelliJ IDEA to debug Apache Spark applications remotely](apache-spark-intellij-tool-plugin-debug-jobs-remotely.md)
* [Use Apache Zeppelin notebooks with an Apache Spark cluster on HDInsight](apache-spark-zeppelin-notebook.md)

@@ -376,5 +379,6 @@ After you have finished running the application, you should shut down the notebo

* [Install Jupyter on your computer and connect to an HDInsight Spark cluster](apache-spark-jupyter-notebook-install-locally.md)

### Manage resources

* [Manage resources for the Apache Spark cluster in Azure HDInsight](apache-spark-resource-manager.md)
* [Track and debug jobs running on an Apache Spark cluster in HDInsight](apache-spark-job-debugging.md)
