articles/hdinsight/spark/apache-spark-machine-learning-mllib-ipython.md: 13 additions and 13 deletions
@@ -4,12 +4,12 @@ description: Learn how to use Spark MLlib to create a machine learning app that
ms.service: hdinsight
ms.topic: how-to
ms.custom: hdinsightactive, devx-track-python
-ms.date: 06/23/2023
+ms.date: 04/08/2024
---

# Use Apache Spark MLlib to build a machine learning application and analyze a dataset

-Learn how to use Apache Spark MLlib to create a machine learning application. The application will do predictive analysis on an open dataset. From Spark's built-in machine learning libraries, this example uses *classification* through logistic regression.
+Learn how to use Apache Spark MLlib to create a machine learning application. The application does predictive analysis on an open dataset. From Spark's built-in machine learning libraries, this example uses *classification* through logistic regression.

MLlib is a core Spark library that provides many utilities useful for machine learning tasks, such as:
@@ -28,11 +28,11 @@ Logistic regression is the algorithm that you use for classification. Spark's lo

In summary, the process of logistic regression produces a *logistic function*. Use the function to predict the probability that an input vector belongs in one group or the other.

-## Predictive analysis example on food inspection data
+## Predictive analysis example of food inspection data

In this example, you use Spark to do some predictive analysis on food inspection data (**Food_Inspections1.csv**), acquired through the [City of Chicago data portal](https://data.cityofchicago.org/). This dataset contains information about food establishment inspections that were conducted in Chicago, including information about each establishment, the violations found (if any), and the results of the inspection. The CSV data file is already available in the storage account associated with the cluster at **/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv**.

-In the steps below, you develop a model to see what it takes to pass or fail a food inspection.
+In the following steps, you develop a model to see what it takes to pass or fail a food inspection.

## Create an Apache Spark MLlib machine learning app
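
To make the logistic-function idea in the summary above concrete, here is a minimal, hypothetical sketch of how a fitted logistic model turns a feature vector into a class probability. It is not code from the article; the weights, intercept, and feature values are made up for illustration.

```python
# Minimal sketch of a logistic function (hypothetical numbers, not from the article).
import math

def logistic(z):
    # Map any real number to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.8, -1.2]      # hypothetical learned coefficients
intercept = 0.3            # hypothetical learned intercept
features = [1.0, 2.0]      # an example "feature vector"

z = intercept + sum(w * x for w, x in zip(weights, features))
print(logistic(z))         # probability that the input belongs to the positive class
```

In the application itself, MLlib learns these coefficients from the labeled inspection data rather than hard-coding them.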
@@ -60,9 +60,9 @@ Use the Spark context to pull the raw CSV data into memory as unstructured text.
```PySpark
def csvParse(s):
    import csv
-    from StringIO import StringIO
+    from io import StringIO
    sio = StringIO(s)
-    value = csv.reader(sio).next()
+    value = next(csv.reader(sio))
    sio.close()
    return value
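
The hunk above shows only the `csvParse` helper. As a hedged sketch of how such a parser is typically applied, assuming the notebook's Spark context `sc` and the sample path quoted earlier in the article (the article's actual cell may differ):

```PySpark
# Hedged sketch: read the raw CSV text as lines and parse each line with csvParse.
# 'sc' is the notebook's SparkContext; the path is the sample location named in the article.
inspections = sc.textFile('/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv') \
                .map(csvParse)

inspections.take(1)   # each element is now a Python list of column values
```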
@@ -227,11 +227,11 @@ Let's start to get a sense of what the dataset contains.

## Create a logistic regression model from the input dataframe

-The final task is to convert the labeled data. Convert the data into a format that can be analyzed by logistic regression. The input to a logistic regression algorithm needs a set of *label-feature vector pairs*. Where the "feature vector" is a vector of numbers that represent the input point. So, you need to convert the "violations" column, which is semi-structured and contains many comments in free-text. Convert the column to an array of real numbers that a machine could easily understand.
+The final task is to convert the labeled data into a format that can be analyzed by logistic regression. The input to a logistic regression algorithm needs a set of *label-feature vector pairs*, where the "feature vector" is a vector of numbers that represents the input point. So, you need to convert the "violations" column, which is semi-structured and contains many free-text comments, into an array of real numbers that a machine could easily understand.

-One standard machine learning approach for processing natural language is to assign each distinct word an "index". Then pass a vector to the machine learning algorithm. Such that each index's value contains the relative frequency of that word in the text string.
+One standard machine learning approach for processing natural language is to assign each distinct word an index, and then pass a vector to the machine learning algorithm in which each index's value contains the relative frequency of that word in the text string.

-MLlib provides an easy way to do this operation. First, "tokenize" each violations string to get the individual words in each string. Then, use a `HashingTF` to convert each set of tokens into a feature vector that can then be passed to the logistic regression algorithm to construct a model. You conduct all of these steps in sequence using a "pipeline".
+MLlib provides an easy way to do this operation. First, "tokenize" each violations string to get the individual words in each string. Then, use a `HashingTF` to convert each set of tokens into a feature vector that can then be passed to the logistic regression algorithm to construct a model. You conduct all of these steps in sequence using a pipeline.
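
The next hunk shows only the `model = pipeline.fit(labeledData)` line, so here is a hedged sketch of the tokenize, `HashingTF`, and logistic regression pipeline that the revised paragraph describes. The column names and parameters are assumptions for illustration, not values taken from the article.

```PySpark
# Hedged sketch of the pipeline described above; column names and parameters are assumed.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="violations", outputCol="words")   # split free text into words
hashingTF = HashingTF(inputCol="words", outputCol="features")     # token counts -> feature vector
lr = LogisticRegression(maxIter=10, regParam=0.01)                # classify using the feature vector

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# model = pipeline.fit(labeledData)   # labeledData: the labeled dataframe built earlier in the notebook
```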
@@ -244,7 +244,7 @@ model = pipeline.fit(labeledData)

## Evaluate the model using another dataset

-You can use the model you created earlier to *predict* what the results of new inspections will be. The predictions are based on the violations that were observed. You trained this model on the dataset **Food_Inspections1.csv**. You can use a second dataset, **Food_Inspections2.csv**, to *evaluate* the strength of this model on the new data. This second data set (**Food_Inspections2.csv**) is in the default storage container associated with the cluster.
+You can use the model you created earlier to *predict* the results of new inspections. The predictions are based on the violations that were observed. You trained this model on the dataset **Food_Inspections1.csv**. You can use a second dataset, **Food_Inspections2.csv**, to *evaluate* the strength of this model on the new data. This second dataset (**Food_Inspections2.csv**) is in the default storage container associated with the cluster.

1. Run the following code to create a new dataframe, **predictionsDf**, that contains the predictions generated by the model. The snippet also creates a temporary table called **Predictions** based on the dataframe.
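
The code cell for the numbered step above is not part of this diff. As a hedged sketch of that step, assuming a dataframe `testDf` built from **Food_Inspections2.csv** with the same columns used for training:

```PySpark
# Hedged sketch: score the evaluation data with the fitted pipeline model and
# expose the results as a temporary table. 'testDf' is assumed to mirror the
# training schema (including a 'violations' column).
predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')

predictionsDf.select('prediction').show(5)   # 1.0 / 0.0 class labels predicted by the model
```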
@@ -288,8 +288,8 @@ You can use the model you created earlier to *predict* what the results of new i
results = 'Pass w/ Conditions'))""").count()
numInspections = predictionsDf.count()

-print "There were", numInspections, "inspections and there were", numSuccesses, "successful predictions"
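
The Python 3 replacement for the deleted `print` statement is not visible in this hunk. A hedged sketch of what that reporting step looks like with `print()` as a function (the article's exact wording and any success-rate calculation may differ):

```PySpark
# Hedged sketch: Python 3 form of the summary reporting shown in the hunk above.
print("There were", numInspections, "inspections and there were",
      numSuccesses, "successful predictions")
print("Success rate:", round(100.0 * numSuccesses / numInspections, 2), "%")
```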
@@ -349,7 +349,7 @@ You can now construct a final visualization to help you reason about the results

## Shut down the notebook

-After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.
+After running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.