
Commit 4a45d41

Merge pull request #271441 from sreekzz/patch-26
Modified code block as per User Review
2 parents 72bcd57 + d1c1af5 commit 4a45d41

1 file changed

articles/hdinsight/spark/apache-spark-machine-learning-mllib-ipython.md

Lines changed: 13 additions & 13 deletions
@@ -4,12 +4,12 @@ description: Learn how to use Spark MLlib to create a machine learning app that
 ms.service: hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive, devx-track-python
-ms.date: 06/23/2023
+ms.date: 04/08/2024
 ---

 # Use Apache Spark MLlib to build a machine learning application and analyze a dataset

-Learn how to use Apache Spark MLlib to create a machine learning application. The application will do predictive analysis on an open dataset. From Spark's built-in machine learning libraries, this example uses *classification* through logistic regression.
+Learn how to use Apache Spark MLlib to create a machine learning application. The application does predictive analysis on an open dataset. From Spark's built-in machine learning libraries, this example uses *classification* through logistic regression.

 MLlib is a core Spark library that provides many utilities useful for machine learning tasks, such as:

@@ -28,11 +28,11 @@ Logistic regression is the algorithm that you use for classification. Spark's lo

 In summary, the process of logistic regression produces a *logistic function*. Use the function to predict the probability that an input vector belongs in one group or the other.

-## Predictive analysis example on food inspection data
+## Predictive analysis example of food inspection data

 In this example, you use Spark to do some predictive analysis on food inspection data (**Food_Inspections1.csv**). The data was acquired through the [City of Chicago data portal](https://data.cityofchicago.org/). This dataset contains information about food establishment inspections that were conducted in Chicago, including information about each establishment, the violations found (if any), and the results of the inspection. The CSV data file is already available in the storage account associated with the cluster at **/HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv**.

-In the steps below, you develop a model to see what it takes to pass or fail a food inspection.
+In the following steps, you develop a model to see what it takes to pass or fail a food inspection.

 ## Create an Apache Spark MLlib machine learning app
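An editorial aside on the hunk above: the *logistic function* it references maps any real-valued score to a probability between 0 and 1. A minimal, self-contained sketch in plain Python; the weights and features are invented for illustration, not taken from the article:

```python
import math

def logistic(z):
    # Map a real-valued score z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model weights and input feature vector, for illustration only.
weights = [0.8, -1.5, 0.3]
features = [1.0, 2.0, 0.5]
score = sum(w * x for w, x in zip(weights, features))

# The logistic function turns the weighted sum into a class probability,
# for example the probability that an inspection results in "Fail".
print(round(logistic(score), 3))
```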

@@ -60,9 +60,9 @@ Use the Spark context to pull the raw CSV data into memory as unstructured text.
 ```PySpark
 def csvParse(s):
     import csv
-    from StringIO import StringIO
+    from io import StringIO
     sio = StringIO(s)
-    value = csv.reader(sio).next()
+    value = next(csv.reader(sio))
     sio.close()
     return value
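For readers porting similar code, a quick sketch of how the fixed Python 3 `csvParse` behaves; the sample input line is invented, not a row from the dataset:

```python
import csv
from io import StringIO

def csvParse(s):
    # Parse one CSV-formatted string into a list of field values.
    sio = StringIO(s)
    value = next(csv.reader(sio))  # Python 3: next(reader), not reader.next()
    sio.close()
    return value

# Hypothetical input, for illustration only.
print(csvParse('1234,"SOME RESTAURANT","Fail","32. SAMPLE VIOLATION COMMENT"'))
# ['1234', 'SOME RESTAURANT', 'Fail', '32. SAMPLE VIOLATION COMMENT']
```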
@@ -227,11 +227,11 @@ Let's start to get a sense of what the dataset contains.

 ## Create a logistic regression model from the input dataframe

-The final task is to convert the labeled data. Convert the data into a format that can be analyzed by logistic regression. The input to a logistic regression algorithm needs a set of *label-feature vector pairs*. Where the "feature vector" is a vector of numbers that represent the input point. So, you need to convert the "violations" column, which is semi-structured and contains many comments in free-text. Convert the column to an array of real numbers that a machine could easily understand.
+The final task is to convert the labeled data into a format that can be analyzed by logistic regression. The input to a logistic regression algorithm needs a set of *label-feature vector pairs*, where the "feature vector" is a vector of numbers that represents the input point. So, you need to convert the "violations" column, which is semi-structured and contains many free-text comments, to an array of real numbers that a machine could easily understand.

-One standard machine learning approach for processing natural language is to assign each distinct word an "index". Then pass a vector to the machine learning algorithm. Such that each index's value contains the relative frequency of that word in the text string.
+One standard machine learning approach for processing natural language is to assign each distinct word an index, then pass the machine learning algorithm a vector in which each index's value contains the relative frequency of that word in the text string.

-MLlib provides an easy way to do this operation. First, "tokenize" each violations string to get the individual words in each string. Then, use a `HashingTF` to convert each set of tokens into a feature vector that can then be passed to the logistic regression algorithm to construct a model. You conduct all of these steps in sequence using a "pipeline".
+MLlib provides an easy way to do this operation. First, "tokenize" each violations string to get the individual words in each string. Then, use a `HashingTF` to convert each set of tokens into a feature vector that can be passed to the logistic regression algorithm to construct a model. You conduct all of these steps in sequence using a pipeline.

 ```PySpark
 tokenizer = Tokenizer(inputCol="violations", outputCol="words")
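The hunk above ends mid-code-block. For context, a sketch of the pipeline that the `tokenizer` line opens, assuming the standard `pyspark.ml` API; the stage parameters shown are illustrative rather than confirmed values from the article:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Stage 1: split each free-text "violations" string into words.
tokenizer = Tokenizer(inputCol="violations", outputCol="words")

# Stage 2: hash each word list into a fixed-length term-frequency vector.
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

# Stage 3: fit a logistic regression model on the "features"/"label" columns.
lr = LogisticRegression(maxIter=10, regParam=0.01)

# Run the three stages in sequence with a single fit() call.
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(labeledData)  # labeledData: the labeled dataframe built earlier
```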
@@ -244,7 +244,7 @@ model = pipeline.fit(labeledData)

 ## Evaluate the model using another dataset

-You can use the model you created earlier to *predict* what the results of new inspections will be. The predictions are based on the violations that were observed. You trained this model on the dataset **Food_Inspections1.csv**. You can use a second dataset, **Food_Inspections2.csv**, to *evaluate* the strength of this model on the new data. This second data set (**Food_Inspections2.csv**) is in the default storage container associated with the cluster.
+You can use the model you created earlier to *predict* what the results of new inspections are. The predictions are based on the violations that were observed. You trained this model on the dataset **Food_Inspections1.csv**. You can use a second dataset, **Food_Inspections2.csv**, to *evaluate* the strength of this model on the new data. This second dataset is in the default storage container associated with the cluster.

 1. Run the following code to create a new dataframe, **predictionsDf**, that contains the prediction generated by the model. The snippet also creates a temporary table called **Predictions** based on the dataframe.

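The step described above relies on code outside this hunk. A hedged sketch of how a fitted pipeline model typically generates **predictionsDf** and the **Predictions** temporary table; `testDf` and the exact column names are assumptions, and the article may use an older temp-table API:

```python
# testDf: a dataframe built from Food_Inspections2.csv with the same schema
# as the training data (assumed to exist from the article's earlier steps).
predictionsDf = model.transform(testDf)

# Expose the predictions to SQL queries as a temporary table/view.
predictionsDf.createOrReplaceTempView("Predictions")

# Peek at a few predictions alongside the original inspection outcomes.
predictionsDf.select("prediction", "results", "violations").show(5)
```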
@@ -288,8 +288,8 @@ You can use the model you created earlier to *predict* what the results of new i
         results = 'Pass w/ Conditions'))""").count()
     numInspections = predictionsDf.count()

-    print "There were", numInspections, "inspections and there were", numSuccesses, "successful predictions"
-    print "This is a", str((float(numSuccesses) / float(numInspections)) * 100) + "%", "success rate"
+    print("There were", numInspections, "inspections and there were", numSuccesses, "successful predictions")
+    print("This is a", str((float(numSuccesses) / float(numInspections)) * 100) + "%", "success rate")
     ```

     The output looks like the following text:
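Since the hunk above ports these statements from Python 2 to Python 3, a stylistic alternative using f-strings may be worth noting; the counts below are invented purely to show the arithmetic:

```python
# Hypothetical counts, for illustration only.
numInspections = 10000
numSuccesses = 8700

# Python 3's / is true division, so the float() casts are no longer required.
success_rate = numSuccesses / numInspections * 100
print(f"There were {numInspections} inspections and there were {numSuccesses} successful predictions")
print(f"This is a {success_rate}% success rate")
```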
@@ -349,7 +349,7 @@ You can now construct a final visualization to help you reason about the results

 ## Shut down the notebook

-After you have finished running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.
+After running the application, you should shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.

 ## Next steps