Commit 4cc70e1 (parent ab26b5e)
Commit message: freshness43


articles/hdinsight/spark/apache-spark-ipython-notebook-machine-learning.md

Lines changed: 10 additions & 10 deletions
````diff
@@ -5,18 +5,18 @@ author: hrasheed-msft
 ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
-ms.custom: hdinsightactive,mvc
 ms.topic: tutorial
-ms.date: 06/26/2019
+ms.custom: hdinsightactive,mvc
+ms.date: 04/07/2020
 
 #customer intent: As a developer new to Apache Spark and to Apache Spark in Azure HDInsight, I want to learn how to create a simple machine learning Spark application.
 ---
 
 # Tutorial: Build an Apache Spark machine learning application in Azure HDInsight
 
-In this tutorial, you learn how to use the [Jupyter Notebook](https://jupyter.org/) to build an [Apache Spark](https://spark.apache.org/) machine learning application for Azure HDInsight.
+In this tutorial, you learn how to use the [Jupyter Notebook](https://jupyter.org/) to build an [Apache Spark](./apache-spark-overview.md) machine learning application for Azure HDInsight.
 
-[MLlib](https://spark.apache.org/docs/latest/ml-guide.html) is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
+[MLlib](https://spark.apache.org/docs/latest/ml-guide.html) is Spark's scalable machine learning library of common learning algorithms and utilities: classification, regression, clustering, collaborative filtering, and dimensionality reduction, plus the underlying optimization primitives.
 
 In this tutorial, you learn how to:
 > [!div class="checklist"]
````
````diff
@@ -30,13 +30,13 @@ In this tutorial, you learn how to:
 
 ## Understand the data set
 
-The application uses the sample **HVAC.csv** data that is available on all clusters by default. The file is located at `\HdiSamples\HdiSamples\SensorSampleData\hvac`. The data shows the target temperature and the actual temperature of some buildings that have HVAC systems installed. The **System** column represents the system ID and the **SystemAge** column represents the number of years the HVAC system has been in place at the building. Using the data, you can predict whether a building will be hotter or colder based on the target temperature, given a system ID, and system age.
+The application uses the sample **HVAC.csv** data that is available on all clusters by default. The file is located at `\HdiSamples\HdiSamples\SensorSampleData\hvac`. The data shows the target temperature and the actual temperature of some buildings that have HVAC systems installed. The **System** column represents the system ID and the **SystemAge** column represents the number of years the HVAC system has been in place at the building. You can predict whether a building will be hotter or colder based on the target temperature, given a system ID and a system age.
 
 ![Snapshot of data used for Spark machine learning example](./media/apache-spark-ipython-notebook-machine-learning/spark-machine-learning-understand-data.png "Snapshot of data used for Spark machine learning example")
````
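The paragraph revised in this hunk says you can predict whether a building runs hotter or colder than its target temperature. A minimal pure-Python sketch of that labeling idea (the helper name and sample values are illustrative, not taken from the tutorial's code):

```python
# Hypothetical labeling rule implied by the tutorial: a reading is
# "hot" (1.0) when the actual temperature exceeds the target, and
# "not hot" (0.0) otherwise.
def label_reading(target_temp, actual_temp):
    return 1.0 if actual_temp > target_temp else 0.0

# Illustrative (target, actual) pairs, not actual rows from HVAC.csv.
readings = [(66, 58), (70, 76), (67, 70)]
labels = [label_reading(t, a) for t, a in readings]
# labels == [0.0, 1.0, 1.0]
```

This matches the training output shown later in the article, where **label** is **0.0** when the actual temperature is below the target.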

3737
## Develop a Spark machine learning application using Spark MLlib
3838

39-
In this application, you use a Spark [ML pipeline](https://spark.apache.org/docs/2.2.0/ml-pipeline.html) to perform a document classification. ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines. In the pipeline, you split the document into words, convert the words into a numerical feature vector, and finally build a prediction model using the feature vectors and labels. Perform the following steps to create the application.
39+
This application uses a Spark [ML pipeline](https://spark.apache.org/docs/2.2.0/ml-pipeline.html) to do a document classification. ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames. The DataFrames help users create and tune practical machine learning pipelines. In the pipeline, you split the document into words, convert the words into a numerical feature vector, and finally build a prediction model using the feature vectors and labels. Do the following steps to create the application.
4040

4141
1. Create a Jupyter notebook using the PySpark kernel. For the instructions, see [Create a Jupyter notebook](./apache-spark-jupyter-spark-sql.md#create-a-jupyter-notebook).
4242
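The pipeline described in this hunk tokenizes a document and hashes the words into a fixed-length count vector before fitting a model. A conceptual pure-Python sketch of that featurization step (a stand-in for Spark's `Tokenizer` and `HashingTF` stages, not the tutorial's actual code):

```python
# Conceptual stand-in for Tokenizer + HashingTF: split text into words,
# then hash each word into one of `num_features` buckets and count hits.
def hashing_tf(text, num_features=16):
    vector = [0] * num_features
    for word in text.lower().split():
        vector[hash(word) % num_features] += 1
    return vector

features = hashing_tf("hot hot actual temperature above target")
# a 16-slot count vector; which slot each word lands in depends on the hash
```

Spark's real `HashingTF` applies the same idea at scale, producing a sparse feature-vector column that a downstream estimator such as `LogisticRegression` consumes.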

````diff
@@ -140,9 +140,9 @@ In this application, you use a Spark [ML pipeline](https://spark.apache.org/docs
 
 ![Output data snapshot for Spark machine learning example](./media/apache-spark-ipython-notebook-machine-learning/spark-machine-learning-output-data.png "Output data snapshot for Spark machine learning example")
 
-Notice how the actual temperature is less than the target temperature suggesting the building is cold. Hence in the training output, the value for **label** in the first row is **0.0**, which means the building is not hot.
+Notice how the actual temperature is less than the target temperature, suggesting the building is cold. The value for **label** in the first row is **0.0**, which means the building isn't hot.
 
-1. Prepare a data set to run the trained model against. To do so, you pass on a system ID and system age (denoted as **SystemInfo** in the training output), and the model predicts whether the building with that system ID and system age will be hotter (denoted by 1.0) or cooler (denoted by 0.0).
+1. Prepare a data set to run the trained model against. To do so, you pass in a system ID and a system age (denoted as **SystemInfo** in the training output). The model predicts whether the building with that system ID and system age will be hotter (denoted by 1.0) or cooler (denoted by 0.0).
 
     ```PySpark
     # SystemInfo here is a combination of system ID followed by system age
````
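The comment in this hunk notes that **SystemInfo** is the system ID followed by the system age. A tiny illustrative sketch of that convention (the helper is hypothetical, not part of the tutorial's code):

```python
# Hypothetical helper: build the "SystemInfo" test string the model
# scores, i.e. the system ID followed by the system age.
def make_system_info(system_id, system_age):
    return f"{system_id} {system_age}"

# Illustrative (ID, age) pairs, matching values seen in the sample output.
test_rows = [make_system_info(i, a) for i, a in [(20, 25), (7, 22)]]
# test_rows == ['20 25', '7 22']
```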
````diff
@@ -177,7 +177,7 @@ In this application, you use a Spark [ML pipeline](https://spark.apache.org/docs
 Row(SystemInfo=u'7 22', prediction=0.0, probability=DenseVector([0.5015, 0.4985]))
 ```
 
-From the first row in the prediction, you can see that for an HVAC system with ID 20 and system age of 25 years, the building is hot (**prediction=1.0**). The first value for DenseVector (0.49999) corresponds to the prediction 0.0 and the second value (0.5001) corresponds to the prediction 1.0. In the output, even though the second value is only marginally higher, the model shows **prediction=1.0**.
+Observe the first row of the prediction. For an HVAC system with ID 20 and a system age of 25 years, the building is hot (**prediction=1.0**). The first value in the DenseVector (0.49999) corresponds to prediction 0.0, and the second value (0.5001) corresponds to prediction 1.0. Even though the second value is only marginally higher, the model shows **prediction=1.0**.
 
 1. Shut down the notebook to release the resources. To do so, from the **File** menu on the notebook, select **Close and Halt**. This action shuts down and closes the notebook.
````
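The paragraph rewritten in this hunk explains that the predicted class is whichever probability in the DenseVector is larger, even by a tiny margin. A pure-Python sketch of that decision rule (illustrative, reusing the probabilities from the sample output):

```python
# The predicted class is the index of the largest probability:
# index 0 maps to prediction 0.0, index 1 to prediction 1.0.
def predict_from_probability(probability):
    return float(max(range(len(probability)), key=lambda i: probability[i]))

prediction = predict_from_probability([0.49999, 0.5001])
# prediction == 1.0, even though the margin is only about 0.0001
```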
````diff
@@ -199,7 +199,7 @@ If you're not going to continue to use this application, delete the cluster that
 
 1. Select **Delete**. Select **Yes**.
 
-![Azure portal delete an HDInsight cluster](./media/apache-spark-ipython-notebook-machine-learning/hdinsight-azure-portal-delete-cluster.png "Delete HDInsight cluster")
+![Deleting an HDInsight cluster in the Azure portal](./media/apache-spark-ipython-notebook-machine-learning/hdinsight-azure-portal-delete-cluster.png "Delete HDInsight cluster")
 
 ## Next steps
````
