
Commit 57cbfd7

Update apache-spark-azure-machine-learning-tutorial.md

Removing all Azure Synapse for Apache Spark 2.4 related content because the 2.4 runtime has been disabled.

1 parent: 01fefc4

1 file changed: 0 additions, 339 deletions
@@ -1,340 +1 @@
---
title: 'Tutorial: Train a model in Python with automated machine learning (deprecated)'
description: Tutorial on how to train a machine learning model in Python by using Apache Spark and automated machine learning (deprecated).
author: midesa
ms.service: azure-synapse-analytics
ms.topic: tutorial
ms.subservice: machine-learning
ms.custom: devx-track-python
ms.date: 03/06/2024
ms.author: midesa
---

# Tutorial: Train a model in Python with automated machine learning (deprecated)

Azure Machine Learning is a cloud-based environment that allows you to train, deploy, automate, manage, and track machine learning models.

In this tutorial, you use [automated machine learning](/azure/machine-learning/concept-automated-ml) in Azure Machine Learning to create a regression model to predict taxi fare prices. This process arrives at the best model by accepting training data and configuration settings, and automatically iterating through combinations of different methods, models, and hyperparameter settings.

In this tutorial, you learn how to:
- Download the data by using Apache Spark and Azure Open Datasets.
- Transform and clean data by using Apache Spark DataFrames.
- Train a regression model in automated machine learning.
- Calculate model accuracy.

## Before you begin

- Create a serverless Apache Spark 2.4 pool by following the [Create a serverless Apache Spark pool](../quickstart-create-apache-spark-pool-studio.md) quickstart.
- Complete the [Azure Machine Learning workspace setup](/azure/machine-learning/quickstart-create-resources) tutorial if you don't have an existing Azure Machine Learning workspace.

> [!WARNING]
> - Effective September 29, 2023, Azure Synapse discontinued official support for [Spark 2.4 runtimes](../spark/apache-spark-24-runtime.md). After September 29, 2023, support tickets related to Spark 2.4 aren't addressed, and there's no release pipeline for bug or security fixes for Spark 2.4. Using Spark 2.4 after the support cutoff date is at your own risk. We strongly discourage its continued use because of potential security and functionality concerns.
> - As part of the deprecation process for Apache Spark 2.4, AutoML in Azure Synapse Analytics is also deprecated. This includes both the low-code interface and the APIs used to create AutoML trials through code.
> - AutoML functionality was exclusively available through the Spark 2.4 runtime.
> - If you want to continue using AutoML capabilities, we recommend saving your data into your Azure Data Lake Storage Gen2 (ADLSg2) account. From there, you can seamlessly access the AutoML experience through Azure Machine Learning (AzureML). Further information regarding this workaround is available [here](../machine-learning/access-data-from-aml.md).

## Understand regression models

*Regression models* predict numerical output values based on independent predictors. In regression, the objective is to estimate the relationship between the independent predictor variables and the numeric outcome, quantifying how each predictor affects the predicted value.
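
As a minimal illustration of the idea (a sketch using scikit-learn, which this tutorial already uses for metrics; the toy numbers here are invented), a regression model learns a numeric mapping from predictors to a target:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: predict a fare-like amount from trip distance
X = [[1.0], [2.0], [4.0], [8.0]]   # predictor: trip distance (miles)
y = [5.5, 8.0, 13.0, 23.0]         # target: fare amount (dollars)

model = LinearRegression().fit(X, y)
print(model.predict([[3.0]]))      # estimated fare for a 3-mile trip
```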

### Example based on New York City taxi data

In this example, you use Spark to perform some analysis on taxi-trip tip data from New York City (NYC). The data is available through [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/). This subset of the dataset contains information about yellow taxi trips, including the start and end time and locations for each trip, and its cost.

> [!IMPORTANT]
> There might be additional charges for pulling this data from its storage location. In the following steps, you develop a model to predict NYC taxi fare prices.

## Download and prepare the data

Here's how:

1. Create a notebook by using the PySpark kernel. For instructions, see [Create a notebook](../quickstart-apache-spark-notebook.md#create-a-notebook).

    > [!NOTE]
    > Because of the PySpark kernel, you don't need to create any contexts explicitly. The Spark context is automatically created for you when you run the first code cell.
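
    For example, you can optionally confirm that the session is live by inspecting the automatically created `spark` and `sc` variables (a quick sanity check; the exact output depends on your pool's runtime):

    ```python
    # The PySpark kernel creates these for you; no explicit setup is needed
    print(spark.version)        # version of the automatically created Spark session
    print(sc.applicationId)     # ID of the automatically created Spark context
    ```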

2. Because the raw data is in Parquet format, you can use the Spark context to pull the file directly into memory as a DataFrame. Create a Spark DataFrame by retrieving the data via the Open Datasets API. Here, you use the Spark DataFrame *schema on read* properties to infer the datatypes and schema.

    ```python
    blob_account_name = "azureopendatastorage"
    blob_container_name = "nyctlc"
    blob_relative_path = "yellow"
    blob_sas_token = r""

    # Allow Spark to read from the blob remotely
    wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
    spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name), blob_sas_token)

    # Spark reads the Parquet files lazily; no data is loaded yet
    df = spark.read.parquet(wasbs_path)
    ```

3. Depending on the size of your Spark pool, the raw data might be too large or take too much time to operate on. You can filter this data down to something smaller, like a month of data, by using the `start_date` and `end_date` filters. After you filter a DataFrame, you also run the `describe()` function on the new DataFrame to see summary statistics for each field.

    Based on the summary statistics, you can see that there are some irregularities in the data. For example, the statistics show that the minimum trip distance is less than 0. You need to filter out these irregular data points.

    ```python
    # Create an ingestion filter
    start_date = '2015-01-01 00:00:00'
    end_date = '2015-12-31 00:00:00'

    filtered_df = df.filter('tpepPickupDateTime > "' + start_date + '" and tpepPickupDateTime < "' + end_date + '"')

    filtered_df.describe().show()
    ```

4. Generate features from the dataset by selecting a set of columns and creating various time-based features from the pickup `datetime` field. Filter out the outliers that were identified in the earlier step, and then remove the last few columns because they're unnecessary for training.

    ```python
    from datetime import datetime
    from pyspark.sql.functions import *

    # To make development easier, faster, and less expensive, downsample for now
    sampled_taxi_df = filtered_df.sample(True, 0.001, seed=1234)

    # Select the training columns, derive time-based features ('hh' yields 12-hour clock hours),
    # and filter out the outliers identified earlier
    taxi_df = sampled_taxi_df.select('vendorID', 'passengerCount', 'tripDistance', 'startLon', 'startLat', 'endLon',
                                     'endLat', 'paymentType', 'fareAmount', 'tipAmount',
                                     column('puMonth').alias('month_num'),
                                     date_format('tpepPickupDateTime', 'hh').alias('hour_of_day'),
                                     date_format('tpepPickupDateTime', 'EEEE').alias('day_of_week'),
                                     dayofmonth(col('tpepPickupDateTime')).alias('day_of_month'),
                                     (unix_timestamp(col('tpepDropoffDateTime')) - unix_timestamp(col('tpepPickupDateTime'))).alias('trip_time'))\
                             .filter((sampled_taxi_df.passengerCount > 0) & (sampled_taxi_df.passengerCount < 8)
                                     & (sampled_taxi_df.tipAmount >= 0)
                                     & (sampled_taxi_df.fareAmount >= 1) & (sampled_taxi_df.fareAmount <= 250)
                                     & (sampled_taxi_df.tipAmount < sampled_taxi_df.fareAmount)
                                     & (sampled_taxi_df.tripDistance > 0) & (sampled_taxi_df.tripDistance <= 200)
                                     & (sampled_taxi_df.rateCodeId <= 5)
                                     & (sampled_taxi_df.paymentType.isin({"1", "2"})))
    taxi_df.show(10)
    ```

    This code creates a new DataFrame with additional columns for the day of the month, pickup hour, weekday, and total trip time.

    ![Picture of taxi DataFrame.](./media/azure-machine-learning-spark-notebook/dataset.png#lightbox)
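
If you want to verify the derived columns yourself, an optional check is:

```python
# Optional: confirm the derived feature columns and their types
taxi_df.printSchema()
```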

## Generate test and validation datasets

After you have your final dataset, you can split the data into training and test sets by using the `randomSplit` function in Spark. By using the provided weights, this function randomly splits the data into the training dataset for model training and the validation dataset for testing.

```python
# Randomly split the dataset into training and validation sets
training_data, validation_data = taxi_df.randomSplit([0.8, 0.2], 223)
```

This step ensures that the data points used to test the finished model haven't been used to train the model.

## Connect to an Azure Machine Learning workspace

In Azure Machine Learning, the `Workspace` class accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. In this step, you create a workspace object from the existing Azure Machine Learning workspace.

```python
from azureml.core import Workspace

# Enter your subscription ID, resource group, and workspace name
subscription_id = "<enter your subscription ID>"   # you should be owner or contributor
resource_group = "<enter your resource group>"     # you should be owner or contributor
workspace_name = "<enter your workspace name>"     # your workspace name

ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)
```

## Convert a DataFrame to an Azure Machine Learning dataset

To submit a remote experiment, convert your dataset into an Azure Machine Learning `TabularDataset` instance. [TabularDataset](/python/api/azureml-core/azureml.data.tabulardataset) represents data in a tabular format by parsing the provided files.

The following code gets the existing workspace and the default Azure Machine Learning datastore. It then passes the datastore and file locations to the `path` parameter to create a new `TabularDataset` instance.

```python
from azureml.core import Dataset

# Get the Azure Machine Learning default datastore
datastore = ws.get_default_datastore()

# Write the training data to a local CSV file
training_data.toPandas().to_csv('training_pd.csv', index=False)

# Upload the file and convert it into an Azure Machine Learning tabular dataset
datastore.upload_files(files = ['training_pd.csv'],
                       target_path = 'train-dataset/tabular/',
                       overwrite = True,
                       show_progress = True)
dataset_training = Dataset.Tabular.from_delimited_files(path = [(datastore, 'train-dataset/tabular/training_pd.csv')])
```

![Picture of uploaded dataset.](./media/azure-machine-learning-spark-notebook/upload-dataset.png)
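
As an optional sanity check (assuming the upload succeeded), you can preview a few rows of the tabular dataset:

```python
# Optional: preview the first rows of the uploaded tabular dataset
print(dataset_training.take(3).to_pandas_dataframe())
```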

## Submit an automated experiment

The following sections walk you through the process of submitting an automated machine learning experiment.

### Define training settings

1. To submit an experiment, you need to define the experiment parameters and model settings for training. For the full list of settings, see [Configure automated machine learning experiments in Python](/azure/machine-learning/how-to-configure-auto-train).

    ```python
    import logging

    automl_settings = {
        "iteration_timeout_minutes": 10,
        "experiment_timeout_minutes": 30,
        "enable_early_stopping": True,
        "primary_metric": 'r2_score',
        "featurization": 'auto',
        "verbosity": logging.INFO,
        "n_cross_validations": 2}
    ```

1. Pass the defined training settings as keyword arguments to an `AutoMLConfig` object. Because you're using Spark, you must also pass the Spark context, which is automatically accessible through the `sc` variable. Additionally, you specify the training data and the type of model, which is regression in this case.

    ```python
    from azureml.train.automl import AutoMLConfig

    automl_config = AutoMLConfig(task='regression',
                                 debug_log='automated_ml_errors.log',
                                 training_data = dataset_training,
                                 spark_context = sc,
                                 model_explainability = False,
                                 label_column_name = "fareAmount",
                                 **automl_settings)
    ```

> [!NOTE]
> Automated machine learning pre-processing steps become part of the underlying model. These steps include feature normalization, handling missing data, and converting text to numeric. When you're using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

### Train the automated regression model

Next, you create an experiment object in your Azure Machine Learning workspace. An experiment acts as a container for your individual runs.

```python
from azureml.core.experiment import Experiment

# Start an experiment in Azure Machine Learning
experiment = Experiment(ws, "aml-synapse-regression")
tags = {"Synapse": "regression"}
local_run = experiment.submit(automl_config, show_output=True, tags=tags)

# Use the get_details function to retrieve the detailed output for the run
run_details = local_run.get_details()
```

When the experiment has finished, the output returns details about the completed iterations. For each iteration, you see the model type, the run duration, and the training accuracy. The `BEST` field tracks the best training score so far, based on your metric type.

![Screenshot of model output.](./media/azure-machine-learning-spark-notebook/model-output.png)
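
While the experiment is running, you can also monitor it interactively from the notebook. A minimal sketch, assuming the `azureml-widgets` package is installed in your environment:

```python
from azureml.widgets import RunDetails

# Optional: interactive progress view in a notebook (requires azureml-widgets)
RunDetails(local_run).show()
```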

> [!NOTE]
> After you submit the automated machine learning experiment, it runs various iterations and model types. This run typically takes 60 to 90 minutes.

### Retrieve the best model

To select the best model from your iterations, use the `get_output` function to return the best run and fitted model. This function can return the best run and fitted model overall, or for any logged metric or a particular iteration, as shown after the following code.

```python
# Get best model
best_run, fitted_model = local_run.get_output()
```
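
If you want the best run as measured by a specific metric, or the model from one iteration, `get_output` accepts optional arguments. For example (a sketch based on the `AutoMLRun` API; the metric and iteration number here are illustrative):

```python
# Best run and model as measured by a specific logged metric
metric_run, metric_model = local_run.get_output(metric='r2_score')

# Run and fitted model from a particular iteration
iteration_run, iteration_model = local_run.get_output(iteration=3)
```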

### Test model accuracy

1. To test the model accuracy, use the best model to run taxi fare predictions on the test dataset. The `predict` function uses the best model and predicts the values of `y` (fare amount) from the validation dataset.

    ```python
    # Test best model accuracy
    validation_data_pd = validation_data.toPandas()
    y_test = validation_data_pd.pop("fareAmount").to_frame()
    y_predict = fitted_model.predict(validation_data_pd)
    ```

1. The root-mean-square error (RMSE) is a frequently used measure of the differences between values predicted by a model and the values observed. You calculate the root-mean-square error of the results by comparing the `y_test` DataFrame to the values predicted by the model.

    The function `mean_squared_error` takes two arrays and calculates the average squared error between them. You then take the square root of the result. This metric indicates roughly how far the taxi fare predictions are from the actual fare values.

    ```python
    from sklearn.metrics import mean_squared_error
    from math import sqrt

    # Calculate root-mean-square error
    y_actual = y_test.values.flatten().tolist()
    rmse = sqrt(mean_squared_error(y_actual, y_predict))

    print("Root Mean Square Error:")
    print(rmse)
    ```

    ```Output
    Root Mean Square Error:
    2.309997102577151
    ```

    The root-mean-square error is a good measure of how accurately the model predicts the response. From the results, you see that the model is fairly good at predicting taxi fares from the dataset's features, typically within $2.00.

1. Run the following code to calculate the mean-absolute-percent error (MAPE). This metric expresses accuracy as a percentage of the error. It does this by calculating an absolute difference between each predicted and actual value and then summing all the differences. Then, it expresses that sum as a percentage of the total of the actual values.

    ```python
    # Calculate mean-absolute-percent error and model accuracy
    sum_actuals = sum_errors = 0

    for actual_val, predict_val in zip(y_actual, y_predict):
        abs_error = abs(actual_val - predict_val)
        sum_errors = sum_errors + abs_error
        sum_actuals = sum_actuals + actual_val

    mean_abs_percent_error = sum_errors / sum_actuals

    print("Model MAPE:")
    print(mean_abs_percent_error)
    print()
    print("Model Accuracy:")
    print(1 - mean_abs_percent_error)
    ```

    ```Output
    Model MAPE:
    0.03655071038487368

    Model Accuracy:
    0.9634492896151263
    ```

    From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the dataset's features.

1. After fitting a regression model, you now need to determine how well the model fits the data. To do this, you plot the actual fare values against the predicted output. In addition, you calculate the R-squared measure to understand how close the data is to the fitted regression line.

    ```python
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import r2_score

    # Calculate the R2 score by using the predicted and actual fare prices
    y_test_actual = y_test["fareAmount"]
    r2 = r2_score(y_test_actual, y_predict)

    # Plot the actual versus predicted fare amount values
    plt.style.use('ggplot')
    plt.figure(figsize=(10, 7))
    plt.scatter(y_test_actual, y_predict)
    plt.plot([np.min(y_test_actual), np.max(y_test_actual)], [np.min(y_test_actual), np.max(y_test_actual)], color='lightblue')
    plt.xlabel("Actual Fare Amount")
    plt.ylabel("Predicted Fare Amount")
    plt.title("Actual vs Predicted Fare Amount R^2={}".format(r2))
    plt.show()
    ```

    ![Screenshot of a regression plot.](./media/azure-machine-learning-spark-notebook/fare-amount.png)

    From the results, you can see that the R-squared measure accounts for 95 percent of the variance. This is also validated by the plot of actual versus predicted values. The more variance that the regression model accounts for, the closer the data points fall to the fitted regression line.

## Register the model to Azure Machine Learning

After you've validated your best model, you can register it to Azure Machine Learning. Then, you can download or deploy the registered model and receive all the files that you registered.

```python
description = 'My automated ML model'
model_path = 'outputs/model.pkl'
model = best_run.register_model(model_name = 'NYCYellowTaxiModel', model_path = model_path, description = description)
print(model.name, model.version)
```

```Output
NYCYellowTaxiModel 1
```
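
If you later want the registered files locally, one option is to download them (a sketch using the `Model` class; the target folder name here is arbitrary):

```python
from azureml.core.model import Model

# Retrieve the registered model from the workspace and download its files
registered_model = Model(ws, name='NYCYellowTaxiModel')
registered_model.download(target_dir='model_files', exist_ok=True)
```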

## View results in Azure Machine Learning

You can also access the results of the iterations by going to the experiment in your Azure Machine Learning workspace. Here, you can get additional details on the status of your run, attempted models, and other model metrics.

![Screenshot of an Azure Machine Learning workspace.](./media/azure-machine-learning-spark-notebook/azure-machine-learning-workspace.png)
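
To jump straight to the run in the portal from your notebook, `get_portal_url` returns a direct link:

```python
# Print a direct link to this run in the Azure Machine Learning portal
print(local_run.get_portal_url())
```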

## Next steps

- [Azure Synapse Analytics](../index.yml)
- [Tutorial: Build a machine learning app with Apache Spark MLlib and Azure Synapse Analytics](./apache-spark-machine-learning-mllib-notebook.md)
