Commit 12e0c9a

Merge pull request #264767 from ssalgadodev/patch-66
Update how-to-auto-train-models-v1.md
2 parents e777911 + ba9684c commit 12e0c9a

1 file changed: +10 -10 lines changed

articles/machine-learning/v1/how-to-auto-train-models-v1.md

Lines changed: 10 additions & 10 deletions
@@ -9,7 +9,7 @@ ms.topic: how-to
 author: manashgoswami
 ms.author: magoswam
 ms.reviewer: ssalgado
-ms.date: 11/04/2022
+ms.date: 01/25/2023
 ms.custom: UpdateFrequency5, devx-track-python, automl, FY21Q4-aml-seo-hack, contperf-fy21q4, sdkv1, event-tier1-build-2022
 ---

@@ -23,7 +23,7 @@ This process accepts training data and configuration settings, and automatically

 ![Flow diagram](./media/how-to-auto-train-models/flow2.png)

-You'll write code using the Python SDK in this article. You'll learn the following tasks:
+You write code using the Python SDK in this article. You learn the following tasks:

 > [!div class="checklist"]
 > * Download, transform, and clean data using Azure Open Datasets
@@ -63,7 +63,7 @@ from datetime import datetime
 from dateutil.relativedelta import relativedelta
 ```

-Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid `MemoryError` with large datasets.
+Begin by creating a dataframe to hold the taxi data. When you work in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid `MemoryError` with large datasets.

 To download taxi data, iteratively fetch one month at a time, and before appending it to `green_taxi_df` randomly sample 2,000 records from each month to avoid bloating the dataframe. Then preview the data.

@@ -94,7 +94,7 @@ green_taxi_df.head(10)
 |150436|2|2015-01-11 17:15:14|2015-01-11 17:22:57|1|1.19|None|None|-73.94|40.71|-73.95|...|1|7.00|0.00|0.50|0.3|1.75|0.00|nan|9.55|
 |432136|2|2015-01-22 23:16:33 2015-01-22 23:20:13 1 0.65|None|None|-73.94|40.71|-73.94|...|2|5.00|0.50|0.50|0.3|0.00|0.00|nan|6.30|

-Remove some of the columns that you won't need for training or additional feature building. Automate machine learning will automatically handle time-based features such as **lpepPickupDatetime**.
+Remove some of the columns that you won't need for training or other feature building. Automate machine learning will automatically handle time-based features such as **lpepPickupDatetime**.

 ```python
 columns_to_remove = ["lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
@@ -127,7 +127,7 @@ green_taxi_df.describe()
 |max|2.00|9.00|97.57|0.00|41.93|0.00|41.94|450.00|12.00|30.00|


-From the summary statistics, you see that there are several fields that have outliers or values that will reduce model accuracy. First filter the lat/long fields to be within the bounds of the Manhattan area. This will filter out longer taxi trips or trips that are outliers in respect to their relationship with other features.
+From the summary statistics, you see that there are several fields that have outliers or values that reduce model accuracy. First filter the lat/long fields to be within the bounds of the Manhattan area. This filters out longer taxi trips or trips that are outliers in respect to their relationship with other features.

 Additionally filter the `tripDistance` field to be greater than zero but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates long outlier trips that have inconsistent trip cost.

@@ -186,17 +186,17 @@ To automatically train a model, take the following steps:

 ### Define training settings

-Define the experiment parameter and model settings for training. View the full list of [settings](how-to-configure-auto-train.md). Submitting the experiment with these default settings will take approximately 5-20 min, but if you want a shorter run time, reduce the `experiment_timeout_hours` parameter.
+Define the experiment parameter and model settings for training. View the full list of [settings](how-to-configure-auto-train.md). Submitting the experiment with these default settings take approximately 5-20 min, but if you want a shorter run time, reduce the `experiment_timeout_hours` parameter.

 |Property| Value in this article |Description|
 |----|----|---|
 |**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Increase this value for larger datasets that need more time for each iteration.|
 |**experiment_timeout_hours**|0.3|Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|
-|**enable_early_stopping**|True|Flag to enable early termination if the score is not improving in the short term.|
-|**primary_metric**| spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|
+|**enable_early_stopping**|True|Flag to enable early termination if the score isn't improving in the short term.|
+|**primary_metric**| spearman_correlation | Metric that you want to optimize. The best-fit model is chosen based on this metric.|
 |**featurization**| auto | By using **auto**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|
 |**verbosity**| logging.INFO | Controls the level of logging.|
-|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data is not specified.|
+|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data isn't specified.|

 ```python
 import logging
@@ -363,7 +363,7 @@ The traditional machine learning model development process is highly resource-in

 ## Clean up resources

-Do not complete this section if you plan on running other Azure Machine Learning tutorials.
+Don't complete this section if you plan on running other Azure Machine Learning tutorials.

 ### Stop the compute instance
