You write code using the Python SDK in this article. You learn how to do the following tasks:
> [!div class="checklist"]
> * Download, transform, and clean data using Azure Open Datasets
```python
from datetime import datetime
from dateutil.relativedelta import relativedelta
```
Begin by creating a dataframe to hold the taxi data. When you work in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes, to avoid `MemoryError` with large datasets.

To download taxi data, iteratively fetch one month at a time, and before appending it to `green_taxi_df`, randomly sample 2,000 records from each month to avoid bloating the dataframe. Then preview the data.
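The loop below is a minimal sketch of that approach. It assumes the `NycTlcGreen` class from the `azureml-opendatasets` package is available in your environment, and the 2015 date range is only a placeholder:

```python
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from azureml.opendatasets import NycTlcGreen

green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2015", "%m/%d/%Y")
end = datetime.strptime("1/31/2015", "%m/%d/%Y")

# Fetch one month at a time; keep only a 2,000-record random sample of each
# month so the combined dataframe stays small.
for sample_month in range(12):
    temp_df_green = NycTlcGreen(
        start_date=start + relativedelta(months=sample_month),
        end_date=end + relativedelta(months=sample_month),
    ).to_pandas_dataframe()
    green_taxi_df = pd.concat([green_taxi_df, temp_df_green.sample(2000)])

green_taxi_df.head(10)
```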
Remove some of the columns that you won't need for training or other feature building. Automated machine learning automatically handles time-based features such as **lpepPickupDatetime**.
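As a minimal sketch of that cleanup, using a few hypothetical column names (adjust the list to whichever columns your dataframe doesn't need):

```python
# Hypothetical column list; replace with the columns you want to drop.
columns_to_remove = ["lpepDropoffDatetime", "storeAndFwdFlag", "mtaTax"]
green_taxi_df = green_taxi_df.drop(columns=columns_to_remove, errors="ignore")
green_taxi_df.describe()
```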
From the summary statistics, you see that several fields have outliers or values that reduce model accuracy. First, filter the lat/long fields to be within the bounds of the Manhattan area. This filters out longer taxi trips or trips that are outliers with respect to other features.
Additionally, filter the `tripDistance` field to be greater than zero but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates long outlier trips that have inconsistent trip cost.
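A sketch of both filters using pandas `query`; the Manhattan bounding-box coordinates and the column names are illustrative, so adjust them to match your data:

```python
# Approximate bounding box for Manhattan (illustrative values).
final_df = green_taxi_df.query("pickupLatitude >= 40.53 and pickupLatitude <= 40.88")
final_df = final_df.query("pickupLongitude >= -74.09 and pickupLongitude <= -73.72")
final_df = final_df.query("dropoffLatitude >= 40.53 and dropoffLatitude <= 40.88")
final_df = final_df.query("dropoffLongitude >= -74.09 and dropoffLongitude <= -73.72")

# Keep only trips with a positive distance shorter than 31 miles.
final_df = final_df.query("tripDistance > 0 and tripDistance < 31")
```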
To automatically train a model, take the following steps:
### Define training settings
Define the experiment parameters and model settings for training. View the full list of [settings](how-to-configure-auto-train.md). Submitting the experiment with these default settings takes approximately 5-20 minutes, but if you want a shorter run time, reduce the `experiment_timeout_hours` parameter.
|Property| Value in this article |Description|
|----|----|---|
|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Increase this value for larger datasets that need more time for each iteration.|
|**experiment_timeout_hours**|0.3|Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|
|**enable_early_stopping**|True|Flag to enable early termination if the score isn't improving in the short term.|
|**primary_metric**| spearman_correlation | Metric that you want to optimize. The best-fit model is chosen based on this metric.|
|**featurization**| auto | By using **auto**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|
|**verbosity**| logging.INFO | Controls the level of logging.|
|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data isn't specified.|
```python
import logging
```
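As a sketch of how the settings from the table can be collected into a dictionary and handed to an AutoML configuration, assuming the `AutoMLConfig` class from `azureml.train.automl`, the filtered `final_df` dataframe from earlier, and `totalAmount` as the label column to predict:

```python
automl_settings = {
    "iteration_timeout_minutes": 10,
    "experiment_timeout_hours": 0.3,
    "enable_early_stopping": True,
    "primary_metric": "spearman_correlation",
    "featurization": "auto",
    "verbosity": logging.INFO,
    "n_cross_validations": 5,
}

# Assumed: AutoMLConfig from azureml.train.automl, a prepared training
# dataframe (final_df), and "totalAmount" as the target column.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="regression",
    training_data=final_df,
    label_column_name="totalAmount",
    **automl_settings,
)
```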
The traditional machine learning model development process is highly resource-intensive.
## Clean up resources
Don't complete this section if you plan to run other Azure Machine Learning tutorials.