Commit 9be2660

Trevor Bye committed: removed data prep code. edits for clarity
1 parent 47e3859 commit 9be2660

1 file changed: +29 -56 lines changed

articles/machine-learning/service/how-to-configure-auto-train.md

Lines changed: 29 additions & 56 deletions
@@ -64,8 +64,10 @@ automl_config = AutoMLConfig(task="classification")
 ```
 
 ## Data source and format
+
 Automated machine learning supports data that resides on your local desktop or in the cloud such as Azure Blob Storage. The data can be read into scikit-learn supported data formats. You can read the data into:
-* Numpy arrays X (features) and y (target variable or also known as label)
+
+* Numpy arrays X (features) and y (target variable, also known as label)
 * Pandas dataframe
 
 >[!Important]
@@ -88,55 +90,25 @@ Examples:
 ```python
 import pandas as pd
 from sklearn.model_selection import train_test_split
+
 df = pd.read_csv("https://automldemods.blob.core.windows.net/datasets/PlayaEvents2016,_1.6MB,_3.4k-rows.cleaned.2.tsv", delimiter="\t", quotechar='"')
-# get integer labels
-y = df["Label"]
-df = df.drop(["Label"], axis=1)
-df_train, _, y_train, _ = train_test_split(df, y, test_size=0.1, random_state=42)
+y_df = df["Label"]
+x_df = df.drop(["Label"], axis=1)
+x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.1, random_state=42)
 ```
 
 ## Fetch data for running experiment on remote compute
 
-For remote executions, you need to make the data accessible from the remote compute. This can be done by uploading the data to DataStore.
-
-Here is an example of using `datastore`:
-
-```python
-import pandas as pd
-from sklearn import datasets
-
-data_train = datasets.load_digits()
-
-pd.DataFrame(data_train.data[100:,:]).to_csv("data/X_train.csv", index=False)
-pd.DataFrame(data_train.target[100:]).to_csv("data/y_train.csv", index=False)
-
-ds = ws.get_default_datastore()
-ds.upload(src_dir='./data', target_path='digitsdata', overwrite=True, show_progress=True)
-```
-
-### Define dprep references
+For remote executions, training data must be accessible from the remote compute. The class [`Datasets`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) in the SDK exposes functionality to:
 
-Define X and y as dprep reference, which will be passed to automated machine learning `AutoMLConfig` object similar to below:
+* easily transfer data from static files or URL sources into your workspace
+* make your data available to training scripts when running on cloud compute resources
 
-```python
-
-X = dprep.auto_read_file(path=ds.path('digitsdata/X_train.csv'))
-y = dprep.auto_read_file(path=ds.path('digitsdata/y_train.csv'))
-
-
-automl_config = AutoMLConfig(task = 'classification',
-debug_log = 'automl_errors.log',
-path = project_folder,
-run_configuration=conda_run_config,
-X = X,
-y = y,
-**automl_settings
-)
-```
+See the [how-to](how-to-train-with-datasets.md#option-2--mount-files-to-a-remote-compute-target) for an example of using the `Dataset` class to mount data to your compute target.
 
 ## Train and validation data
 
-You can specify separate train and validation set directly in the `AutoMLConfig` method.
+You can specify separate train and validation sets directly in the `AutoMLConfig` constructor.
 
 ### K-Folds Cross Validation
 
@@ -170,7 +142,7 @@ There are several options that you can use to configure your automated machine l
 
 Some examples include:
 
-1. Classification experiment using AUC weighted as the primary metric with a max time of 12,000 seconds per iteration, with the experiment to end after 50 iterations and 2 cross validation folds.
+1. Classification experiment using AUC weighted as the primary metric with a max time of 12,000 seconds per iteration, with the experiment to end after 50 iterations and 2 cross-validation folds.
 
 ```python
 automl_classifier = AutoMLConfig(
@@ -197,12 +169,10 @@ Some examples include:
 n_cross_validations=5)
 ```
 
-The three different `task` parameter values determine the list of models to apply. Use the `whitelist` or `blacklist` parameters to further modify iterations with the available models to include or exclude. The list of supported models can be found on [SupportedModels Class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.constants.supportedmodels?view=azure-ml-py).
+The three different `task` parameter values (the third task-type is `forecasting`, and uses the same algorithm pool as `regression` tasks) determine the list of models to apply. Use the `whitelist` or `blacklist` parameters to further modify iterations with the available models to include or exclude. The list of supported models can be found on [SupportedModels Class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.constants.supportedmodels?view=azure-ml-py).
 
 ### Primary Metric
-The primary metric; as shown in the examples above determines the metric to be used during model training for optimization. The primary metric you can select is determined by the task type you choose. Below is a list of available metrics.
-
-Learn about the specific definitions of these in [Understand automated machine learning results](how-to-understand-automated-ml.md).
+The primary metric determines the metric to be used during model training for optimization. The available metrics you can select is determined by the task type you choose, and the following table shows valid primary metrics for each task type.
 
 |Classification | Regression | Time Series Forecasting
 |-- |-- |--
@@ -212,9 +182,11 @@ Learn about the specific definitions of these in [Understand automated machine l
 |norm_macro_recall | normalized_mean_absolute_error | normalized_mean_absolute_error
 |precision_score_weighted |
 
+Learn about the specific definitions of these in [Understand automated machine learning results](how-to-understand-automated-ml.md).
+
 ### Data preprocessing & featurization
 
-In every automated machine learning experiment, your data is [automatically scaled and normalized](concept-automated-ml.md#preprocess) to help algorithms perform well. However, you can also enable additional preprocessing/featurization, such as missing values imputation, encoding, and transforms. [Learn more about what featurization is included](how-to-create-portal-experiments.md#preprocess).
+In every automated machine learning experiment, your data is [automatically scaled and normalized](concept-automated-ml.md#preprocess) to help *certain* algorithms that are sensitive to features that are on different scales. However, you can also enable additional preprocessing/featurization, such as missing values imputation, encoding, and transforms. [Learn more about what featurization is included](how-to-create-portal-experiments.md#preprocess).
 
 To enable this featurization, specify `"preprocess": True` for the [`AutoMLConfig` class](https://docs.microsoft.com/python/api/azureml-train-automl/azureml.train.automl.automlconfig?view=azure-ml-py).
 
@@ -225,12 +197,13 @@ To enable this featurization, specify `"preprocess": True` for the [`AutoMLConfi
 > your input data automatically.
 
 ### Time Series Forecasting
-For time series forecasting task type you have additional parameters to define.
-1. time_column_name - This is a required parameter which defines the name of the column in your training data containing date/time series.
-1. max_horizon - This defines the length of time you want to predict out based on the periodicity of the training data. For example if you have training data with daily time grains, you define how far out in days you want the model to train for.
-1. grain_column_names - This defines the name of columns which contain individual time series data in your training data. For example, if you are forecasting sales of a particular brand by store, you would define store and brand columns as your grain columns.
+The time series `forecasting` task requires additional parameters in the configuration object:
+
+1. `time_column_name`: Required parameter that defines the name of the column in your training data containing a valid time-series.
+1. `max_horizon`: Defines the length of time you want to predict out based on the periodicity of the training data. For example if you have training data with daily time grains, you define how far out in days you want the model to train for.
+1. `grain_column_names`: Defines the name of columns which contain individual time series data in your training data. For example, if you are forecasting sales of a particular brand by store, you would define store and brand columns as your grain columns. Separate time-series and forecasts will be created for each grain/grouping.
 
-See example of these settings being used below, notebook example is available [here](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb).
+For examples of the settings used below, see the [sample notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb).
 
 ```python
 # Setting Store and Brand as grains for training.
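
Note on the `grain_column_names` wording in this hunk: the added sentence says a separate time series is created for each grain. A minimal sketch of that grouping in plain Python may help reviewers picture it. Everything here is an assumption made for illustration: the rows, the column names (`store`, `brand`, `week`, `sales`), and the helper `split_by_grain` are hypothetical, and this is not how the SDK groups data internally.

```python
from collections import defaultdict


def split_by_grain(rows, grain_column_names, time_column_name):
    """Group rows into one time-ordered series per grain (hypothetical helper)."""
    series = defaultdict(list)
    for row in rows:
        # The grain key is the tuple of values in the grain columns.
        key = tuple(row[col] for col in grain_column_names)
        series[key].append(row)
    # Order each series by its time column (ISO date strings sort correctly).
    for points in series.values():
        points.sort(key=lambda r: r[time_column_name])
    return dict(series)


# Made-up sample data: weekly sales by store and brand.
rows = [
    {"store": 1, "brand": "A", "week": "2019-01-14", "sales": 12},
    {"store": 1, "brand": "A", "week": "2019-01-07", "sales": 10},
    {"store": 1, "brand": "B", "week": "2019-01-07", "sales": 7},
    {"store": 2, "brand": "A", "week": "2019-01-07", "sales": 9},
]

series = split_by_grain(rows, ["store", "brand"], "week")
for grain, points in series.items():
    print(grain, [p["sales"] for p in points])
```

With store and brand as grains, each distinct (store, brand) pair becomes its own series, which matches the "sales of a particular brand by store" example in the changed text.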
@@ -339,11 +312,11 @@ run = experiment.submit(automl_config, show_output=True)
 >Setting `show_output` to `True` results in output being shown on the console.
 
 ### Exit Criteria
-There a few options you can define to complete your experiment.
-1. No Criteria - If you do not define any exit parameters the experiment will continue until no further progress is made on your primary metric.
-1. Number of iterations - You define the number of iterations for the experiment to run. You can optional add iteration_timeout_minutes to define a time limit in minutes per each iteration.
-1. Exit after a length of time - Using experiment_timeout_minutes in your settings you can define how long in minutes should an experiment continue in run.
-1. Exit after a score has been reached - Using experiment_exit_score you can choose to complete the experiment after a score based on your primary metric has been reached.
+There are a few options you can define to end your experiment.
+1. No Criteria: If you do not define any exit parameters the experiment will continue until no further progress is made on your primary metric.
+1. Number of iterations: You define the number of iterations for the experiment to run. You can optionally add `iteration_timeout_minutes` to define a time limit in minutes per each iteration.
+1. Exit after a length of time: Using `experiment_timeout_minutes` in your settings allows you to define how long in minutes should an experiment continue in run.
+1. Exit after a score has been reached: Using `experiment_exit_score` will complete the experiment after a primary metric score has been reached.
 
 ### Explore model metrics
 
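
Note on the Exit Criteria list in the last hunk: the three explicit criteria can be read as a simple "stop when any configured limit is reached" rule. The sketch below is a hypothetical plain-Python illustration of that reading, not the SDK's implementation. The helper `should_stop` is invented for this note, `iterations` is assumed to be the setting behind "Number of iterations", and the score check assumes a primary metric where higher is better. The "No Criteria" case (run until the metric stops improving) is not modeled.

```python
from typing import Optional


def should_stop(
    completed_iterations: int,
    elapsed_minutes: float,
    best_score: Optional[float],
    iterations: Optional[int] = None,
    experiment_timeout_minutes: Optional[float] = None,
    experiment_exit_score: Optional[float] = None,
) -> bool:
    """Return True once any configured exit criterion is met (hypothetical helper)."""
    # Number of iterations: stop after the requested count has run.
    if iterations is not None and completed_iterations >= iterations:
        return True
    # Exit after a length of time: stop once the time budget is used up.
    if (experiment_timeout_minutes is not None
            and elapsed_minutes >= experiment_timeout_minutes):
        return True
    # Exit after a score has been reached (assumes higher is better).
    if (experiment_exit_score is not None
            and best_score is not None
            and best_score >= experiment_exit_score):
        return True
    # Nothing configured, or no limit reached yet: keep running.
    return False


# 12 of 50 iterations done and no other limits set.
print(should_stop(12, 30.0, 0.81, iterations=50))
# Same run, but an exit score of 0.80 has already been reached.
print(should_stop(12, 30.0, 0.81, iterations=50, experiment_exit_score=0.80))
```

Treating the criteria as independent limits means whichever one is hit first ends the run, which is consistent with the way the list describes each option on its own.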
