Commit 6987ec3

Minor edits.
1 parent 49215d4 commit 6987ec3

1 file changed, with 26 additions and 26 deletions

articles/machine-learning/how-to-configure-auto-train.md

Lines changed: 26 additions & 26 deletions
@@ -19,7 +19,7 @@ show_latex: true
1919

2020
[!INCLUDE [dev v2](includes/machine-learning-dev-v2.md)]
2121

22-
In this guide, learn how to set up an automated machine learning (AutoML) training job with the [Azure Machine Learning Python SDK v2](/python/api/overview/azure/ml/intro). AutoML picks an algorithm and hyperparameters for you and generates a model ready for deployment. This guide provides details of the various options that you can use to configure automated ML experiments.
22+
In this article, learn how to set up an automated machine learning (AutoML) training job with the [Azure Machine Learning Python SDK v2](/python/api/overview/azure/ml/intro). Automated ML picks an algorithm and hyperparameters for you and generates a model ready for deployment. This article provides details of the various options that you can use to configure automated machine learning experiments.
2323

2424
If you prefer a no-code experience, you can also [Set up no-code Automated ML training for tabular data with the studio UI](how-to-use-automated-ml-for-ml-models.md).
2525

@@ -35,7 +35,7 @@ To use the **SDK** information, install the Azure Machine Learning [SDK v2 for P
3535
To install the SDK, you can either:
3636

3737
- Create a compute instance, which already has the latest Azure Machine Learning Python SDK and is configured for ML workflows. For more information, see [Create an Azure Machine Learning compute instance](how-to-create-compute-instance.md).
38-
- Install the SDK on your local machine
38+
- Install the SDK on your local machine.
3939

4040
# [Azure CLI](#tab/azurecli)
4141

@@ -73,7 +73,7 @@ except Exception as ex:
7373

7474
# [Azure CLI](#tab/azurecli)
7575

76-
In the CLI, begin by logging into your Azure account. If your account is associated with multiple subscriptions, you need to [set the subscription](/cli/azure/manage-azure-subscriptions-azure-cli#change-the-active-subscription).
76+
In the CLI, begin by signing into your Azure account. If your account is associated with multiple subscriptions, you need to [set the subscription](/cli/azure/manage-azure-subscriptions-azure-cli#change-the-active-subscription).
7777

7878
```azurecli
7979
az login
@@ -98,7 +98,7 @@ In order to provide training data in SDK v2, you need to upload it into the clou
9898
Requirements for loading data into an MLTable:
9999

100100
- Data must be in tabular form.
101-
- The value to predict, target column, must be in the data.
101+
- The value to predict, *target column*, must be in the data.
102102

103103
Training data must be accessible from the remote compute. Automated ML v2 (Python SDK and CLI/YAML) accepts MLTable data assets (v2). For backwards compatibility, it also supports registered v1 Tabular Datasets through the same input dataset properties. We recommend that you use MLTable, available in v2. In this example, the data is stored at the local path, *./train_data/bank_marketing_train_data.csv*.
104104

@@ -121,7 +121,7 @@ This code creates a new file, *./train_data/MLTable*, which contains the file fo
121121

122122
# [Azure CLI](#tab/azurecli)
123123

124-
The following YAML code is the definition of an MLTable that is placed in a local folder or a remote folder in the cloud, along with the data file, which is a *.csv* or Parquet file. In this case, write the YAML text to the local file, *./train_data/MLTable*.
124+
The following YAML code is the definition of an MLTable that is placed in a local folder or a remote folder in the cloud, along with the data file. The data file is a *.csv* or Parquet file. In this case, write the YAML text to the local file, *./train_data/MLTable*.
125125

126126
```yml
127127
$schema: https://azuremlschemas.azureedge.net/latest/MLTable.schema.json
@@ -144,20 +144,20 @@ For more information on MLTable, see [Working with tables in Azure Machine Learn
144144
145145
You can specify separate *training data and validation data sets*. Training data must be provided to the `training_data` parameter in the factory function of your automated machine learning job.
146146

147-
If you don't explicitly specify a `validation_data` or `n_cross_validation` parameter, AutoML applies default techniques to determine how validation is performed. This determination depends on the number of rows in the dataset assigned to your `training_data` parameter.
147+
If you don't explicitly specify a `validation_data` or `n_cross_validation` parameter, Automated ML applies default techniques to determine how validation is performed. This determination depends on the number of rows in the dataset assigned to your `training_data` parameter.
148148

149149
| Training data size | Validation technique |
150150
|:---|:-----|
151-
| **Larger than 20,000 rows** | Train/validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
151+
| **Larger than 20,000 rows** | Training and validation data split is applied. The default is to take 10% of the initial training data set as the validation set. In turn, that validation set is used for metrics calculation. |
152152
| **Smaller&nbsp;than&nbsp;or&nbsp;equal&nbsp;to&nbsp;20,000&nbsp;rows** | Cross-validation approach is applied. The default number of folds depends on the number of rows. <br> **If the dataset is fewer than 1,000 rows**, ten folds are used. <br> **If the rows are equal to or between 1,000 and 20,000**, three folds are used. |
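
The default rules in the preceding table can be sketched as a small helper. This is only an illustration of the documented thresholds; the function name `default_validation_technique` is hypothetical and isn't part of the Azure Machine Learning SDK.

```python
def default_validation_technique(n_rows: int) -> str:
    """Return the default validation approach described in the table.

    Hypothetical helper for illustration only; not an SDK function.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if n_rows > 20_000:
        # Larger than 20,000 rows: hold out 10% of the training data.
        return "train/validation split (10% validation)"
    if n_rows < 1_000:
        # Fewer than 1,000 rows: ten folds.
        return "10-fold cross-validation"
    # 1,000 to 20,000 rows inclusive: three folds.
    return "3-fold cross-validation"


for rows in (500, 1_000, 20_000, 50_000):
    print(rows, "->", default_validation_technique(rows))
```

Setting `validation_data` or `n_cross_validation` explicitly overrides these defaults.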
153153

154154
## Compute to run experiment
155155

156-
Automated machine learning jobs with the Python SDK v2 (or CLI v2) are currently only supported on Azure Machine Learning remote compute (cluster or compute instance). For more information about creating compute with the Python SDKv2 or CLIv2, see [Train models with Azure Machine Learning CLI, SDK, and REST API](./how-to-train-model.md).
156+
Automated machine learning jobs with the Python SDK v2 (or CLI v2) are currently only supported on Azure Machine Learning remote compute cluster or compute instance. For more information about creating compute with the Python SDKv2 or CLIv2, see [Train models with Azure Machine Learning CLI, SDK, and REST API](./how-to-train-model.md).
157157

158158
## Configure your experiment settings
159159

160-
There are several options that you can use to configure your automated ML experiment. These configuration parameters are set in your task method. You can also set job training settings and [exit criteria](#exit-criteria) with the `training` and `limits` settings.
160+
There are several options that you can use to configure your automated machine learning experiment. These configuration parameters are set in your task method. You can also set job training settings and [exit criteria](#exit-criteria) with the `training` and `limits` settings.
161161

162162
The following example shows the required parameters for a classification task that specifies accuracy as the [primary metric](#primary-metric) and five cross-validation folds.
163163

@@ -239,9 +239,9 @@ training:
239239

240240
### Select your machine learning task type
241241

242-
Before you can submit your automated machine learning job, determine the kind of machine learning problem that you want to solve. This problem determines which function your job uses and what model algorithms it applies.
242+
Before you can submit your Automated ML job, determine the kind of machine learning problem that you want to solve. This problem determines which function your job uses and what model algorithms it applies.
243243

244-
AutoML supports different task types:
244+
Automated ML supports different task types:
245245

246246
- Tabular data based tasks
247247

@@ -309,7 +309,7 @@ To learn about the specific definitions of these metrics, see [Evaluate automate
309309

310310
#### Metrics for classification multi-class scenarios
311311

312-
These metrics apply for all classification scenarios, including tabular data, images or computer-vision, and NLP-Text.
312+
These metrics apply for all classification scenarios, including tabular data, images or computer-vision, and natural language processing text (NLP-Text).
313313

314314
Threshold-dependent metrics, like `accuracy`, `recall_score_weighted`, `norm_macro_recall`, and `precision_score_weighted` might not optimize as well for datasets that are small, have large class skew (class imbalance), or when the expected metric value is very close to 0.0 or 1.0. In those cases, `AUC_weighted` can be a better choice for the primary metric. After automated machine learning completes, you can choose the winning model based on the metric best suited to your business needs.
315315

@@ -342,7 +342,7 @@ The main difference between `r2_score` and `normalized_root_mean_squared_error`
342342

343343
If the rank, instead of the exact value, is of interest, `spearman_correlation` can be a better choice. It measures the rank correlation between real values and predictions.
344344

345-
AutoML doesn't currently support any primary metrics that measure *relative* difference between predictions and observations. The metrics `r2_score`, `normalized_mean_absolute_error`, and `normalized_root_mean_squared_error` are all measures of absolute difference. For example, if a prediction differs from an observation by 10 units, these metrics compute the same value if the observation is 20 units or 20,000 units. In contrast, a percentage difference, which is a relative measure, gives errors of 50% and 0.05%, respectively. To optimize for relative difference, you can run AutoML with a supported primary metric and then select the model with the best `mean_absolute_percentage_error` or `root_mean_squared_log_error`. These metrics are undefined when any observation values are zero, so they might not always be good choices.
345+
Automated ML doesn't currently support any primary metrics that measure *relative* difference between predictions and observations. The metrics `r2_score`, `normalized_mean_absolute_error`, and `normalized_root_mean_squared_error` are all measures of absolute difference. For example, if a prediction differs from an observation by 10 units, these metrics compute the same value if the observation is 20 units or 20,000 units. In contrast, a percentage difference, which is a relative measure, gives errors of 50% and 0.05%, respectively. To optimize for relative difference, you can run Automated ML with a supported primary metric and then select the model with the best `mean_absolute_percentage_error` or `root_mean_squared_log_error`. These metrics are undefined when any observation values are zero, so they might not always be good choices.
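
The 10-unit example in the preceding paragraph can be checked with a few lines of plain Python. This is a minimal sketch of absolute versus percentage error for a single observation; the helper names are hypothetical and don't correspond to the metric implementations in Automated ML.

```python
def absolute_error(observed: float, predicted: float) -> float:
    """Absolute difference between a prediction and an observation."""
    return abs(predicted - observed)


def percentage_error(observed: float, predicted: float) -> float:
    """Relative difference, in percent. Undefined when the observation is zero."""
    if observed == 0:
        raise ZeroDivisionError("percentage error is undefined for a zero observation")
    return abs(predicted - observed) / abs(observed) * 100


# The same 10-unit miss on a small and a large observation.
for observed, predicted in ((20, 30), (20_000, 20_010)):
    print(
        f"observed={observed}: "
        f"absolute={absolute_error(observed, predicted)}, "
        f"percentage={percentage_error(observed, predicted):.2f}%"
    )
```

Both cases have the same absolute error, while the percentage error differs by three orders of magnitude. The zero-observation check mirrors the caveat about percentage-based metrics being undefined when an observation is zero.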
346346

347347
| Metric | Example use cases |
348348
|:------ |:------- |
@@ -382,7 +382,7 @@ The following table shows the accepted settings for featurization.
382382

383383
| Featurization Configuration | Description |
384384
|:------------- |:------------- |
385-
| `"mode": 'auto'` | Indicates that, as part of preprocessing, [data guardrails and featurization steps](./v1/how-to-configure-auto-features.md#featurization) are performed automatically. **Default setting**. |
385+
| `"mode": 'auto'` | Indicates that, as part of preprocessing, [data guardrails and featurization steps](./v1/how-to-configure-auto-features.md#featurization) are performed automatically. This value is the default setting. |
386386
| `"mode": 'off'` | Indicates featurization step shouldn't be done automatically. |
387387
| `"mode":`&nbsp;`'custom'` | Indicates customized featurization step should be used. |
388388

@@ -506,23 +506,23 @@ az ml job show -n $run_id --web
506506

507507
### Multiple child runs on clusters
508508

509-
AutoML experiment child runs can be performed on a cluster that is already running another experiment. However, the timing depends on how many nodes the cluster has, and if those nodes are available to run a different experiment.
509+
Automated ML experiment child runs can be performed on a cluster that is already running another experiment. However, the timing depends on how many nodes the cluster has, and if those nodes are available to run a different experiment.
510510

511-
Each node in the cluster acts as an individual virtual machine (VM) that can accomplish a single training run. For automated ML, this fact means a child run. If all the nodes are busy, a new experiment is queued. If there are free nodes, the new experiment runs child runs in parallel in the available nodes or virtual machines.
511+
Each node in the cluster acts as an individual virtual machine (VM) that can accomplish a single training run. For Automated ML, this fact means a child run. If all the nodes are busy, a new experiment is queued. If there are free nodes, the new experiment runs child runs in parallel in the available nodes or virtual machines.
512512

513513
To help manage child runs and when they can be performed, we recommend that you create a dedicated cluster per experiment, and match the number of `max_concurrent_iterations` of your experiment to the number of nodes in the cluster. This way, you use all the nodes of the cluster at the same time with the number of concurrent child runs and iterations that you want.
514514

515515
Configure `max_concurrent_iterations` in the `limits` configuration. If it isn't configured, then by default only one concurrent child run/iteration is allowed per experiment. For a compute instance, `max_concurrent_trials` can be set to be the same as number of cores on the compute instance virtual machine.
516516

517517
## Explore models and metrics
518518

519-
AutoML offers options for you to monitor and evaluate your training results.
519+
Automated ML offers options for you to monitor and evaluate your training results.
520520

521521
- For definitions and examples of the performance charts and metrics provided for each run, see [Evaluate automated machine learning experiment results](how-to-understand-automated-ml.md).
522522

523523
- To get a featurization summary and understand what features were added to a particular model, see [Featurization transparency](./v1/how-to-configure-auto-features.md#featurization-transparency).
524524

525-
From Azure Machine Learning UI at the model's page, you can also view the hyper-parameters used when you train a particular model and also view and customize the internal model's training code used.
525+
From the Azure Machine Learning UI at the model's page, you can also view the hyper-parameters used when you train a particular model and also view and customize the internal model's training code used.
526526

527527
## Register and deploy models
528528

@@ -533,9 +533,9 @@ After you test a model and confirm you want to use it in production, you can reg
533533

534534
## Use AutoML in pipelines
535535

536-
To use AutoML in your machine learning operations workflows, you can add AutoML Job steps to your [Azure Machine Learning Pipelines](./how-to-create-component-pipeline-python.md). This approach allows you to automate your entire workflow by hooking up your data preparation scripts to AutoML. Then register and validate the resulting best model.
536+
To use Automated ML in your machine learning operations workflows, you can add AutoML Job steps to your [Azure Machine Learning Pipelines](./how-to-create-component-pipeline-python.md). This approach allows you to automate your entire workflow by hooking up your data preparation scripts to Automated ML. Then register and validate the resulting best model.
537537

538-
This code is a [sample pipeline](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-classification-bankmarketing-in-pipeline) with an AutoML classification component and a command component that shows the resulting output. The code references the inputs (training and validation data) and the outputs (best model) in different steps.
538+
This code is a [sample pipeline](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/automl-classification-bankmarketing-in-pipeline) with an Automated ML classification component and a command component that shows the resulting output. The code references the inputs (training and validation data) and the outputs (best model) in different steps.
539539

540540
# [Python SDK](#tab/python)
541541

@@ -594,7 +594,7 @@ returned_pipeline_job
594594
595595
```
596596

597-
For more examples on how to include AutoML in your pipelines, see the [examples repository](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/).
597+
For more examples on how to include Automated ML in your pipelines, see the [examples repository](https://github.com/Azure/azureml-examples/tree/main/sdk/python/jobs/pipelines/1h_automl_in_pipeline/).
598598

599599
# [Azure CLI](#tab/azurecli)
600600

@@ -659,17 +659,17 @@ Now, you launch the pipeline run using the following command. The pipeline confi
659659

660660
## Use AutoML at scale: distributed training
661661

662-
For large data scenarios, AutoML supports distributed training for a limited set of models:
662+
For large data scenarios, Automated ML supports distributed training for a limited set of models:
663663

664664
| Distributed algorithm | Supported tasks | Data size limit (approximate) |
665665
|:--|:--|:-- |
666-
|[LightGBM](https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html) | Classification, regression | 1 TB |
667-
|[TCNForecaster](concept-automl-forecasting-deep-learning.md#introduction-to-tcnforecaster) | Forecasting | 200 GB |
666+
| [LightGBM](https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html) | Classification, regression | 1 TB |
667+
| [TCNForecaster](concept-automl-forecasting-deep-learning.md#introduction-to-tcnforecaster) | Forecasting | 200 GB |
668668

669669
Distributed training algorithms automatically partition and distribute your data across multiple compute nodes for model training.
670670

671671
> [!NOTE]
672-
> Cross-validation, ensemble models, ONNX support, and code generation are not currently supported in the distributed training mode. Also, AutoML can make choices such as restricting available featurizers and sub-sampling data used for validation, explainability, and model evaluation.
672+
> Cross-validation, ensemble models, ONNX support, and code generation are not currently supported in the distributed training mode. Also, Automated ML can make choices such as restricting available featurizers and sub-sampling data used for validation, explainability, and model evaluation.
673673

674674
### Distributed training for classification and regression
675675

@@ -726,7 +726,7 @@ To learn how distributed training works for forecasting tasks, see [forecasting
726726
|:-- |:--|
727727
| training_mode | Indicates training mode; `distributed` or `non_distributed`. Defaults to `non_distributed`. |
728728
| enable_dnn_training | Flag to enable deep neural network models. |
729-
| max_concurrent_trials | This is the maximum number of trial models to train in parallel. Defaults to 1. |
729+
| max_concurrent_trials | This value is the maximum number of trial models to train in parallel. Defaults to 1. |
730730
| max_nodes | The total number of nodes to use for training. This setting must be greater than or equal to 2. For forecasting tasks, each trial model is trained using $\text{max}\left(2, \text{floor}( \text{max\_nodes} / \text{max\_concurrent\_trials}) \right)$ nodes. |
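
As a quick check of the node formula in the `max_nodes` row, the following sketch computes the nodes used per trial model for a forecasting job. The function name `nodes_per_trial` is hypothetical and isn't part of the SDK; it only restates the formula from the table.

```python
def nodes_per_trial(max_nodes: int, max_concurrent_trials: int = 1) -> int:
    """Nodes used to train each forecasting trial model.

    Restates max(2, floor(max_nodes / max_concurrent_trials)) from the table.
    Hypothetical helper for illustration only.
    """
    if max_nodes < 2:
        raise ValueError("max_nodes must be greater than or equal to 2")
    if max_concurrent_trials < 1:
        raise ValueError("max_concurrent_trials must be at least 1")
    return max(2, max_nodes // max_concurrent_trials)


print(nodes_per_trial(max_nodes=8, max_concurrent_trials=2))
print(nodes_per_trial(max_nodes=4, max_concurrent_trials=4))
```

With eight nodes and two concurrent trials, each trial gets four nodes. With four nodes and four concurrent trials, the floor division gives one, so the two-node minimum applies.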
731731

732732
The following code sample shows an example of these settings for a forecasting job:
