Commit 0317407

Merge pull request #266861 from sdgilley/sdg-freshness
freshness update set-up-training-targets
2 parents 0446096 + 170ec27 commit 0317407

1 file changed

+33
-24
lines changed

articles/machine-learning/v1/how-to-set-up-training-targets.md

Lines changed: 33 additions & 24 deletions
@@ -8,7 +8,7 @@ ms.author: sgilley
88
ms.reviewer: sgilley
99
ms.service: machine-learning
1010
ms.subservice: training
11-
ms.date: 10/21/2021
11+
ms.date: 02/21/2024
1212
ms.topic: how-to
1313
ms.custom: UpdateFrequency5,sdkv1
1414
---
@@ -17,23 +17,24 @@ ms.custom: UpdateFrequency5,sdkv1
1717

1818
[!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
1919

20-
In this article, you learn how to configure and submit Azure Machine Learning jobs to train your models. Snippets of code explain the key parts of configuration and submission of a training script. Then use one of the [example notebooks](#notebook-examples) to find the full end-to-end working examples.
20+
In this article, you learn how to configure and submit Azure Machine Learning jobs to train your models. Snippets of code explain the key parts of configuration and submission of a training script. Then use one of the [example notebooks](#notebook-examples) to find the full end-to-end working examples.
2121

2222
When training, it is common to start on your local computer, and then later scale out to a cloud-based cluster. With Azure Machine Learning, you can run your script on various compute targets without having to change your training script.
2323

24-
All you need to do is define the environment for each compute target within a **script job configuration**. Then, when you want to run your training experiment on a different compute target, specify the job configuration for that compute.
24+
All you need to do is define the environment for each compute target within a **script job configuration**. Then, when you want to run your training experiment on a different compute target, specify the job configuration for that compute.
2525

2626
## Prerequisites
2727

2828
* If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/) today.
29-
* The [Azure Machine Learning SDK for Python](/python/api/overview/azure/ml/install) (>= 1.13.0)
29+
* The [Azure Machine Learning SDK for Python (v1)](/python/api/overview/azure/ml/install) (>= 1.13.0)
3030
* An [Azure Machine Learning workspace](../how-to-manage-workspace.md), `ws`
31-
* A compute target, `my_compute_target`. [Create a compute target](../how-to-create-attach-compute-studio.md)
31+
* A compute target, `my_compute_target`. [Create a compute target](../how-to-create-attach-compute-studio.md)
3232

3333
## What's a script run configuration?
34+
3435
A [ScriptRunConfig](/python/api/azureml-core/azureml.core.scriptrunconfig) is used to configure the information necessary for submitting a training job as part of an experiment.
3536

36-
You submit your training experiment with a ScriptRunConfig object. This object includes the:
37+
You submit your training experiment with a ScriptRunConfig object. This object includes the:
3738

3839
* **source_directory**: The source directory that contains your training script
3940
* **script**: The training script to run
@@ -46,7 +47,7 @@ You submit your training experiment with a ScriptRunConfig object. This object
4647
The code pattern to submit a training job is the same for all types of compute targets:
4748

4849
1. Create an experiment to run
49-
1. Create an environment where the script will run
50+
1. Create an environment where the script runs
5051
1. Create a ScriptRunConfig, which specifies the compute target and environment
5152
1. Submit the job
5253
1. Wait for the job to complete
@@ -60,6 +61,8 @@ Or you can:
6061

6162
Create an [experiment](concept-azure-machine-learning-architecture.md#experiments) in your workspace. An experiment is a light-weight container that helps to organize job submissions and keep track of code.
6263

64+
[!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
65+
6366
```python
6467
from azureml.core import Experiment
6568

@@ -80,10 +83,12 @@ The example code in this article assumes that you have already created a compute
8083
## Create an environment
8184
Azure Machine Learning [environments](../concept-environments.md) are an encapsulation of the environment where your machine learning training happens. They specify the Python packages, Docker image, environment variables, and software settings around your training and scoring scripts. They also specify runtimes (Python, Spark, or Docker).
8285

83-
You can either define your own environment, or use an Azure Machine Learning curated environment. [Curated environments](../how-to-use-environments.md#use-a-curated-environment) are predefined environments that are available in your workspace by default. These environments are backed by cached Docker images which reduce the job preparation cost. See [Azure Machine Learning Curated Environments](../resource-curated-environments.md) for the full list of available curated environments.
86+
You can either define your own environment, or use an Azure Machine Learning curated environment. [Curated environments](../how-to-use-environments.md#use-a-curated-environment) are predefined environments that are available in your workspace by default. These environments are backed by cached Docker images, which reduce the job preparation cost. See [Azure Machine Learning Curated Environments](../resource-curated-environments.md) for the full list of available curated environments.
8487

8588
For a remote compute target, you can use one of these popular curated environments to start with:
8689

90+
[!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
91+
8792
```python
8893
from azureml.core import Workspace, Environment
8994

@@ -95,7 +100,9 @@ For more information and details about environments, see [Create & use software
95100

96101
### Local compute target
97102

98-
If your compute target is your **local machine**, you are responsible for ensuring that all the necessary packages are available in the Python environment where the script runs. Use `python.user_managed_dependencies` to use your current Python environment (or the Python on the path you specify).
103+
If your compute target is your **local machine**, you're responsible for ensuring that all the necessary packages are available in the Python environment where the script runs. Use `python.user_managed_dependencies` to use your current Python environment (or the Python on the path you specify).
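
Because a user-managed environment installs nothing for you, it can help to confirm that the modules your script imports are present before you submit a local job. The following is a minimal sketch that uses only the Python standard library. The helper name `missing_packages` and the module names passed to it are hypothetical and not part of the Azure Machine Learning SDK.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that can't be found in the current Python environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module isn't importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing
```

For example, `missing_packages(["numpy", "sklearn"])` returns an empty list when both modules are importable. The import name can differ from the pip package name (`sklearn` is installed by `scikit-learn`), so pass the names your script actually imports.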
104+
105+
[!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
99106

100107
```python
101108
from azureml.core import Environment
@@ -109,7 +116,9 @@ myenv.python.user_managed_dependencies = True
109116

110117
## Create the script job configuration
111118

112-
Now that you have a compute target (`my_compute_target`, see [Prerequisites](#prerequisites) and environment (`myenv`, see [Create an environment](#create-an-environment)), create a script job configuration that runs your training script (`train.py`) located in your `project_folder` directory:
119+
Now that you have a compute target (`my_compute_target`, see [Prerequisites](#prerequisites)) and environment (`myenv`, see [Create an environment](#create-an-environment)), create a script job configuration that runs your training script (`train.py`) located in your `project_folder` directory:
120+
121+
[!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
113122

114123
```python
115124
from azureml.core import ScriptRunConfig
@@ -119,33 +128,33 @@ src = ScriptRunConfig(source_directory=project_folder,
119128
compute_target=my_compute_target,
120129
environment=myenv)
121130

122-
# Set compute target
123-
# Skip this if you are running on your local computer
124-
script_run_config.run_config.target = my_compute_target
125131
```
126132

127-
If you do not specify an environment, a default environment will be created for you.
133+
If you don't specify an environment, a default environment will be created for you.
128134

129-
If you have command-line arguments you want to pass to your training script, you can specify them via the **`arguments`** parameter of the ScriptRunConfig constructor, e.g. `arguments=['--arg1', arg1_val, '--arg2', arg2_val]`.
135+
If you have command-line arguments you want to pass to your training script, you can specify them via the **`arguments`** parameter of the ScriptRunConfig constructor, for example, `arguments=['--arg1', arg1_val, '--arg2', arg2_val]`.
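
As a rough illustration of how the `arguments` list lines up with the training script, the sketch below builds the list from a dictionary and parses the same flags with `argparse`, as a script such as `train.py` might. The helper names (`build_arguments`, `parse_training_args`), the flag names, and the types and defaults are assumptions chosen for illustration, not part of the SDK.

```python
import argparse


def build_arguments(params):
    """Flatten {'arg1': 0.5} into ['--arg1', 0.5], the list shape shown in the example above."""
    arguments = []
    for name, value in params.items():
        arguments.extend([f"--{name}", value])
    return arguments


def parse_training_args(argv=None):
    """Parse the flags inside the training script (hypothetical names, types, and defaults)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--arg1", type=float, default=0.01)
    parser.add_argument("--arg2", type=int, default=10)
    # With argv=None, argparse reads the flags from the command line.
    return parser.parse_args(argv)
```

In this sketch, you would pass the result of `build_arguments({"arg1": 0.5, "arg2": 3})` as the `arguments` value, and call `parse_training_args()` at the top of the training script.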
130136

131-
If you want to override the default maximum time allowed for the job, you can do so via the **`max_run_duration_seconds`** parameter. The system will attempt to automatically cancel the job if it takes longer than this value.
137+
If you want to override the default maximum time allowed for the job, you can do so via the **`max_run_duration_seconds`** parameter. The system attempts to automatically cancel the job if it takes longer than this value.
132138

133139
### Specify a distributed job configuration
140+
134141
If you want to run a [distributed training](../how-to-train-distributed-gpu.md) job, provide the distributed job-specific config to the **`distributed_job_config`** parameter. Supported config types include [MpiConfiguration](/python/api/azureml-core/azureml.core.runconfig.mpiconfiguration), [TensorflowConfiguration](/python/api/azureml-core/azureml.core.runconfig.tensorflowconfiguration), and [PyTorchConfiguration](/python/api/azureml-core/azureml.core.runconfig.pytorchconfiguration).
135142

136-
For more information and examples on running distributed Horovod, TensorFlow and PyTorch jobs, see:
143+
For more information and examples on running distributed Horovod, TensorFlow, and PyTorch jobs, see:
137144

138145
* [Distributed training of deep learning models on Azure](/azure/architecture/reference-architectures/ai/training-deep-learning)
139146

140147
## Submit the experiment
141148

149+
[!INCLUDE [sdk v1](../includes/machine-learning-sdk-v1.md)]
150+
142151
```python
143152
run = experiment.submit(config=src)
144153
run.wait_for_completion(show_output=True)
145154
```
146155

147156
> [!IMPORTANT]
148-
> When you submit the training job, a snapshot of the directory that contains your training scripts is created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the job again, only the changed files will be uploaded.
157+
> When you submit the training job, a snapshot of the directory that contains your training scripts will be created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the job again, only the changed files will be uploaded.
149158
>
150159
> [!INCLUDE [amlinclude-info](../includes/machine-learning-amlignore-gitignore.md)]
151160
>
@@ -157,7 +166,7 @@ run.wait_for_completion(show_output=True)
157166
>
158167
> To create artifacts during training (such as model files, checkpoints, data files, or plotted images) write these to the `./outputs` folder.
159168
>
160-
> Similarly, you can write any logs from your training job to the `./logs` folder. To utilize Azure Machine Learning's [TensorBoard integration](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/tensorboard/export-run-history-to-tensorboard/export-run-history-to-tensorboard.ipynb) make sure you write your TensorBoard logs to this folder. While your job is in progress, you will be able to launch TensorBoard and stream these logs. Later, you will also be able to restore the logs from any of your previous jobs.
169+
> Similarly, you can write any logs from your training job to the `./logs` folder. To use Azure Machine Learning's [TensorBoard integration](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/tensorboard/export-run-history-to-tensorboard/export-run-history-to-tensorboard.ipynb), make sure you write your TensorBoard logs to this folder. While your job is in progress, you can launch TensorBoard and stream these logs. Later, you can also restore the logs from any of your previous jobs.
161170
>
162171
> For example, to download a file written to the *outputs* folder to your local machine after your remote training job:
163172
> `run.download_file(name='outputs/my_output_file', output_file_path='my_destination_path')`
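
Inside the training script, writing to the `./outputs` and `./logs` folders is ordinary file I/O. The following is a minimal sketch, assuming the script runs with the job's working directory as the current directory. The helper names `save_artifact` and `append_log` and any file names you pass to them are hypothetical.

```python
import os


def save_artifact(filename, text, base_dir="."):
    """Write a text artifact under <base_dir>/outputs and return its path."""
    outputs_dir = os.path.join(base_dir, "outputs")
    os.makedirs(outputs_dir, exist_ok=True)
    path = os.path.join(outputs_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def append_log(filename, message, base_dir="."):
    """Append one line to a log file under <base_dir>/logs and return its path."""
    logs_dir = os.path.join(base_dir, "logs")
    os.makedirs(logs_dir, exist_ok=True)
    path = os.path.join(logs_dir, filename)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(message + "\n")
    return path
```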
@@ -177,25 +186,25 @@ See these notebooks for examples of configuring jobs for various training scenar
177186

178187
## Troubleshooting
179188

180-
* **AttributeError: 'RoundTripLoader' object has no attribute 'comment_handling'**: This error comes from the new version (v0.17.5) of `ruamel-yaml`, an `azureml-core` dependency, that introduces a breaking change to `azureml-core`. In order to fix this error, please uninstall `ruamel-yaml` by running `pip uninstall ruamel-yaml` and installing a different version of `ruamel-yaml`; the supported versions are v0.15.35 to v0.17.4 (inclusive). You can do this by running `pip install "ruamel-yaml>=0.15.35,<0.17.5"`.
189+
* **AttributeError: 'RoundTripLoader' object has no attribute 'comment_handling'**: This error comes from the new version (v0.17.5) of `ruamel-yaml`, an `azureml-core` dependency, that introduces a breaking change to `azureml-core`. To fix this error, uninstall `ruamel-yaml` by running `pip uninstall ruamel-yaml`, and then install a different version of `ruamel-yaml`; the supported versions are v0.15.35 to v0.17.4 (inclusive). You can do this by running `pip install "ruamel-yaml>=0.15.35,<0.17.5"`.
181190

182191

183192
* **Job fails with `jwt.exceptions.DecodeError`**: Exact error message: `jwt.exceptions.DecodeError: It is required that you pass in a value for the "algorithms" argument when calling decode()`.
184193

185194
Consider upgrading to the latest version of azureml-core: `pip install -U azureml-core`.
186195

187-
If you are running into this issue for local jobs, check the version of PyJWT installed in your environment where you are starting jobs. The supported versions of PyJWT are < 2.0.0. Uninstall PyJWT from the environment if the version is >= 2.0.0. You may check the version of PyJWT, uninstall and install the right version as follows:
196+
If you're running into this issue for local jobs, check the version of PyJWT installed in the environment where you're starting jobs. The supported versions of PyJWT are < 2.0.0. Uninstall PyJWT from the environment if the version is >= 2.0.0. You can check the version of PyJWT, uninstall it, and install the right version as follows:
188197
1. Start a command shell, activate conda environment where azureml-core is installed.
189198
2. Enter `pip freeze` and look for `PyJWT`, if found, the version listed should be < 2.0.0
190199
3. If the listed version is not a supported version, `pip uninstall PyJWT` in the command shell and enter y for confirmation.
191200
4. Install using `pip install 'PyJWT<2.0.0'`
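
The version ranges quoted in this section can also be checked from Python. The following sketch uses `importlib.metadata` from the standard library (Python 3.8 or later). The helper names are hypothetical, the version parsing is deliberately simple (leading numeric components only), and the distribution name you pass to `installed_version` is assumed to match what `pip freeze` reports.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '0.17.4' into (0, 17, 4); stop at the first non-numeric component."""
    if not text:
        return ()
    parts = []
    for piece in text.split("."):
        digits = re.match(r"\d+", piece)
        if digits is None:
            break
        parts.append(int(digits.group()))
    return tuple(parts)


def installed_version(distribution):
    """Return the installed version string, or None if the distribution isn't installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def pyjwt_is_supported(version_text):
    """PyJWT range from this article: versions below 2.0.0."""
    parts = parse_version(version_text)
    return bool(parts) and parts[0] < 2


def ruamel_yaml_is_supported(version_text):
    """ruamel-yaml range from this article: 0.15.35 through 0.17.4, inclusive."""
    parts = parse_version(version_text)
    return (0, 15, 35) <= parts < (0, 17, 5)
```

For example, `pyjwt_is_supported(installed_version("PyJWT"))` returns `False` when PyJWT is either missing or at version 2.0.0 or later.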
192201

193-
If you are submitting a user-created environment with your job, consider using the latest version of azureml-core in that environment. Versions >= 1.18.0 of azureml-core already pin PyJWT < 2.0.0. If you need to use a version of azureml-core < 1.18.0 in the environment you submit, make sure to specify PyJWT < 2.0.0 in your pip dependencies.
202+
If you're submitting a user-created environment with your job, consider using the latest version of azureml-core in that environment. Versions >= 1.18.0 of azureml-core already pin PyJWT < 2.0.0. If you need to use a version of azureml-core < 1.18.0 in the environment you submit, make sure to specify PyJWT < 2.0.0 in your pip dependencies.
194203

195204

196-
* **ModuleErrors (No module named)**: If you are running into ModuleErrors while submitting experiments in Azure Machine Learning, the training script is expecting a package to be installed but it isn't added. Once you provide the package name, Azure Machine Learning installs the package in the environment used for your training job.
205+
* **ModuleErrors (No module named)**: If you're running into ModuleErrors while submitting experiments in Azure Machine Learning, the training script is expecting a package to be installed but it isn't added. Once you provide the package name, Azure Machine Learning installs the package in the environment used for your training job.
197206

198-
If you are using Estimators to submit experiments, you can specify a package name via `pip_packages` or `conda_packages` parameter in the estimator based on from which source you want to install the package. You can also specify a yml file with all your dependencies using `conda_dependencies_file`or list all your pip requirements in a txt file using `pip_requirements_file` parameter. If you have your own Azure Machine Learning Environment object that you want to override the default image used by the estimator, you can specify that environment via the `environment` parameter of the estimator constructor.
207+
If you're using Estimators to submit experiments, you can specify a package name via the `pip_packages` or `conda_packages` parameter in the estimator, depending on the source from which you want to install the package. You can also specify a yml file with all your dependencies using `conda_dependencies_file` or list all your pip requirements in a txt file using the `pip_requirements_file` parameter. If you have your own Azure Machine Learning Environment object that you want to override the default image used by the estimator, you can specify that environment via the `environment` parameter of the estimator constructor.
199208

200209
Azure Machine Learning maintained docker images and their contents can be seen in [Azure Machine Learning Containers](https://github.com/Azure/AzureML-Containers).
201210
Framework-specific dependencies are listed in the respective framework documentation:
