In this article, you learn how to configure and submit Azure Machine Learning jobs to train your models. Snippets of code explain the key parts of configuration and submission of a training script. Then use one of the [example notebooks](#notebook-examples) to find the full end-to-end working examples.

-When training, it is common to start on your local computer, and then later scale out to a cloud-based cluster. With Azure Machine Learning, you can run your script on various compute targets without having to change your training script.
+When training, it's common to start on your local computer, and then later scale out to a cloud-based cluster. With Azure Machine Learning, you can run your script on various compute targets without having to change your training script.

All you need to do is define the environment for each compute target within a **script job configuration**. Then, when you want to run your training experiment on a different compute target, specify the job configuration for that compute.
@@ -40,7 +40,7 @@ You submit your training experiment with a ScriptRunConfig object. This object i
* **script**: The training script to run
* **compute_target**: The compute target to run on
* **environment**: The environment to use when running the script
-* and some additional configurable options (see the [reference documentation](/python/api/azureml-core/azureml.core.scriptrunconfig) for more information)
+* other configurable options (see the [reference documentation](/python/api/azureml-core/azureml.core.scriptrunconfig) for more information)

## Train your model
@@ -57,18 +57,6 @@ Or you can:
* Submit a HyperDrive run for [hyperparameter tuning](../how-to-tune-hyperparameters.md).
* Submit an experiment via the [VS Code extension](../tutorial-train-deploy-image-classification-model-vscode.md#train-the-model).

-## Create an experiment
-
-Create an [experiment](concept-azure-machine-learning-architecture.md#experiments) in your workspace. An experiment is a light-weight container that helps to organize job submissions and keep track of code.
@@ -77,7 +65,7 @@ Select the compute target where your training script will run on. If no compute
The example code in this article assumes that you have already created a compute target `my_compute_target` from the "Prerequisites" section.

>[!NOTE]
-> - Azure Databricks is not supported as a compute target for model training. You can use Azure Databricks for data preparation and deployment tasks.
+> - Azure Databricks isn't supported as a compute target for model training. You can use Azure Databricks for data preparation and deployment tasks.
> - To create and attach a compute target for training on Azure Arc-enabled Kubernetes cluster, see [Configure Azure Arc-enabled Machine Learning](../how-to-attach-kubernetes-anywhere.md)
Create an [experiment](concept-azure-machine-learning-architecture.md#experiments) in your workspace. An experiment is a light-weight container that helps to organize job submissions and keep track of code.
Now that you have a compute target (`my_compute_target`, see [Prerequisites](#prerequisites)) and environment (`myenv`, see [Create an environment](#create-an-environment)), create a script job configuration that runs your training script (`train.py`) located in your `project_folder` directory:
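A minimal sketch of such a configuration, assuming the `azureml-core` package (SDK v1) is installed and that `project_folder`, `my_compute_target`, and `myenv` are defined as described above. The helper name `build_script_job_config` is hypothetical; the import is deferred so that the function can be defined even where the SDK isn't available.

```python
def build_script_job_config(project_folder, compute_target, environment, arguments=None):
    """Return a ScriptRunConfig that runs train.py from project_folder.

    Hypothetical helper: the parameter names mirror the ScriptRunConfig
    options listed in this article.
    """
    # Deferred import: requires the azureml-core package (SDK v1).
    from azureml.core import ScriptRunConfig

    return ScriptRunConfig(
        source_directory=project_folder,  # directory that contains train.py
        script="train.py",                # the training script to run
        arguments=arguments,              # optional command-line arguments
        compute_target=compute_target,    # the compute target to run on
        environment=environment,          # the environment to use for the run
    )
```

With the names used in this article, the call would look like `src = build_script_job_config(project_folder, my_compute_target, myenv)`.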
-If you don't specify an environment, a default environment will be created for you.
+If you don't specify an environment, a default environment is created for you.

If you have command-line arguments you want to pass to your training script, you can specify them via the **`arguments`** parameter of the ScriptRunConfig constructor, for example, `arguments=['--arg1', arg1_val, '--arg2', arg2_val]`.
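On the receiving side, the training script has to read those arguments. The sketch below shows one way to do that with the standard-library `argparse` module. The names `--arg1` and `--arg2` come from the example above; their types, defaults, and the function name `parse_training_args` are assumptions made for illustration.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the command-line arguments passed to a hypothetical train.py.

    argv=None makes argparse read sys.argv[1:], which is what happens
    when the script is launched as part of a job.
    """
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    # The types and defaults below are assumptions, not part of the article.
    parser.add_argument("--arg1", type=float, default=0.01)
    parser.add_argument("--arg2", type=int, default=10)
    return parser.parse_args(argv)
```

Inside `train.py`, calling `args = parse_training_args()` would then expose the values as `args.arg1` and `args.arg2`.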
-> When you submit the training job, a snapshot of the directory that contains your training scripts will be created and sent to the compute target. It is also stored as part of the experiment in your workspace. If you change files and submit the job again, only the changed files will be uploaded.
+> When you submit the training job, a snapshot of the directory that contains your training scripts is created and sent to the compute target. It's also stored as part of the experiment in your workspace. If you change files and submit the job again, only the changed files are uploaded.
> For more information about snapshots, see [Snapshots](concept-azure-machine-learning-architecture.md#snapshots).

> [!IMPORTANT]
> **Special Folders**
-> Two folders, *outputs* and *logs*, receive special treatment by Azure Machine Learning. During training, when you write files to folders named *outputs* and *logs* that are relative to the root directory (`./outputs` and `./logs`, respectively), the files will automatically upload to your job history so that you have access to them once your job is finished.
+> Two folders, *outputs* and *logs*, receive special treatment by Azure Machine Learning. During training, when you write files to folders named *outputs* and *logs* that are relative to the root directory (`./outputs` and `./logs`, respectively), the files automatically upload to your job history so that you have access to them once your job is finished.
>
-> To create artifacts during training (such as model files, checkpoints, data files, or plotted images) write these to the `./outputs` folder.
+> To create artifacts during training (such as model files, checkpoints, data files, or plotted images) write to the `./outputs` folder.
>
-> Similarly, you can write any logs from your training job to the `./logs` folder. To utilize Azure Machine Learning's [TensorBoard integration](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/tensorboard/export-run-history-to-tensorboard/export-run-history-to-tensorboard.ipynb) make sure you write your TensorBoard logs to this folder. While your job is in progress, you will be able to launch TensorBoard and stream these logs. Later, you will also be able to restore the logs from any of your previous jobs.
+> Similarly, you can write any logs from your training job to the `./logs` folder. To utilize Azure Machine Learning's [TensorBoard integration](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/track-and-monitor-experiments/tensorboard/export-run-history-to-tensorboard/export-run-history-to-tensorboard.ipynb), make sure you write your TensorBoard logs to this folder. While your job is in progress, you'll be able to launch TensorBoard and stream these logs. Later, you'll also be able to restore the logs from any of your previous jobs.
>
> For example, to download a file written to the *outputs* folder to your local machine after your remote training job:
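A minimal sketch of both halves of that workflow: writing an artifact under `./outputs` during training, and downloading it afterwards. It assumes `run` is an `azureml.core.Run` object for the finished job, whose `download_file` method takes the artifact name and a local `output_file_path`. The helper names `save_artifact` and `download_output` and all file names are hypothetical.

```python
import os


def save_artifact(filename, text, root="."):
    """Write a text artifact under <root>/outputs and return its path.

    Called from the training script with the default root, this writes
    to ./outputs, the folder described above.
    """
    out_dir = os.path.join(root, "outputs")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def download_output(run, filename, destination):
    """Download outputs/<filename> from a finished job to a local path.

    `run` is assumed to be an azureml.core.Run; any object that offers a
    compatible download_file method works for this sketch.
    """
    run.download_file(name=f"outputs/{filename}", output_file_path=destination)
```

Passing `run` as a parameter keeps the sketch importable without the SDK and makes the download step easy to exercise with a stand-in object.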
@@ -186,17 +187,17 @@ See these notebooks for examples of configuring jobs for various training scenar
## Troubleshooting

-* **AttributeError: 'RoundTripLoader' object has no attribute 'comment_handling'**: This error comes from the new version (v0.17.5) of `ruamel-yaml`, an `azureml-core` dependency, that introduces a breaking change to `azureml-core`. In order to fix this error, uninstall `ruamel-yaml` by running `pip uninstall ruamel-yaml` and installing a different version of `ruamel-yaml`; the supported versions are v0.15.35 to v0.17.4 (inclusive). You can do this by running `pip install "ruamel-yaml>=0.15.35,<0.17.5"`.
+* **AttributeError: 'RoundTripLoader' object has no attribute 'comment_handling'**: This error comes from the new version (v0.17.5) of `ruamel-yaml`, an `azureml-core` dependency, that introduces a breaking change to `azureml-core`. In order to fix this error, uninstall `ruamel-yaml` by running `pip uninstall ruamel-yaml` and installing a different version of `ruamel-yaml`; the supported versions are v0.15.35 to v0.17.4 (inclusive). You can do so by running `pip install "ruamel-yaml>=0.15.35,<0.17.5"`.

* **Job fails with `jwt.exceptions.DecodeError`**: Exact error message: `jwt.exceptions.DecodeError: It is required that you pass in a value for the "algorithms" argument when calling decode()`.

Consider upgrading to the latest version of azureml-core: `pip install -U azureml-core`.

-If you're running into this issue for local jobs, check the version of PyJWT installed in your environment where you're starting jobs. The supported versions of PyJWT are < 2.0.0. Uninstall PyJWT from the environment if the version is >= 2.0.0. You may check the version of PyJWT, uninstall, and install the right version as follows:
+If you run into this issue for local jobs, check the version of PyJWT installed in your environment where you're starting jobs. The supported versions of PyJWT are < 2.0.0. Uninstall PyJWT from the environment if the version is >= 2.0.0. You may check the version of PyJWT, uninstall, and install the right version as follows:
1. Start a command shell, activate conda environment where azureml-core is installed.
2. Enter `pip freeze` and look for `PyJWT`, if found, the version listed should be < 2.0.0
-3. If the listed version is not a supported version, `pip uninstall PyJWT` in the command shell and enter y for confirmation.
+3. If the listed version isn't a supported version, `pip uninstall PyJWT` in the command shell and enter y for confirmation.
4. Install using `pip install 'PyJWT<2.0.0'`

If you're submitting a user-created environment with your job, consider using the latest version of azureml-core in that environment. Versions >= 1.18.0 of azureml-core already pin PyJWT < 2.0.0. If you need to use a version of azureml-core < 1.18.0 in the environment you submit, make sure to specify PyJWT < 2.0.0 in your pip dependencies.
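The version checks described in this section can also be scripted instead of read off `pip freeze`. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8 or later) and the supported ranges quoted above. The function names and the `SUPPORTED` table are hypothetical, and the version parser is a simplification that only looks at the leading numeric part of a version string. Depending on the Python version, the distribution name may need to be spelled `ruamel.yaml` for the lookup to succeed.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric part of a version as a tuple of ints.

    Simplification: '2.0.0rc1' is treated as (2, 0, 0), and short
    versions such as '2.0' are padded to three components.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    parts = [int(piece) for piece in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_range(version, minimum=None, below=None):
    """Check that minimum <= version < below (either bound is optional)."""
    parsed = parse_version(version)
    if minimum is not None and parsed < parse_version(minimum):
        return False
    if below is not None and parsed >= parse_version(below):
        return False
    return True


# Supported ranges as stated in this troubleshooting section.
SUPPORTED = {
    "ruamel-yaml": {"minimum": "0.15.35", "below": "0.17.5"},
    "PyJWT": {"below": "2.0.0"},
}


def check_installed(supported=SUPPORTED):
    """Return {package: (installed_version_or_None, is_supported)}."""
    report = {}
    for name, bounds in supported.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package not found in this environment: nothing to flag.
            report[name] = (None, True)
            continue
        report[name] = (installed, in_range(installed, **bounds))
    return report
```

Running `check_installed()` in the environment where jobs are started returns one entry per package; an entry whose second element is `False` points to a package that needs the reinstall steps listed above.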