articles/machine-learning/how-to-train-pytorch.md
+27 -29 (27 additions & 29 deletions)
@@ -8,7 +8,7 @@ ms.subservice: training
 ms.author: sgilley
 author: sdgilley
 ms.reviewer: balapv
-ms.date: 01/26/2024
+ms.date: 09/17/2024
 ms.topic: how-to
 ms.custom: sdkv2, update-code1
 #Customer intent: As a Python PyTorch developer, I need to combine open-source with a cloud platform to train, evaluate, and deploy my deep learning models at scale.
-In this article, you'll learn to train, hyperparameter tune, and deploy a [PyTorch](https://pytorch.org/) model using the Azure Machine Learning Python SDK v2.
+In this article, you learn to train, hyperparameter tune, and deploy a [PyTorch](https://pytorch.org/) model using the Azure Machine Learning Python SDK v2.

-You'll use example scripts to classify chicken and turkey images to build a deep learning neural network (DNN) based on [PyTorch's transfer learning tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html). Transfer learning is a technique that applies knowledge gained from solving one problem to a different but related problem. Transfer learning shortens the training process by requiring less data, time, and compute resources than training from scratch. To learn more about transfer learning, see [Deep learning vs. machine learning](./concept-deep-learning-vs-machine-learning.md#what-is-transfer-learning).
+You use example scripts to classify chicken and turkey images to build a deep learning neural network (DNN) based on [PyTorch's transfer learning tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html). Transfer learning is a technique that applies knowledge gained from solving one problem to a different but related problem. Transfer learning shortens the training process by requiring less data, time, and compute resources than training from scratch. To learn more about transfer learning, see [Deep learning vs. machine learning](./concept-deep-learning-vs-machine-learning.md#what-is-transfer-learning).

-Whether you're training a deep learning PyTorch model from the ground-up or you're bringing an existing model into the cloud, you can use Azure Machine Learning to scale out open-source training jobs using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.
+Whether you're training a deep learning PyTorch model from the ground-up or you're bringing an existing model into the cloud, use Azure Machine Learning to scale out open-source training jobs using elastic cloud compute resources. You can build, deploy, version, and monitor production-grade models with Azure Machine Learning.

 ## Prerequisites
@@ -37,8 +37,6 @@ Whether you're training a deep learning PyTorch model from the ground-up or you'
 You can also find a completed [Jupyter notebook version](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/train-hyperparameter-tune-deploy-with-pytorch/train-hyperparameter-tune-deploy-with-pytorch.ipynb) of this guide on the GitHub samples page.
 This section sets up the job for training by loading the required Python packages, connecting to a workspace, creating a compute resource to run a command job, and creating an environment to run the job.
@@ -51,7 +49,7 @@ We're using `DefaultAzureCredential` to get access to the workspace. This creden
 If `DefaultAzureCredential` doesn't work for you, see [azure.identity package](/python/api/azure-identity/azure.identity) or [Set up authentication](how-to-setup-authentication.md?tabs=sdk) for more available credentials.
 The result of running this script is a workspace handle that you can use to manage other resources and jobs.
@@ -83,19 +81,19 @@ Azure Machine Learning needs a compute resource to run a job. This resource can
 In the following example script, we provision a Linux [compute cluster](./how-to-create-attach-compute-cluster.md?tabs=python). You can see the [Azure Machine Learning pricing](https://azure.microsoft.com/pricing/details/machine-learning/) page for the full list of VM sizes and prices. Since we need a GPU cluster for this example, let's pick a `STANDARD_NC6` model and create an Azure Machine Learning compute.
 To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning [environment](concept-environments.md) encapsulates the dependencies (such as software runtime and libraries) needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

-Azure Machine Learning allows you to either use a curated (or ready-made) environment or create a custom environment using a Docker image or a Conda configuration. In this article, you reuse the curated Azure Machine Learning environment `AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu`. Use the latest version of this environment using the `@latest` directive.
+Azure Machine Learning allows you to either use a curated (or ready-made) environment or create a custom environment using a Docker image or a Conda configuration. In this article, you reuse the curated Azure Machine Learning environment `AzureML-acpt-pytorch-2.2-cuda12.1`. Use the latest version of this environment using the `@latest` directive.
-In this section, we begin by introducing the data for training. We then cover how to run a training job, using a training script that we've provided. You'll learn to build the training job by configuring the command for running the training script. Then, you'll submit the training job to run in Azure Machine Learning.
+In this section, we begin by introducing the data for training. We then cover how to run a training job, using a training script that we've provided. You learn to build the training job by configuring the command for running the training script. Then, you submit the training job to run in Azure Machine Learning.

 ### Obtain the training data
@@ -115,9 +113,9 @@ An Azure Machine Learning `command` is a resource that specifies all the details
 #### Configure the command

-You'll use the general purpose `command` to run the training script and perform your desired tasks. Create a `command` object to specify the configuration details of your training job.
+You use the general purpose `command` to run the training script and perform your desired tasks. Create a `command` object to specify the configuration details of your training job.
 Once completed, the job registers a model in your workspace (as a result of training) and outputs a link for viewing the job in Azure Machine Learning studio.
@@ -156,7 +154,7 @@ To tune the model's hyperparameters, define the parameter space in which to sear
 Since the training script uses a learning rate schedule to decay the learning rate every several epochs, you can tune the initial learning rate and the momentum parameters.
 Then, you can configure sweep on the command job, using some sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.
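As a rough illustration of what random sampling over such a parameter space means, the following standard-library sketch draws a few (learning rate, momentum) configurations from uniform ranges. The ranges, the helper name `sample_configurations`, and the seed are assumptions made for illustration; in the article, the search space and the sampling algorithm are configured through the SDK's sweep settings rather than by hand.

```python
import random


def sample_configurations(search_space, n_trials, seed=0):
    """Draw n_trials configurations uniformly at random from search_space.

    search_space maps a hyperparameter name to a (low, high) range.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}
        for _ in range(n_trials)
    ]


# Hypothetical ranges, chosen only for illustration.
search_space = {
    "learning_rate": (0.0005, 0.005),
    "momentum": (0.9, 0.99),
}

trials = sample_configurations(search_space, n_trials=4)
for trial in trials:
    print(trial)
```

Each trial is an independent draw, so unlike a grid search the number of trials doesn't grow with the number of hyperparameters being tuned.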
@@ -165,19 +163,19 @@ In the following code, we use random sampling to try different configuration set
 We also define an early termination policy, the `BanditPolicy`, to terminate poorly performing runs early.
 The `BanditPolicy` terminates any run that doesn't fall within the slack factor of our primary evaluation metric. You apply this policy every epoch (since we report our `best_val_acc` metric every epoch and `evaluation_interval`=1). Notice we delay the first policy evaluation until after the first 10 epochs (`delay_evaluation`=10).
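The slack-factor rule can be sketched in plain Python to make the decision concrete. This is a simplified model of the policy for a metric that is maximized, not the SDK's implementation: the function name `should_terminate` and the slack factor of 0.2 are assumptions for illustration, and a run is treated as outside the slack when its metric falls below `best_metric / (1 + slack_factor)`.

```python
def should_terminate(run_metric, best_metric, interval, slack_factor,
                     evaluation_interval=1, delay_evaluation=0):
    """Return True if a run would be stopped under a slack-factor bandit rule.

    Assumes the primary metric is maximized (for example, validation accuracy).
    """
    # No policy evaluation happens before the delay has elapsed.
    if interval < delay_evaluation:
        return False
    # The policy is only applied at multiples of evaluation_interval.
    if interval % evaluation_interval != 0:
        return False
    # Runs below best / (1 + slack_factor) fall outside the allowed slack.
    return run_metric < best_metric / (1 + slack_factor)


# Illustrative values: the best run so far reports 0.8 and the slack factor is 0.2,
# so the cutoff is 0.8 / 1.2, roughly 0.667.
early_epoch = should_terminate(0.6, 0.8, interval=5, slack_factor=0.2, delay_evaluation=10)
weak_run = should_terminate(0.6, 0.8, interval=12, slack_factor=0.2, delay_evaluation=10)
close_run = should_terminate(0.7, 0.8, interval=12, slack_factor=0.2, delay_evaluation=10)
print(early_epoch, weak_run, close_run)
```

With `delay_evaluation`=10, the weak run survives at epoch 5 and is only stopped once the policy starts being evaluated, which gives slow-starting configurations time to catch up.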
@@ -193,13 +191,13 @@ For more information about deployment, see [Deploy and score a machine learning
 As a first step to deploying your model, you need to create your online endpoint. The endpoint name must be unique in the entire Azure region. For this article, you create a unique name using a universally unique identifier (UUID).
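One way to build such a name with the standard library is sketched below. The prefix `pytorch-endpoint`, the helper name `make_endpoint_name`, and the eight-character suffix are assumptions for illustration, not values taken from the article.

```python
import uuid


def make_endpoint_name(prefix, suffix_length=8):
    """Append a short random hex suffix from a version 4 UUID to the prefix."""
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{prefix}-{suffix}"


# Hypothetical prefix; any name that satisfies the endpoint naming rules would do.
endpoint_name = make_endpoint_name("pytorch-endpoint")
print(endpoint_name)
```

Truncating the UUID keeps the name short at the cost of a small collision risk, so a longer suffix is the safer choice if many endpoints share the same prefix in a region.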
 > Expect this cleanup to take a bit of time to finish.

-## Next steps
+## Related content

 In this article, you trained and registered a deep learning neural network using PyTorch on Azure Machine Learning. You also deployed the model to an online endpoint. See these other articles to learn more about Azure Machine Learning.