`articles/machine-learning/how-to-train-tensorflow.md` (20 additions, 20 deletions)
ms.subservice: training
ms.author: balapv
author: balapv
ms.reviewer: sgilley
ms.date: 04/03/2024
ms.topic: how-to
ms.custom: sdkv2
#Customer intent: As a TensorFlow developer, I need to combine open-source with a cloud platform to train, evaluate, and deploy my deep learning models at scale.
## Prerequisites

To benefit from this article, you need to:

- Access an Azure subscription. If you don't have one already, [create a free account](https://azure.microsoft.com/free/).
- Run the code in this article using either an Azure Machine Learning compute instance or your own Jupyter notebook.
  - Azure Machine Learning compute instance—no downloads or installation necessary:
    - Complete the [Create resources to get started](quickstart-create-resources.md) tutorial to create a dedicated notebook server preloaded with the SDK and the sample repository.
    - In the samples deep learning folder on the notebook server, find a completed and expanded notebook by navigating to this directory: **v2 > sdk > python > jobs > single-step > tensorflow > train-hyperparameter-tune-deploy-with-tensorflow**.
  - Your Jupyter notebook server:
    - [Install the Azure Machine Learning SDK (v2)](https://aka.ms/sdk-v2-install).
This section sets up the job for training by loading the required Python packages.

### Connect to the workspace

First, you need to connect to your Azure Machine Learning workspace. The [Azure Machine Learning workspace](concept-workspace.md) is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We're using `DefaultAzureCredential` to get access to the workspace. This credential should be capable of handling most Azure SDK authentication scenarios.
Next, get a handle to the workspace by providing your Subscription ID, Resource Group name, and Workspace name.

The result of running this script is a workspace handle that you use to manage other resources and jobs.

> [!NOTE]
> Creating `MLClient` will not connect the client to the workspace. The client initialization is lazy and will wait for the first time it needs to make a call. In this article, this will happen during compute creation.

### Create a compute resource

Azure Machine Learning needs a compute resource to run a job. This resource can be single- or multi-node machines with Linux or Windows OS, or a specific compute fabric like Spark.
In the following example script, we provision a Linux compute cluster.

### Create a job environment

To run an Azure Machine Learning job, you need an environment. An Azure Machine Learning [environment](concept-environments.md) encapsulates the dependencies (such as software runtime and libraries) needed to run your machine learning training script on your compute resource. This environment is similar to a Python environment on your local machine.

Azure Machine Learning allows you to either use a curated (or ready-made) environment—useful for common training and inference scenarios—or create a custom environment using a Docker image or a Conda configuration.

In this article, you reuse the curated Azure Machine Learning environment `AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu`. You use the latest version of this environment via the `@latest` directive.

In this section, we begin by introducing the data for training. We then cover how to run a training job, using a training script that we've provided. You learn to build the training job by configuring the command for running the training script. Then, you submit the training job to run in Azure Machine Learning.

### Obtain the training data

You use data from the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits. This data is sourced from Yann LeCun's website and stored in an Azure storage account.
The provided training script does the following:

- trains a model, using the data; and
- returns the output model.

During the pipeline run, you use MLflow to log the parameters and metrics. To learn how to enable MLflow tracking, see [Track ML experiments and models with MLflow](how-to-use-mlflow-cli-runs.md).

In the training script `tf_mnist.py`, we create a simple deep neural network (DNN). This DNN has:
### Build the training job

Now that you have all the assets required to run your job, it's time to build it using the Azure Machine Learning Python SDK v2. For this example, we create a `command`.

An Azure Machine Learning `command` is a resource that specifies all the details needed to execute your training code in the cloud. These details include the inputs and outputs, the type of hardware to use, the software to install, and how to run your code. The `command` contains information to execute a single command.

#### Configure the command

You use the general-purpose `command` to run the training script and perform your desired tasks. Create a `Command` object to specify the configuration details of your training job.
Then, you configure a sweep on the command job, using some sweep-specific parameters, such as the primary metric to watch and the sampling algorithm to use.

In the following code, we use random sampling to try different configuration sets of hyperparameters in an attempt to maximize our primary metric, `validation_acc`.

We also define an early termination policy—the `BanditPolicy`. This policy checks the job every two iterations. If the primary metric, `validation_acc`, falls outside the top 10 percent range, Azure Machine Learning terminates the job. This saves the model from continuing to explore hyperparameters that show no promise of helping to reach the target metric.
You can then register this model.

## Deploy the model as an online endpoint

After you register your model, you can deploy it as an [online endpoint](concept-endpoints.md)—that is, as a web service in the Azure cloud.

To deploy a machine learning service, you typically need:

- The model assets that you want to deploy. These assets include the model's file and metadata that you already registered in your training job.
- Some code to run as a service. The code executes the model on a given input request (an entry script). This entry script receives data submitted to a deployed web service and passes it to the model. After the model processes the data, the script returns the model's response to the client. The script is specific to your model and must understand the data that the model expects and returns. When you use an MLflow model, Azure Machine Learning automatically creates this script for you.

For more information about deployment, see [Deploy and score a machine learning model with managed online endpoint using Python SDK v2](how-to-deploy-managed-online-endpoint-sdk-v2.md).
### Create a new online endpoint

As a first step to deploying your model, you need to create your online endpoint. The endpoint name must be unique in the entire Azure region. For this article, you create a unique name using a universally unique identifier (UUID).
After you create the endpoint, you can deploy the model with the entry script. An endpoint can have multiple deployments. Using rules, the endpoint can then direct traffic to these deployments.

In the following code, you create a single deployment that handles 100% of the incoming traffic. We use an arbitrary color name (*tff-blue*) for the deployment. You could also use any other name, such as *tff-green* or *tff-red*.
The code to deploy the model to the endpoint does the following:

- deploys the best version of the model that you registered earlier;
### Test the deployment with a sample query

After you deploy the model to the endpoint, you can predict the output of the deployed model using the `invoke` method on the endpoint. To run the inference, use the sample request file `sample-request.json` from the *request* folder.