articles/machine-learning/service/how-to-train-tensorflow.md
25 additions & 28 deletions
@@ -14,26 +14,27 @@ ms.custom: seodec18
# Train and register TensorFlow models at scale with Azure Machine Learning service
-
This article shows you how to train and register a TensorFlow model using Azure Machine Learning service. We'll be using the popular [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to classify handwritten digits using a deep neural network built using the [TensorFlow Python library](https://www.tensorflow.org/overview).
+
This article shows you how to train and register a TensorFlow model using Azure Machine Learning service. It uses the popular [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to classify handwritten digits with a deep neural network built with the [TensorFlow Python library](https://www.tensorflow.org/overview).
-
With Azure Machine Learning service, you'll be able to rapidly scale out your open-source training jobs using elastic cloud compute resources. You'll also be able track your training runs, version models, deploy models, and much more.
+
With Azure Machine Learning service, you can rapidly scale out open-source training jobs using elastic cloud compute resources. You can also track your training runs, version models, deploy models, and much more.
-
Whether you're developing a TensorFlow model from the ground-up or you're bringing an existing model into the cloud, you can build production-ready models with Azure Machine Learning service.
+
Whether you're developing a TensorFlow model from the ground up or bringing an existing model into the cloud, Azure Machine Learning service can help you build production-ready models.
## Prerequisites
-
- Install the [Azure Machine Learning SDK for Python](setup-create-workspace.md#sdk). Optional: create a `config.json` configuration file.
-
- Download the [sample script files](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow) `mnist-tf.py` and `utils.py`
+
- An Azure subscription. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today.
+
- [Install the Azure Machine Learning SDK for Python](setup-create-workspace.md#sdk)
+
- [Download the sample script files](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow) `mnist-tf.py` and `utils.py`
-
You can also find a completed [Jupyter Notebook version](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb) of this guide on our Github samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning and model deployment.
+
You can also find a completed [Jupyter Notebook version](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb) of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning and model deployment.
## Set up the experiment
-
This section sets up the training experiment by loading the required python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts using the Python SDK.
+
This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts.
### Import packages
-
First, we'll need to import the necessary Python libraries.
+
First, import the necessary Python libraries.
```Python
import os
@@ -52,18 +53,12 @@ from azureml.core.compute_target import ComputeTargetException
The [Azure Machine Learning service workspace](concept-workspace.md) is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a [`workspace`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) object.
-
If you completed the optional step in the [prerequisites section](#prerequisites), you can use `Workspace.from_config()` to quickly create a workspace object from the details stored in the config file.
+
To create a workspace, find a value for the `<azure-subscription-id>` parameter in the [subscriptions list in the Azure portal](https://ms.portal.azure.com/#blade/Microsoft_Azure_Billing/SubscriptionsBlade). Use any subscription in which your role is owner or contributor. For more information on roles, see the [Manage access to an Azure Machine Learning workspace](how-to-assign-roles.md) article.
```Python
-
ws = Workspace.from_config()
-
```
-
-
You can also create a workspace explicitly:
-
-
```Python
-
ws = Workspace.create(name='<workspace-name>',
+
ws = Workspace.create(name='myworkspace',
subscription_id='<azure-subscription-id>',
-
resource_group='<choose-a-resource-group>',
+
resource_group='myresourcegroup',
create_resource_group=True,
location='<select-location>'  # For example: 'eastus2'
The [datastore](how-to-access-data.md) is a place where data can be stored and accessed by mounting or copying the data to the compute target. Each workspace provides a default datastore. We'll upload our data and training scripts so that they can be easily accessed during training.
+
The [datastore](how-to-access-data.md) is a place where data can be stored and accessed by mounting or copying the data to the compute target. Each workspace provides a default datastore. Upload the data and training scripts to the datastore so that they can be easily accessed during training.
1. Download the MNIST dataset locally.
@@ -111,7 +106,7 @@ The [datastore](how-to-access-data.md) is a place where data can be stored and a
## Create a compute target
-
Create a compute target for your TensorFlow job to run on. In this example, we create a GPU-enabled Azure Machine Learning compute cluster. For a list of available training compute targets, see [this article](how-to-set-up-training-targets.md#compute-targets-for-training)
+
Create a compute target for your TensorFlow job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.
For more information on compute targets, see the [what is a compute target](concept-compute-target.md) article.
+
## Create a TensorFlow estimator
-
The [TensorFlow estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py) provides a simple way of launching a TensorFlow training job on a compute target. It will create a docker image that has TensorFlow installed.
+
The [TensorFlow estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py) provides a simple way of launching a TensorFlow training job on a compute target.
The TensorFlow estimator is implemented through the generic [`estimator`](https://docs.microsoft.com//python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py) class, which can be used to support any framework. For more information about training models using the generic estimator, see [train models with Azure Machine Learning using estimator](how-to-train-ml-models.md).
@@ -162,11 +159,11 @@ run = exp.submit(est)
run.wait_for_completion(show_output=True)
```
-
As the Run is executed, it will go through the following stages:
+
As the Run is executed, it goes through the following stages:
- **Preparing**: A Docker image is created according to the TensorFlow estimator. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress.
-
- **Scaling**: The cluster will attempt to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.
+
- **Scaling**: The cluster attempts to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.
- **Running**: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the `entry_script` is executed. Outputs from stdout and the `./logs` folder are streamed to the run history and can be used to monitor the run.
@@ -180,7 +177,7 @@ Once you've trained the model, you can register it to your workspace. Model regi
model = run.register_model(model_name='tf-dnn-mnist', model_path='outputs/model')
```
-
You can also download a local copy of the model by using the Run object. In the training script `mnist-tf.py`, a TensorFlow saver object persists the model to a local folder (local to the compute target). We can use the Run object to download a copy.
+
You can also download a local copy of the model by using the Run object. In the training script `mnist-tf.py`, a TensorFlow saver object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy.
```Python
# Create a model folder in the current directory
@@ -206,7 +203,7 @@ Azure Machine Learning service supports two methods of distributed training in T
[Horovod](https://github.com/uber/horovod) is an open-source framework for distributed training developed by Uber. It offers an easy path to distributed GPU TensorFlow jobs.
-
To use Horovod, specify `mpi` for the `distributed_training` parameter in the TensorFlow estimator constructor. Horovod will be installed for you to use in your training script.
+
To use Horovod, specify an [`MpiConfiguration`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.runconfig.mpiconfiguration?view=azure-ml-py) object for the `distributed_training` parameter in the TensorFlow estimator constructor. This parameter ensures that the Horovod library is installed for you to use in your training script.
You can also run [native distributed TensorFlow](https://www.tensorflow.org/deploy/distributed), which uses the parameter server model. In this method, you train across a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate the gradients.
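
The division of labor between workers and parameter servers can be illustrated without TensorFlow. The following is a toy sketch only: the function names, the one-weight linear model, and the data shards are all hypothetical, and real distributed TensorFlow performs these steps across processes rather than in a single loop. Averaging is used here as one simple way to aggregate gradients.

```python
def worker_gradient(weight, shard):
    """Worker role: gradient of mean squared error for y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def parameter_server_step(weight, shards, learning_rate=0.05):
    """Parameter server role: aggregate (average) worker gradients, then update."""
    gradients = [worker_gradient(weight, shard) for shard in shards]  # one per worker
    return weight - learning_rate * sum(gradients) / len(gradients)


# Two hypothetical workers, each holding a shard of points drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weight = 0.0
for _ in range(50):
    weight = parameter_server_step(weight, shards)
print(round(weight, 3))  # the shared weight approaches 3
```

Each worker only sees its own shard, while the shared weight lives in one place and is updated from the combined gradients.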
-
To use the parameter server method, specify `ps` for the `distributed_training` parameter in the TensorFlow estimator constructor.
+
To use the parameter server method, specify a [`TensorflowConfiguration`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.runconfig.tensorflowconfiguration?view=azure-ml-py) object for the `distributed_training` parameter in the TensorFlow estimator constructor.
You'll also need the network addresses and ports of the cluster for the [`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec), so Azure Machine Learning sets the `TF_CONFIG` environment variable for you.
+
You also need the network addresses and ports of the cluster for the [`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec), so Azure Machine Learning sets the `TF_CONFIG` environment variable for you.
The `TF_CONFIG` environment variable is a JSON string. Here is an example of the variable for a parameter server:
@@ -266,7 +263,7 @@ TF_CONFIG='{
}'
```
-
For TensorFlow's high level [`tf.estimator`](https://www.tensorflow.org/api_docs/python/tf/estimator) API, TensorFlow will parse this `TF_CONFIG` variable and build the cluster spec for you.
+
For TensorFlow's high-level [`tf.estimator`](https://www.tensorflow.org/api_docs/python/tf/estimator) API, TensorFlow parses the `TF_CONFIG` variable and builds the cluster spec for you.
For TensorFlow's lower-level core APIs for training, parse the `TF_CONFIG` variable and build the `tf.train.ClusterSpec` in your training code.
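
A minimal sketch of that parsing step, using only the Python standard library, might look like the following. The helper name `parse_tf_config` and the host names in `sample` are hypothetical, and the sketch assumes `TF_CONFIG` carries the usual `cluster` and `task` keys.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, task_type, task_index) from a TF_CONFIG JSON string.

    When raw is None, the value is read from the TF_CONFIG environment variable.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})  # maps job name -> list of "host:port"
    task = config.get("task", {})        # describes the role of this process
    return cluster, task.get("type"), task.get("index")


# Hypothetical value with made-up host names, for illustration only.
sample = json.dumps({
    "cluster": {"ps": ["host0:2222"], "worker": ["host1:2222", "host2:2222"]},
    "task": {"type": "ps", "index": 0},
})

cluster, task_type, task_index = parse_tf_config(sample)
print(task_type, task_index, sorted(cluster))
```

In a training script, the returned `cluster` dictionary could then be passed to `tf.train.ClusterSpec`, with `task_type` and `task_index` identifying the current process.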