Commit 4186748

Merge pull request #79055 from PeterCLu/plu-amls-tf-patch
[AMLs] TensorFlow estimator patch
2 parents 18b2b7f + 37ed649 commit 4186748

1 file changed

+25
-28
lines changed

articles/machine-learning/service/how-to-train-tensorflow.md

Lines changed: 25 additions & 28 deletions
@@ -14,26 +14,27 @@ ms.custom: seodec18
1414

1515
# Train and register TensorFlow models at scale with Azure Machine Learning service
1616

17-
This article shows you how to train and register a TensorFlow model using Azure Machine Learning service. We'll be using the popular [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to classify handwritten digits using a deep neural network built using the [TensorFlow Python library](https://www.tensorflow.org/overview).
17+
This article shows you how to train and register a TensorFlow model using Azure Machine Learning service. It uses the popular [MNIST dataset](http://yann.lecun.com/exdb/mnist/) to classify handwritten digits using a deep neural network built using the [TensorFlow Python library](https://www.tensorflow.org/overview).
1818

19-
With Azure Machine Learning service, you'll be able to rapidly scale out your open-source training jobs using elastic cloud compute resources. You'll also be able track your training runs, version models, deploy models, and much more.
19+
With Azure Machine Learning service, you can rapidly scale out open-source training jobs using elastic cloud compute resources. You can also track your training runs, version models, deploy models, and much more.
2020

21-
Whether you're developing a TensorFlow model from the ground-up or you're bringing an existing model into the cloud, you can build production-ready models with Azure Machine Learning service.
21+
Whether you're developing a TensorFlow model from the ground up or bringing an existing model into the cloud, Azure Machine Learning service can help you build production-ready models.
2222

2323
## Prerequisites
2424

25-
- Install the [Azure Machine Learning SDK for Python](setup-create-workspace.md#sdk). Optional: create a `config.json` configuration file.
26-
- Download the [sample script files](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow) `mnist-tf.py` and `utils.py`
25+
- An Azure subscription. Try the [free or paid version of Azure Machine Learning service](https://aka.ms/AMLFree) today.
26+
- [Install the Azure Machine Learning SDK for Python](setup-create-workspace.md#sdk)
27+
- [Download the sample script files](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow) `mnist-tf.py` and `utils.py`
2728

28-
You can also find a completed [Jupyter Notebook version](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb) of this guide on our Github samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning and model deployment.
29+
You can also find a completed [Jupyter Notebook version](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb) of this guide on the GitHub samples page. The notebook includes expanded sections covering intelligent hyperparameter tuning and model deployment.
2930

3031
## Set up the experiment
3132

32-
This section sets up the training experiment by loading the required python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts using the Python SDK.
33+
This section sets up the training experiment by loading the required Python packages, initializing a workspace, creating an experiment, and uploading the training data and training scripts.
3334

3435
### Import packages
3536

36-
First, we'll need to import the necessary Python libraries.
37+
First, import the necessary Python libraries.
3738

3839
```Python
3940
import os
@@ -52,18 +53,12 @@ from azureml.core.compute_target import ComputeTargetException
5253

5354
The [Azure Machine Learning service workspace](concept-workspace.md) is the top-level resource for the service. It provides you with a centralized place to work with all the artifacts you create. In the Python SDK, you can access the workspace artifacts by creating a [`workspace`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) object.
5455

55-
If you completed the optional step in the [prerequisites section](#prerequisites), you can use `Workspace.from_config()` to quickly create a workspace object from the details stored in the config file.
56+
To create a workspace, first find the value for the `<azure-subscription-id>` parameter in the [subscriptions list in the Azure portal](https://ms.portal.azure.com/#blade/Microsoft_Azure_Billing/SubscriptionsBlade). Use any subscription in which your role is owner or contributor. For more information on roles, see the [Manage access to an Azure Machine Learning workspace](how-to-assign-roles.md) article.
5657

5758
```Python
58-
ws = Workspace.from_config()
59-
```
60-
61-
You can also create a workspace explicitly:
62-
63-
```Python
64-
ws = Workspace.create(name='<workspace-name>',
59+
ws = Workspace.create(name='myworkspace',
6560
subscription_id='<azure-subscription-id>',
66-
resource_group='<choose-a-resource-group>',
61+
resource_group='myresourcegroup',
6762
create_resource_group=True,
6863
location='<select-location>' # For example: 'eastus2'
6964
)
@@ -82,7 +77,7 @@ exp = Experiment(workspace=ws, name='tf-mnist')
8277

8378
### Upload dataset and scripts
8479

85-
The [datastore](how-to-access-data.md) is a place where data can be stored and accessed by mounting or copying the data to the compute target. Each workspace provides a default datastore. We'll upload our data and training scripts so that they can be easily accessed during training.
80+
The [datastore](how-to-access-data.md) is a place where data can be stored and accessed by mounting or copying the data to the compute target. Each workspace provides a default datastore. Upload the data and training scripts to the datastore so that they can be easily accessed during training.
8681

8782
1. Download the MNIST dataset locally.
8883

@@ -111,7 +106,7 @@ The [datastore](how-to-access-data.md) is a place where data can be stored and a
111106

112107
## Create a compute target
113108

114-
Create a compute target for your TensorFlow job to run on. In this example, we create a GPU-enabled Azure Machine Learning compute cluster. For a list of available training compute targets, see [this article](how-to-set-up-training-targets.md#compute-targets-for-training)
109+
Create a compute target for your TensorFlow job to run on. In this example, create a GPU-enabled Azure Machine Learning compute cluster.
115110

116111
```Python
117112
cluster_name = "gpucluster"
@@ -129,9 +124,11 @@ except ComputeTargetException:
129124
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
130125
```
131126

127+
For more information on compute targets, see the [what is a compute target](concept-compute-target.md) article.
128+
132129
## Create a TensorFlow estimator
133130

134-
The [TensorFlow estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py) provides a simple way of launching a TensorFlow training job on a compute target. It will create a docker image that has TensorFlow installed.
131+
The [TensorFlow estimator](https://docs.microsoft.com/python/api/azureml-train-core/azureml.train.dnn.tensorflow?view=azure-ml-py) provides a simple way of launching a TensorFlow training job on a compute target.
135132

136133
The TensorFlow estimator is implemented through the generic [`estimator`](https://docs.microsoft.com//python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py) class, which can be used to support any framework. For more information about training models using the generic estimator, see [train models with Azure Machine Learning using estimator](how-to-train-ml-models.md)
137134

@@ -162,11 +159,11 @@ run = exp.submit(est)
162159
run.wait_for_completion(show_output=True)
163160
```
164161

165-
As the Run is executed, it will go through the following stages:
162+
As the Run is executed, it goes through the following stages:
166163

167164
- **Preparing**: A docker image is created according to the TensorFlow estimator. The image is uploaded to the workspace's container registry and cached for later runs. Logs are also streamed to the run history and can be viewed to monitor progress.
168165

169-
- **Scaling**: The cluster will attempt to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.
166+
- **Scaling**: The cluster attempts to scale up if the Batch AI cluster requires more nodes to execute the run than are currently available.
170167

171168
- **Running**: All scripts in the script folder are uploaded to the compute target, data stores are mounted or copied, and the entry_script is executed. Outputs from stdout and the ./logs folder are streamed to the run history and can be used to monitor the run.
172169

@@ -180,7 +177,7 @@ Once you've trained the model, you can register it to your workspace. Model regi
180177
model = run.register_model(model_name='tf-dnn-mnist', model_path='outputs/model')
181178
```
182179

183-
You can also download a local copy of the model by using the Run object. In the training script `mnist-tf.py`, a TensorFlow saver object persists the model to a local folder (local to the compute target). We can use the Run object to download a copy.
180+
You can also download a local copy of the model by using the Run object. In the training script `mnist-tf.py`, a TensorFlow saver object persists the model to a local folder (local to the compute target). You can use the Run object to download a copy.
184181

185182
```Python
186183
# Create a model folder in the current directory
@@ -206,7 +203,7 @@ Azure Machine Learning service supports two methods of distributed training in T
206203

207204
[Horovod](https://github.com/uber/horovod) is an open-source framework for distributed training developed by Uber. It offers an easy path to distributed GPU TensorFlow jobs.
208205

209-
To use Horovod, specify `mpi` for the `distributed_training` parameter in the TensorFlow estimator constructor. Horovod will be installed for you to use in your training script.
206+
To use Horovod, specify an [`MpiConfiguration`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.runconfig.mpiconfiguration?view=azure-ml-py) object for the `distributed_training` parameter in the TensorFlow constructor. This parameter ensures that the Horovod library is installed for you to use in your training script.
210207

211208
```Python
212209
from azureml.train.dnn import TensorFlow
@@ -227,7 +224,7 @@ estimator= TensorFlow(source_directory=project_folder,
227224

228225
You can also run [native distributed TensorFlow](https://www.tensorflow.org/deploy/distributed), which uses the parameter server model. In this method, you train across a cluster of parameter servers and workers. The workers calculate the gradients during training, while the parameter servers aggregate the gradients.
229226

230-
To use the parameter server method, specify `ps` for the `distributed_training` parameter in the TensorFlow estimator constructor.
227+
To use the parameter server method, specify a [`TensorflowConfiguration`](https://docs.microsoft.com/python/api/azureml-core/azureml.core.runconfig.tensorflowconfiguration?view=azure-ml-py) object for the `distributed_training` parameter in the TensorFlow constructor.
231228

232229
```Python
233230
from azureml.train.dnn import TensorFlow
@@ -242,7 +239,7 @@ estimator= TensorFlow(source_directory=project_folder,
242239
entry_script='script.py',
243240
node_count=2,
244241
process_count_per_node=1,
245-
distributed_backend=distributed_training,
242+
distributed_training=distributed_training,
246243
use_gpu=True)
247244

248245
# submit the TensorFlow job
@@ -251,7 +248,7 @@ run = exp.submit(tf_est)
251248

252249
#### Define cluster specifications in `TF_CONFIG`
253250

254-
You'll also need the network addresses and ports of the cluster for the [`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec), so Azure Machine Learning sets the `TF_CONFIG` environment variable for you.
251+
You also need the network addresses and ports of the cluster for the [`tf.train.ClusterSpec`](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec), so Azure Machine Learning sets the `TF_CONFIG` environment variable for you.
255252

256253
The `TF_CONFIG` environment variable is a JSON string. Here is an example of the variable for a parameter server:
257254

@@ -266,7 +263,7 @@ TF_CONFIG='{
266263
}'
267264
```
268265

269-
For TensorFlow's high level [`tf.estimator`](https://www.tensorflow.org/api_docs/python/tf/estimator) API, TensorFlow will parse this `TF_CONFIG` variable and build the cluster spec for you.
266+
For TensorFlow's high-level [`tf.estimator`](https://www.tensorflow.org/api_docs/python/tf/estimator) API, TensorFlow parses the `TF_CONFIG` variable and builds the cluster spec for you.
270267

271268
For TensorFlow's lower-level core APIs for training, parse the `TF_CONFIG` variable and build the `tf.train.ClusterSpec` in your training code.
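
The parsing step for the lower-level APIs might look like the following minimal sketch. It uses only the Python standard library and assumes the usual `TF_CONFIG` layout with a `cluster` object and a `task` object. The helper name `parse_tf_config` and the host addresses are hypothetical and chosen for illustration only; they are not part of the sample scripts.

```Python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, task_type, task_index) from a TF_CONFIG JSON string.

    If no string is passed, the TF_CONFIG environment variable is read.
    A missing variable yields an empty cluster and no task type.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})   # job name -> list of "host:port"
    task = config.get("task", {})         # role of the current process
    return cluster, task.get("type"), int(task.get("index", 0))


# Hypothetical value for illustration; on a real run the service sets TF_CONFIG.
example = json.dumps({
    "cluster": {
        "ps": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

cluster, task_type, task_index = parse_tf_config(example)

# Address of the current process within the cluster
print(cluster[task_type][task_index])
```

In a training script, the `cluster` dictionary returned here is the kind of mapping that `tf.train.ClusterSpec` accepts, and `task_type` and `task_index` identify whether the current process is a parameter server or a worker.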
272269
