articles/machine-learning/how-to-train-distributed-gpu.md
Make sure your code follows these tips:
* For the full notebook to run the above example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb)
### DeepSpeed
Don't use DeepSpeed's custom launcher to run distributed training with the [DeepSpeed](https://www.deepspeed.ai/) library on Azure ML. Instead, configure an MPI job to launch the training job [with MPI](https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility).
Make sure your code follows these tips:
* Your Azure ML environment contains DeepSpeed and its dependencies, Open MPI, and mpi4py.
* Create an `MpiConfiguration` with your distribution (a minimal sketch follows this list).
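
One way such a job might be configured with the v1 Python SDK is sketched below. The `gpu-cluster` compute target, the `./src` folder with `train.py`, and the `deepspeed-env` environment are hypothetical names; adjust the process and node counts to your cluster.

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()

# One MPI process per GPU across two nodes. Adjust to your cluster.
distr_config = MpiConfiguration(process_count_per_node=4, node_count=2)

run_config = ScriptRunConfig(
    source_directory="./src",                          # hypothetical folder containing train.py
    script="train.py",
    compute_target="gpu-cluster",                      # hypothetical AmlCompute cluster
    environment=Environment.get(ws, "deepspeed-env"),  # environment with DeepSpeed, Open MPI, and mpi4py
    distributed_job_config=distr_config,
)

run = Experiment(ws, "deepspeed-mpi-job").submit(run_config)
```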
### Environment variables from Open MPI
When running MPI jobs with Open MPI images, the following environment variables are set for each launched process:
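
As an illustration, a training script can derive its rank information from the standard Open MPI variables such as `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE`; a minimal sketch:

```python
import os

# Standard variables that the Open MPI launcher sets for each process.
world_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])        # global rank of this process
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])        # total number of processes
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # rank of this process on its node

print(f"Process {world_rank} of {world_size} (local rank {local_rank}) starting up")
```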
Azure ML will set the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `NODE_RANK` environment variables.
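
For example, a training script launched this way can typically initialize `torch.distributed` directly from those environment variables. This is a minimal sketch, assuming an NCCL backend and that `LOCAL_RANK` is available to the process:

```python
import os
import torch
import torch.distributed as dist

# "env://" reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK from the environment.
dist.init_process_group(backend="nccl", init_method="env://")

# Bind this process to its GPU (assumes LOCAL_RANK is set for per-process launches).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

print(f"Rank {dist.get_rank()} of {dist.get_world_size()} initialized")
```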
- For the full notebook to run the above example, see [azureml-examples: Distributed training with PyTorch on CIFAR-10](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/distributed-training/distributed-cifar10.ipynb)
## DeepSpeed
[DeepSpeed](https://www.deepspeed.ai/tutorials/azure/) is supported as a first-class citizen within Azure Machine Learning to run distributed jobs with near-linear scalability in terms of:
* Increase in model size
* Increase in number of GPUs
`DeepSpeed` can be enabled using either the PyTorch distribution or MPI to run distributed training. Azure Machine Learning supports the `DeepSpeed` launcher to launch distributed training, as well as autotuning to get the optimal `ds` configuration.
You can use a [curated environment](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) for an out-of-the-box environment with the latest state-of-the-art technologies, including `DeepSpeed`, `ORT`, `MSSCCL`, and `PyTorch`, for your DeepSpeed training jobs.
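
A minimal sketch of submitting such a job with the v2 Python SDK (`azure-ai-ml`) might look like the following. The compute name, source folder, DeepSpeed config file, and curated environment reference are assumptions to replace with your own:

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",                                   # hypothetical folder with train.py and ds_config.json
    command="python train.py --deepspeed ds_config.json",
    environment="AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu@latest",  # a curated ACPT environment; name and version may differ
    compute="gpu-cluster",                          # hypothetical AmlCompute cluster
    instance_count=2,
    distribution={"type": "PyTorch", "process_count_per_instance": 4},
)

ml_client.jobs.create_or_update(job)
```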
## TensorFlow
If you're using [native distributed TensorFlow](https://www.tensorflow.org/guide/distributed_training) in your training code, such as TensorFlow 2.x's `tf.distribute.Strategy` API, you can launch the distributed job via Azure ML using `distribution` parameters or the `TensorFlowDistribution` object.
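
For instance, with the v2 Python SDK a `command` job can request a TensorFlow distribution as in the sketch below. The compute, environment, and script names are hypothetical, and the training script is assumed to use `tf.distribute.MultiWorkerMirroredStrategy`:

```python
from azure.ai.ml import command

# Assumed: ./src/tf_train.py builds its model inside a MultiWorkerMirroredStrategy scope.
job = command(
    code="./src",
    command="python tf_train.py --epochs 20",
    environment="tensorflow-gpu-env@latest",        # hypothetical registered environment with TensorFlow 2.x
    compute="gpu-cluster",                          # hypothetical AmlCompute cluster
    instance_count=2,
    distribution={"type": "tensorflow", "worker_count": 2},
)
```

The job can then be submitted with `ml_client.jobs.create_or_update(job)`, as in the DeepSpeed sketch above.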
If you create an `AmlCompute` cluster of one of these RDMA-capable, InfiniBand-enabled sizes, the OS image comes with the Mellanox OFED driver required to enable InfiniBand preinstalled and preconfigured.
## Next steps
* [Deploy and score a machine learning model by using an online endpoint](how-to-deploy-online-endpoints.md)
* [Reference architecture for distributed deep learning training in Azure](/azure/architecture/reference-architectures/ai/training-deep-learning)