
Commit 8c795d1

Merge pull request #225020 from sdgilley/patch-47
updates from Savita
2 parents 4e876e6 + 937f0ad commit 8c795d1

1 file changed (+12, -10 lines)


articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 12 additions & 10 deletions
@@ -62,15 +62,6 @@ Make sure your code follows these tips:

* For the full notebook to run the above example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb)

- ### DeepSpeed
-
- Don't use DeepSpeed's custom launcher to run distributed training with the [DeepSpeed](https://www.deepspeed.ai/) library on Azure ML. Instead, configure an MPI job to launch the training job [with MPI](https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility).
-
- Make sure your code follows these tips:
-
- * Your Azure ML environment contains DeepSpeed and its dependencies, Open MPI, and mpi4py.
- * Create an `MpiConfiguration` with your distribution.
### Environment variables from Open MPI

When running MPI jobs with Open MPI images, the following environment variables are set for each process launched:
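
The full variable list falls outside this hunk. As a rough illustration only, assuming the standard Open MPI names (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, `OMPI_COMM_WORLD_LOCAL_RANK`), a training script launched by an MPI job can read them like this:

```python
# Minimal sketch of a script launched by an Azure ML MPI job.
# The variable names are the standard Open MPI ones and are an assumption here,
# since the article's full list sits outside this diff hunk.
import os

def mpi_env():
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", 0))              # global process rank
    world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", 1))        # total number of processes
    local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0))  # rank on this node
    return rank, world_size, local_rank

if __name__ == "__main__":
    rank, world_size, local_rank = mpi_env()
    print(f"process {rank}/{world_size}, local rank {local_rank}")
```
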
@@ -128,6 +119,17 @@ Azure ML will set the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `NODE_RANK
- For the full notebook to run the above example, see [azureml-examples: Distributed training with PyTorch on CIFAR-10](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/distributed-training/distributed-cifar10.ipynb)
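
As a minimal sketch of how a per-process training script can consume the variables named above (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `NODE_RANK`), `torch.distributed` can initialize from the environment; `RANK` and `LOCAL_RANK` are assumptions here, not values from this commit:

```python
# Minimal per-process PyTorch entry point. MASTER_ADDR, MASTER_PORT, WORLD_SIZE,
# and NODE_RANK come from the text above; RANK and LOCAL_RANK are assumed for the
# env:// initialization and device selection.
import os
import torch
import torch.distributed as dist

def init_distributed():
    # Reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)  # bind this process to its GPU
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
          f"on node {os.environ.get('NODE_RANK', '0')}, local rank {local_rank}")
```
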

+ ## DeepSpeed
+
+ [DeepSpeed](https://www.deepspeed.ai/tutorials/azure/) is supported as a first-class citizen within Azure Machine Learning to run distributed jobs with near-linear scalability in terms of:
+
+ * Increase in model size
+ * Increase in number of GPUs
+
+ `DeepSpeed` can be enabled using either the PyTorch distribution or MPI for running distributed training. Azure Machine Learning supports the `DeepSpeed` launcher to launch distributed training, as well as autotuning to get the optimal `ds` configuration.
+
+ You can use a [curated environment](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) for an out-of-the-box environment with the latest state-of-the-art technologies, including `DeepSpeed`, `ORT`, `MSSCCL`, and `PyTorch`, for your DeepSpeed training jobs.
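
A rough sketch of what enabling DeepSpeed through the PyTorch distribution can look like with the v2 Python SDK; the script name, `ds_config.json`, environment, and compute target below are placeholders, not values from this commit:

```python
# Sketch only: submit a DeepSpeed training job via the PyTorch distribution.
# train.py, ds_config.json, the environment string, and the compute name are
# assumed placeholders.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",  # assumed folder containing train.py and ds_config.json
    command="python train.py --deepspeed ds_config.json",
    environment="azureml:AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu@latest",  # assumed curated environment name
    compute="gpu-cluster",  # assumed compute target
    instance_count=2,
    distribution={"type": "pytorch", "process_count_per_instance": 4},  # one process per GPU
)

ml_client.create_or_update(job)
```
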
## TensorFlow

If you're using [native distributed TensorFlow](https://www.tensorflow.org/guide/distributed_training) in your training code, such as TensorFlow 2.x's `tf.distribute.Strategy` API, you can launch the distributed job via Azure ML using `distribution` parameters or the `TensorFlowDistribution` object.
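
A compact sketch of the `distribution` parameter for a `tf.distribute.Strategy` script, following the same submission pattern as the DeepSpeed sketch above; the script, environment, and compute names are placeholders:

```python
# Sketch only: define a TensorFlow distributed job. Submission uses the same
# MLClient pattern shown in the DeepSpeed sketch earlier. Names are assumed.
from azure.ai.ml import command

job = command(
    code="./src",  # assumed folder containing train.py using tf.distribute.Strategy
    command="python train.py",
    environment="azureml:AzureML-tensorflow-2.12-cuda11@latest",  # assumed environment name
    compute="gpu-cluster",  # assumed compute target
    instance_count=2,
    distribution={"type": "tensorflow", "worker_count": 2},  # one worker per node here
)
```
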
@@ -174,4 +176,4 @@ If you create an `AmlCompute` cluster of one of these RDMA-capable, InfiniBand-e
## Next steps

* [Deploy and score a machine learning model by using an online endpoint](how-to-deploy-online-endpoints.md)
- * [Reference architecture for distributed deep learning training in Azure](/azure/architecture/reference-architectures/ai/training-deep-learning)
+ * [Reference architecture for distributed deep learning training in Azure](/azure/architecture/reference-architectures/ai/training-deep-learning)
