Commit f3fd941

move ds and change heading level
1 parent 18aba59 commit f3fd941

File tree

1 file changed: +11 -10 lines changed


articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 11 additions & 10 deletions
@@ -62,16 +62,6 @@ Make sure your code follows these tips:
 
 * For the full notebook to run the above example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb)
 
-### DeepSpeed
-[DeepSpeed](https://www.deepspeed.ai/tutorials/azure/) is supported as a first-class citizen within Azure Machine Learning to run distributed jobs with near-linear scalability in terms of:
-
-* Increase in model size
-* Increase in number of GPUs
-
-`DeepSpeed` can be enabled using either the PyTorch distribution or MPI for running distributed training. Azure Machine Learning supports the `DeepSpeed` launcher to launch distributed training, as well as autotuning to get an optimal `ds` (DeepSpeed) configuration.
-
-You can use a [curated environment](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) for an out-of-the-box environment with the latest state-of-the-art technologies, including `DeepSpeed`, `ORT`, `MSSCCL`, and `PyTorch`, for your DeepSpeed training jobs.
-
 ### Environment variables from Open MPI
 
 When running MPI jobs with Open MPI images, the following environment variables are set for each process launched:
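The article's full variable list is truncated out of this excerpt. For illustration only, distributed training code commonly reads its rank and world size from the standard Open MPI per-process variables; a minimal sketch, assuming the usual `OMPI_COMM_WORLD_*` names rather than quoting the article's own list:

```python
import os

# Standard Open MPI per-process variables; shown for illustration, since
# the article's own variable list is truncated out of this excerpt.
world_rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))

print(f"global rank {world_rank} of {world_size}, local rank {local_rank}")
```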
@@ -129,6 +119,17 @@ Azure ML will set the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `NODE_RANK`
 
 - For the full notebook to run the above example, see [azureml-examples: Distributed training with PyTorch on CIFAR-10](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/pytorch/distributed-training/distributed-cifar10.ipynb)
 
+## DeepSpeed
+
+[DeepSpeed](https://www.deepspeed.ai/tutorials/azure/) is supported as a first-class citizen within Azure Machine Learning to run distributed jobs with near-linear scalability in terms of:
+
+* Increase in model size
+* Increase in number of GPUs
+
+`DeepSpeed` can be enabled using either the PyTorch distribution or MPI for running distributed training. Azure Machine Learning supports the `DeepSpeed` launcher to launch distributed training, as well as autotuning to get an optimal `ds` (DeepSpeed) configuration.
+
+You can use a [curated environment](resource-curated-environments.md#azure-container-for-pytorch-acpt-preview) for an out-of-the-box environment with the latest state-of-the-art technologies, including `DeepSpeed`, `ORT`, `MSSCCL`, and `PyTorch`, for your DeepSpeed training jobs.
+
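As a rough sketch of what launching DeepSpeed through the PyTorch distribution can look like with the Azure ML Python SDK v2: the `train.py` script, `ds_config.json`, the `gpu-cluster` compute target, and the curated-environment label below are all placeholders, not values taken from the article.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Launch DeepSpeed through the PyTorch distribution; the script, config,
# compute target, and environment label below are all placeholders.
job = command(
    code="./src",
    command="deepspeed train.py --deepspeed --deepspeed_config ds_config.json",
    environment="AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu@latest",
    compute="gpu-cluster",
    instance_count=2,
    distribution={"type": "pytorch", "process_count_per_instance": 4},
)

ml_client.jobs.create_or_update(job)
```

To use the MPI path instead, swap the `distribution` value for `{"type": "mpi", "process_count_per_instance": 4}`.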
 ## TensorFlow
 
 If you're using [native distributed TensorFlow](https://www.tensorflow.org/guide/distributed_training) in your training code, such as TensorFlow 2.x's `tf.distribute.Strategy` API, you can launch the distributed job via Azure ML using `distribution` parameters or the `TensorFlowDistribution` object.
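A minimal sketch of that launch path with the SDK v2 `command` job, using the dict form of `distribution`; the script, compute target, and environment label are placeholders.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# train.py is assumed to build a tf.distribute.MultiWorkerMirroredStrategy;
# the compute target and environment label are placeholders.
job = command(
    code="./src",
    command="python train.py",
    environment="AzureML-tensorflow-2.12-cuda11@latest",
    compute="gpu-cluster",
    instance_count=2,
    # Dict form of the distribution settings; a TensorFlowDistribution
    # object expresses the same configuration.
    distribution={"type": "tensorflow", "worker_count": 2},
)

ml_client.jobs.create_or_update(job)
```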
