
Commit 258b353

Merge pull request #280226 from sdgilley/sdg-content-maintenance: Update how-to-train-distributed-gpu.md

2 parents: 648ac2b + 19f62c7

File tree: 1 file changed (+1 −46 lines)

articles/machine-learning/how-to-train-distributed-gpu.md
Lines changed: 1 addition & 46 deletions
@@ -1,7 +1,7 @@
 ---
 title: Distributed GPU training guide (SDK v2)
 titleSuffix: Azure Machine Learning
-description: Learn best practices for distributed training with supported frameworks, such as MPI, Horovod, DeepSpeed, PyTorch, TensorFlow, and InfiniBand.
+description: Learn best practices for distributed training with supported frameworks, such as PyTorch, DeepSpeed, TensorFlow, and InfiniBand.
 author: sdgilley
 ms.author: sgilley
 ms.reviewer: ratanase
@@ -18,9 +18,6 @@ ms.custom: sdkv2, update-code1
 
 Learn more about using distributed GPU training code in Azure Machine Learning. This article helps you run your existing distributed training code, and offers tips and examples for you to follow for each framework:
 
-* Message Passing Interface (MPI)
-* Horovod
-* Environment variables from Open MPI
 * PyTorch
 * TensorFlow
 * Accelerate GPU training with InfiniBand
@@ -32,48 +29,6 @@ Review the basic concepts of [distributed GPU training](concept-distributed-trai
 > [!TIP]
 > If you don't know which type of parallelism to use, more than 90% of the time you should use **distributed data parallelism**.
 
-## MPI
-
-Azure Machine Learning offers an [MPI job](https://www.mcs.anl.gov/research/projects/mpi/) to launch a given number of processes in each node. Azure Machine Learning constructs the full MPI launch command (`mpirun`) behind the scenes. You can't provide your own full head-node launcher command, such as `mpirun` or the DeepSpeed launcher.
-
-> [!TIP]
-> The base Docker image used by an Azure Machine Learning MPI job needs to have an MPI library installed. [Open MPI](https://www.open-mpi.org) is included in all the [Azure Machine Learning GPU base images](https://github.com/Azure/AzureML-Containers). When you use a custom Docker image, you are responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation, such as Intel MPI. Azure Machine Learning also provides [curated environments](resource-curated-environments.md) for popular frameworks.
-
-To run distributed training using MPI, follow these steps:
-
-1. Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides [curated environments](resource-curated-environments.md) for popular frameworks. Or [create a custom environment](how-to-manage-environments-v2.md#create-a-custom-environment) with the preferred deep learning framework and MPI.
-1. Define a `command` with `instance_count`. For per-process launch, set `instance_count` equal to the number of GPUs per node; for per-node launch, leave it at 1 (the default) and have the user script launch the processes on each node.
-1. Use the `distribution` parameter of the `command` to specify settings for `MpiDistribution`.
-
-[!notebook-python[](~/azureml-examples-temp-fix/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb?name=job)]
-
-### Horovod
-
-Use the MPI job configuration when you use [Horovod](https://horovod.readthedocs.io/en/stable/index.html) for distributed training with the deep learning framework.
-
-Make sure your code follows these tips:
-
-* The training code is instrumented correctly with Horovod before adding the Azure Machine Learning parts.
-* Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
-* Create a `command` with your desired distribution.
-
-### Horovod example
-
-* For the full notebook to run the Horovod example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb).
-
-### Environment variables from Open MPI
-
-When running MPI jobs with Open MPI images, you can use the following environment variables for each process launched:
-
-1. `OMPI_COMM_WORLD_RANK`: The rank of the process
-2. `OMPI_COMM_WORLD_SIZE`: The world size
-3. `AZ_BATCH_MASTER_NODE`: The primary address with port, `MASTER_ADDR:MASTER_PORT`
-4. `OMPI_COMM_WORLD_LOCAL_RANK`: The local rank of the process on the node
-5. `OMPI_COMM_WORLD_LOCAL_SIZE`: The number of processes on the node
-
-> [!TIP]
-> Despite the name, the environment variable `OMPI_COMM_WORLD_NODE_RANK` doesn't correspond to the `NODE_RANK`. To use the per-node launcher, set `process_count_per_node=1` and use `OMPI_COMM_WORLD_RANK` as the `NODE_RANK`.
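For illustration, a script launched once per process could read those Open MPI variables with a stdlib-only helper like this sketch (the helper name and the fallback defaults are mine, not from the article):

```python
import os

def read_open_mpi_env():
    """Collect the Open MPI variables Azure Machine Learning sets for each
    launched process. The fallback defaults (for a plain single-process
    run outside MPI) are illustrative."""
    # AZ_BATCH_MASTER_NODE has the form "MASTER_ADDR:MASTER_PORT".
    master = os.environ.get("AZ_BATCH_MASTER_NODE", "127.0.0.1:6105")
    master_addr, _, master_port = master.partition(":")
    return {
        "rank": int(os.environ.get("OMPI_COMM_WORLD_RANK", "0")),
        "world_size": int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1")),
        "local_rank": int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")),
        "local_size": int(os.environ.get("OMPI_COMM_WORLD_LOCAL_SIZE", "1")),
        "master_addr": master_addr,
        "master_port": int(master_port),
    }
```

A per-process training script would typically use `local_rank` to pin itself to one GPU on its node.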
-
 ## PyTorch
 
 Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (`torch.distributed`).
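As one illustration of the hand-off to `torch.distributed`: its `env://` initialization reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment, and an MPI-launched script can populate them from the Open MPI variables described earlier. The helper below is a stdlib-only sketch of that mapping, not code from the article:

```python
import os

def map_mpi_to_torch_env():
    """Translate the Open MPI variables into the variables that
    torch.distributed.init_process_group(init_method="env://") reads.
    The helper name and the fallback defaults are illustrative."""
    addr, _, port = os.environ.get(
        "AZ_BATCH_MASTER_NODE", "127.0.0.1:6105"
    ).partition(":")
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = port
    os.environ["RANK"] = os.environ.get("OMPI_COMM_WORLD_RANK", "0")
    os.environ["WORLD_SIZE"] = os.environ.get("OMPI_COMM_WORLD_SIZE", "1")
    os.environ["LOCAL_RANK"] = os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")
```

After this runs, a call such as `torch.distributed.init_process_group(backend="nccl", init_method="env://")` can pick the values up.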
