articles/machine-learning/how-to-train-distributed-gpu.md
1 addition & 46 deletions
@@ -1,7 +1,7 @@
 ---
 title: Distributed GPU training guide (SDK v2)
 titleSuffix: Azure Machine Learning
-description: Learn best practices for distributed training with supported frameworks, such as MPI, Horovod, DeepSpeed, PyTorch, TensorFlow, and InfiniBand.
+description: Learn best practices for distributed training with supported frameworks, such as PyTorch, DeepSpeed, TensorFlow, and InfiniBand.
 author: sdgilley
 ms.author: sgilley
 ms.reviewer: ratanase
@@ -18,9 +18,6 @@ ms.custom: sdkv2, update-code1

 Learn more about using distributed GPU training code in Azure Machine Learning. This article helps you run your existing distributed training code, and offers tips and examples for you to follow for each framework:

-* Message Passing Interface (MPI)
-* Horovod
-* Environment variables from Open MPI
 * PyTorch
 * TensorFlow
 * Accelerate GPU training with InfiniBand
@@ -32,48 +29,6 @@ Review the basic concepts of [distributed GPU training](concept-distributed-trai
 > [!TIP]
 > If you don't know which type of parallelism to use, more than 90% of the time you should use **distributed data parallelism**.

-## MPI
-
-Azure Machine Learning offers an [MPI job](https://www.mcs.anl.gov/research/projects/mpi/) to launch a given number of processes in each node. Azure Machine Learning constructs the full MPI launch command (`mpirun`) behind the scenes. You can't provide your own full head-node-launcher commands like `mpirun` or `DeepSpeed launcher`.
-
-> [!TIP]
-> The base Docker image used by an Azure Machine Learning MPI job needs to have an MPI library installed. [Open MPI](https://www.open-mpi.org) is included in all the [Azure Machine Learning GPU base images](https://github.com/Azure/AzureML-Containers). When you use a custom Docker image, you are responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation such as Intel MPI. Azure Machine Learning also provides [curated environments](resource-curated-environments.md) for popular frameworks.
-
-To run distributed training using MPI, follow these steps:
-
-1. Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides [curated environments](resource-curated-environments.md) for popular frameworks. Or [create a custom environment](how-to-manage-environments-v2.md#create-a-custom-environment) with the preferred deep learning framework and MPI.
-1. Define a `command` with `instance_count`. `instance_count` should be equal to the number of GPUs per node for per-process-launch, or set to 1 (the default) for per-node-launch if the user script is responsible for launching the processes per node.
-1. Use the `distribution` parameter of the `command` to specify settings for `MpiDistribution`.
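As a rough illustration of the steps being removed (not the article's own sample, whose code lines aren't shown in this diff), a job built this way might look like the sketch below. The compute target, environment reference, source folder, and training script are placeholder assumptions.

```python
# Minimal sketch of an MPI command job with the Azure ML SDK v2.
# Compute, environment, and script names are placeholders, not values from the article.
from azure.ai.ml import MLClient, MpiDistribution, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

job = command(
    code="./src",  # placeholder folder containing train.py
    command="python train.py",
    environment="azureml:my-mpi-environment@latest",  # placeholder: a registered environment that includes MPI
    compute="gpu-cluster",  # placeholder compute target
    instance_count=2,  # placeholder; set according to step 2 above
    distribution=MpiDistribution(process_count_per_node=4),  # placeholder; commonly one process per GPU
)

ml_client.jobs.create_or_update(job)
```

With these placeholder values, the MPI job launches `instance_count * process_count_per_node` processes in total (eight here).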
-Use the MPI job configuration when you use [Horovod](https://horovod.readthedocs.io/en/stable/index.html) for distributed training with the deep learning framework.
-
-Make sure your code follows these tips:
-
-* The training code is instrumented correctly with Horovod before adding the Azure Machine Learning parts.
-* Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
-* Create a `command` with your desired distribution.
-
-### Horovod example
-
-* For the full notebook to run the Horovod example, see [azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod](https://github.com/Azure/azureml-examples/blob/main/sdk/python/jobs/single-step/tensorflow/mnist-distributed-horovod/tensorflow-mnist-distributed-horovod.ipynb).
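A minimal sketch of the kind of instrumentation the tips above call for, shown with Horovod's PyTorch binding (the linked notebook uses TensorFlow); the model and optimizer are placeholders, not code from the article.

```python
# Illustrative Horovod instrumentation; model and optimizer are placeholders.
import horovod.torch as hvd
import torch

hvd.init()                               # one process per GPU, launched by the MPI job
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # scale LR by world size

# Wrap the optimizer and synchronize initial state across workers.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```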
-
-### Environment variables from Open MPI
-
-When running MPI jobs with Open MPI images, you can use the following environment variables for each process launched:
-
-1. `OMPI_COMM_WORLD_RANK`: The rank of the process
-2. `OMPI_COMM_WORLD_SIZE`: The world size
-3. `AZ_BATCH_MASTER_NODE`: The primary address with port, `MASTER_ADDR:MASTER_PORT`
-4. `OMPI_COMM_WORLD_LOCAL_RANK`: The local rank of the process on the node
-5. `OMPI_COMM_WORLD_LOCAL_SIZE`: The number of processes on the node
-
-> [!TIP]
-> Despite the name, the environment variable `OMPI_COMM_WORLD_NODE_RANK` doesn't correspond to the `NODE_RANK`. To use per-node-launcher, set `process_count_per_node=1` and use `OMPI_COMM_WORLD_RANK` as the `NODE_RANK`.
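As an illustration only, these variables are typically read as shown below. The helper is hypothetical (not part of the article), and the parsing of `AZ_BATCH_MASTER_NODE` assumes the `MASTER_ADDR:MASTER_PORT` form listed in item 3.

```python
# Hypothetical helper: read the Open MPI variables listed above.
# Assumes AZ_BATCH_MASTER_NODE has the "MASTER_ADDR:MASTER_PORT" form from item 3.
import os

def read_open_mpi_env() -> dict:
    master_addr, master_port = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
    return {
        "rank": int(os.environ["OMPI_COMM_WORLD_RANK"]),
        "world_size": int(os.environ["OMPI_COMM_WORLD_SIZE"]),
        "local_rank": int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]),
        "local_size": int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]),
        "master_addr": master_addr,
        "master_port": master_port,
    }

# Per the tip above: with process_count_per_node=1 (per-node launch), use
# OMPI_COMM_WORLD_RANK, not OMPI_COMM_WORLD_NODE_RANK, as the node rank.
```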
-
 ## PyTorch

 Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (`torch.distributed`).
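A minimal sketch of what the script side of `torch.distributed` training typically looks like with the environment-variable init method; it assumes the launcher supplies `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, and `LOCAL_RANK`, and the model is a placeholder.

```python
# Illustrative torch.distributed bootstrap; training logic is a placeholder.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # env:// init reads RANK/WORLD_SIZE/MASTER_* from the environment
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 1).cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

# ... training loop goes here ...
dist.destroy_process_group()
```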