---
title: Distributed GPU training guide (SDK v2)
titleSuffix: Azure Machine Learning
description: Learn best practices for distributed training with supported frameworks such as MPI, Horovod, DeepSpeed, PyTorch, and TensorFlow, and technologies such as InfiniBand.
author: rtanase
ms.author: ratanase
ms.reviewer: sgilley
---

Learn more about using distributed GPU training code in Azure Machine Learning.
## Prerequisites
Review the basic concepts of [distributed GPU training](concept-distributed-training.md), such as *data parallelism*, *distributed data parallelism*, and *model parallelism*.
> [!TIP]
> If you don't know which type of parallelism to use, more than 90% of the time you should use **distributed data parallelism**.
### Environment variables from Open MPI
When running MPI jobs with Open MPI images, you can use the following environment variables for each launched process:
1. `OMPI_COMM_WORLD_RANK`: The rank of the process
2. `OMPI_COMM_WORLD_SIZE`: The world size
3. `AZ_BATCH_MASTER_NODE`: The primary address with port, `MASTER_ADDR:MASTER_PORT`
4. `OMPI_COMM_WORLD_LOCAL_RANK`: The local rank of the process on the node
5. `OMPI_COMM_WORLD_LOCAL_SIZE`: The number of processes on the node
> [!TIP]
> Despite the name, the environment variable `OMPI_COMM_WORLD_NODE_RANK` doesn't correspond to the `NODE_RANK`. To use per-node-launcher, set `process_count_per_node=1` and use `OMPI_COMM_WORLD_RANK` as the `NODE_RANK`.
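
For illustration, here's a minimal Python sketch of how a training script might read these variables in each launched process. This is a sketch under the assumptions above, not a required pattern; it assumes `AZ_BATCH_MASTER_NODE` has the `MASTER_ADDR:MASTER_PORT` form described earlier.

```python
import os

# Read the Open MPI environment variables described above.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])              # global rank of this process
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])        # total number of processes
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])  # rank within this node
local_size = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"])  # processes on this node

# AZ_BATCH_MASTER_NODE has the form "MASTER_ADDR:MASTER_PORT".
master_addr, master_port = os.environ["AZ_BATCH_MASTER_NODE"].split(":")

print(f"process {rank}/{world_size} (local {local_rank}/{local_size}), "
      f"primary node {master_addr}:{master_port}")
```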
## PyTorch
The most common communication backends used are `mpi`, `nccl`, and `gloo`. For GPU-based training, `nccl` is recommended.
`init_method` tells each process how to discover the others, and how to initialize and verify the process group by using the communication backend. By default, if `init_method` isn't specified, PyTorch uses the environment variable initialization method (`env://`). This is the recommended initialization method to use in your training code to run distributed PyTorch on Azure Machine Learning. PyTorch looks for the following environment variables for initialization:
- **`MASTER_ADDR`**: IP address of the machine that hosts the process with rank 0
- **`MASTER_PORT`**: A free port on the machine that hosts the process with rank 0
- **`WORLD_SIZE`**: The total number of processes. Should be equal to the total number of devices (GPUs) used for distributed training
- **`RANK`**: The (global) rank of the current process. The possible values are 0 to (world size - 1)
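
With these four variables set, a minimal initialization sketch looks like the following. It assumes the `nccl` backend for GPU training; swap in `gloo` for CPU-only runs.

```python
import torch.distributed as dist

# MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK are read from the
# environment, so env:// initialization needs no extra arguments.
dist.init_process_group(backend="nccl", init_method="env://")

print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
```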
For more information on process group initialization, see the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group).
## DeepSpeed
Azure Machine Learning supports [DeepSpeed](https://www.deepspeed.ai/tutorials/azure/) as a first-class citizen to run distributed jobs with near-linear scalability in terms of:
* Increase in model size
* Increase in number of GPUs
You can enable DeepSpeed by using either the PyTorch distribution or MPI to run distributed training. Azure Machine Learning supports the DeepSpeed launcher to launch distributed training, as well as autotuning to get the optimal `ds` configuration.
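
As a hedged illustration, an SDK (v2) job that launches a DeepSpeed script through the PyTorch distribution might look like the following sketch. The compute, environment, and file names (`gpu-cluster`, `deepspeed-env`, `train.py`, `ds_config.json`) are hypothetical placeholders, not prescribed values.

```python
from azure.ai.ml import command

# Sketch only: all asset names below are hypothetical placeholders.
job = command(
    code="./src",                        # folder with train.py and ds_config.json
    command="python train.py --deepspeed ds_config.json",
    environment="deepspeed-env@latest",  # e.g., a curated DeepSpeed environment
    compute="gpu-cluster",
    instance_count=2,                    # number of nodes
    distribution={"type": "pytorch", "process_count_per_instance": 4},
)
```

You would then submit the job, for example with `ml_client.jobs.create_or_update(job)`.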
For your DeepSpeed training jobs, you can use a [curated environment](resource-curated-environments.md) for an out-of-the-box environment with the latest state-of-the-art technologies, including DeepSpeed, ORT, MSSCCL, and PyTorch.
### DeepSpeed example
* For DeepSpeed training and autotuning examples, see [these folders](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/deepspeed).
## TensorFlow
If you use [native distributed TensorFlow](https://www.tensorflow.org/guide/distributed_training) in your training code, such as TensorFlow 2.x's `tf.distribute.Strategy` API, you can launch the distributed job via Azure Machine Learning by using `distribution` parameters or the `TensorFlowDistribution` object.
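
As a hedged sketch, an SDK (v2) job for a `tf.distribute.MultiWorkerMirroredStrategy` script could be described as follows; the environment and compute names are hypothetical placeholders.

```python
from azure.ai.ml import command

# Sketch only: asset names are hypothetical placeholders.
job = command(
    code="./src",
    command="python train.py",
    environment="tensorflow-env@latest",
    compute="gpu-cluster",
    instance_count=2,                    # number of nodes
    distribution={"type": "tensorflow", "worker_count": 2},
)
```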