Commit daff6ed

Merge pull request #95788 from thongonary/patch-1
Update instruction on distributed GPU training with PyTorch Lightning
2 parents 4de5811 + 1904b5e commit daff6ed

File tree: 1 file changed (+34, −23 lines)

articles/machine-learning/how-to-train-distributed-gpu.md

[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) is a lightweight open-source library that provides a high-level interface for PyTorch. Lightning abstracts away many of the lower-level distributed training configurations required for vanilla PyTorch, and lets you run your training scripts in single-GPU, single-node multi-GPU, and multi-node multi-GPU settings. Behind the scenes, it launches multiple processes for you, similar to `torch.distributed.launch`.

For single-node training (including single-node multi-GPU), you can run your code on Azure ML without needing to specify a `distributed_job_config`.

To run an experiment using multiple nodes with multiple GPUs, there are two options:

- Using PyTorch configuration (recommended): Define `PyTorchConfiguration` and specify `communication_backend="Nccl"`, `node_count`, and `process_count`. Note that `process_count` is the total number of processes, that is, `num_nodes * process_count_per_node`. In the Lightning Trainer module, specify `num_nodes` and `gpus` to be consistent with `PyTorchConfiguration`: `num_nodes = node_count` and `gpus = process_count_per_node`.
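
  The consistency requirement between `PyTorchConfiguration` and the Trainer flags can be sketched as a small helper. This function is purely illustrative (it is not part of the Azure ML SDK or Lightning); it derives all four values from one pair of inputs so they can't drift apart:

  ```python
  def distributed_settings(node_count: int, process_count_per_node: int) -> dict:
      """Derive consistent PyTorchConfiguration and Lightning Trainer arguments.

      PyTorchConfiguration takes the *total* process count, while the Trainer
      takes the per-node GPU count; mixing these up is an easy mistake.
      """
      return {
          # pass to PyTorchConfiguration(communication_backend="Nccl", ...)
          "node_count": node_count,
          "process_count": node_count * process_count_per_node,  # total processes
          # pass to Trainer(...)
          "num_nodes": node_count,
          "gpus": process_count_per_node,
      }


  print(distributed_settings(2, 4))
  # {'node_count': 2, 'process_count': 8, 'num_nodes': 2, 'gpus': 4}
  ```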

- Using MPI configuration:

  - Define `MpiConfiguration` and specify both `node_count` and `process_count_per_node`. In the Lightning Trainer, set `num_nodes` and `gpus` to the same values as `node_count` and `process_count_per_node` from `MpiConfiguration`, respectively.

  - For multi-node training with MPI, Lightning requires the following environment variables to be set on each node of your training cluster:

    - MASTER_ADDR
    - MASTER_PORT
    - NODE_RANK
    - LOCAL_RANK

Manually set these environment variables that Lightning requires in the main training script:

```python
import os
from argparse import ArgumentParser

from pytorch_lightning import Trainer


def set_environment_variables_for_mpi(num_nodes, gpus_per_node, master_port=54965):
    if num_nodes > 1:
        # AZ_BATCH_MASTER_NODE has the format "<master address>:<master port>"
        os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
    else:
        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        os.environ["MASTER_PORT"] = str(master_port)

    try:
        # one process per GPU, so the node rank is the global rank divided by GPUs per node
        os.environ["NODE_RANK"] = str(int(os.environ["OMPI_COMM_WORLD_RANK"]) // gpus_per_node)
        # additional variables
        os.environ["MASTER_ADDRESS"] = os.environ["MASTER_ADDR"]
        os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    except KeyError:
        # the OMPI_* variables are absent when the job uses PyTorchConfiguration instead of MPI
        pass


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--num_nodes", type=int, required=True)
    parser.add_argument("--gpus_per_node", type=int, required=True)
    args = parser.parse_args()
    set_environment_variables_for_mpi(args.num_nodes, args.gpus_per_node)

    trainer = Trainer(
        num_nodes=args.num_nodes,
        gpus=args.gpus_per_node
    )
```

Lightning handles computing the world size from the Trainer flags `--gpus` and `--num_nodes`.
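
That computation amounts to one training process per GPU per node; a minimal sketch (illustrative only, not Lightning's internal code):

```python
def world_size(num_nodes: int, gpus_per_node: int) -> int:
    # One distributed training process is launched per GPU on every node.
    return num_nodes * gpus_per_node


print(world_size(2, 4))  # 8 processes in total
```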

```python
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import MpiConfiguration

nnodes = 2
gpus_per_node = 4
args = ['--max_epochs', 50, '--gpus_per_node', gpus_per_node, '--accelerator', 'ddp', '--num_nodes', nnodes]
distr_config = MpiConfiguration(node_count=nnodes, process_count_per_node=gpus_per_node)

run_config = ScriptRunConfig(
    source_directory='./src',
    # ... remaining arguments truncated in this diff
```
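
The diff cuts off inside the `ScriptRunConfig` call. For context, a typical completion looks like the fragment below; the `script`, `compute_target`, and `pytorch_env` values are placeholders for illustration, not taken from this diff:

```python
# fragment continuing the block above; compute_target and pytorch_env
# are assumed to be an Azure ML ComputeTarget and a curated/custom
# Environment with PyTorch and Lightning installed
run_config = ScriptRunConfig(
    source_directory='./src',
    script='train.py',                    # placeholder entry script
    arguments=args,
    compute_target=compute_target,        # placeholder compute target
    environment=pytorch_env,              # placeholder environment
    distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)
```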
