`articles/machine-learning/how-to-train-distributed-gpu.md`
[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) is a lightweight open-source library that provides a high-level interface for PyTorch. Lightning abstracts away many of the lower-level distributed training configurations required for vanilla PyTorch. Lightning lets you run your training scripts in single GPU, single-node multi-GPU, and multi-node multi-GPU settings. Behind the scenes, it launches multiple processes for you, similar to `torch.distributed.launch`.
For single-node training (including single-node multi-GPU), you can run your code on Azure ML without needing to specify a `distributed_job_config`.

To run an experiment using multiple nodes with multiple GPUs, there are two options:

- Using PyTorch configuration (recommended): Define `PyTorchConfiguration` and specify `communication_backend="Nccl"`, `node_count`, and `process_count` (note that this is the total number of processes, that is, `num_nodes * process_count_per_node`). In the Lightning Trainer module, specify both `num_nodes` and `gpus` to be consistent with `PyTorchConfiguration`. For example, `num_nodes = node_count` and `gpus = process_count_per_node`.
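As a concrete illustration of the first option, the following sketch (Azure ML Python SDK v1) submits a two-node job with four GPUs per node. The cluster name `gpu-cluster`, source directory `./src`, and script name `train.py` are placeholder assumptions, not values from this article:

```python
# Hypothetical sketch: submit a multi-node Lightning job with PyTorchConfiguration.
# "gpu-cluster", "./src", and "train.py" are placeholders for your own resources.
from azureml.core import Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import PyTorchConfiguration

node_count = 2
process_count_per_node = 4  # one worker process per GPU

# process_count is the TOTAL number of processes across all nodes
distr_config = PyTorchConfiguration(
    communication_backend="Nccl",
    node_count=node_count,
    process_count=node_count * process_count_per_node,
)

ws = Workspace.from_config()
run_config = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    compute_target="gpu-cluster",
    distributed_job_config=distr_config,
)
run = Experiment(ws, "experiment_name").submit(run_config)
```

Inside `train.py`, the matching Lightning settings would then be `Trainer(num_nodes=node_count, gpus=process_count_per_node)`, keeping the Trainer consistent with the `PyTorchConfiguration` as described above.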
- Using MPI configuration:

  - Define `MpiConfiguration` and specify both `node_count` and `process_count_per_node`. In the Lightning Trainer, specify `num_nodes` and `gpus` to be respectively the same as `node_count` and `process_count_per_node` from `MpiConfiguration`.

  - For multi-node training with MPI, Lightning requires the following environment variables to be set on each node of your training cluster:

    - `MASTER_ADDR`
    - `MASTER_PORT`
    - `NODE_RANK`
    - `LOCAL_RANK`

    Manually set these environment variables that Lightning requires in the main training scripts:
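A minimal sketch of such a helper is below. The source environment variable names (`AZ_BATCHAI_MPI_MASTER_NODE` from Azure ML, `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_LOCAL_RANK` from Open MPI) and the default port are assumptions about what the MPI launcher exposes; verify them on your own cluster:

```python
# Hypothetical sketch: map MPI-provided environment variables to the ones
# Lightning reads. AZ_BATCHAI_MPI_MASTER_NODE and OMPI_COMM_WORLD_* are
# assumed names; confirm them in your job's environment before relying on this.
import os


def set_lightning_mpi_env(gpus_per_node: int, master_port: int = 6105) -> None:
    """Set MASTER_ADDR, MASTER_PORT, NODE_RANK, and LOCAL_RANK for Lightning."""
    # Azure ML exposes the master node's address under an MPI job
    os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
    os.environ["MASTER_PORT"] = str(master_port)

    # Open MPI gives each process a global rank; derive the node rank from it
    world_rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
    os.environ["NODE_RANK"] = str(world_rank // gpus_per_node)

    # Local rank within the node maps directly
    os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
```

Call this at the top of your training script, before constructing the Lightning `Trainer`, so the variables are in place when distributed initialization runs.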