articles/machine-learning/how-to-train-distributed-gpu.md
55 additions & 25 deletions
@@ -258,31 +258,61 @@ For single-node training (including single-node multi-GPU), you can run your code
- MASTER_PORT
- NODE_RANK

-To run multi-node Lightning training on Azure ML, you can largely follow the [per-node-launch guide](#per-node-launch):
-
-- Define the `PyTorchConfiguration` and specify the `node_count`. Don't specify `process_count`, as Lightning internally handles launching the worker processes for each node.
-- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables required by Lightning.
-- Lightning will handle computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manage rank and local rank internally.
-
-```python
-from azureml.core import ScriptRunConfig, Experiment
-from azureml.core.runconfig import PyTorchConfiguration
+To run multi-node Lightning training on Azure ML, follow the [per-node-launch](#per-node-launch) guidance, but note that currently, the `ddp` strategy works only when you run an experiment using multiple nodes, with one GPU per node.
+
+To run an experiment using multiple nodes with multiple GPUs:
+
+- Define `MpiConfiguration` and specify `node_count`. Don't specify `process_count` because Lightning internally handles launching the worker processes for each node.
+- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables that Lightning requires:
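
Below is a minimal sketch of a submission along these lines, using the `MpiConfiguration` approach described above. It is not the PR's actual sample; the workspace setup, compute target name, training script, environment name, and node/GPU counts are illustrative assumptions.

```python
from azureml.core import Workspace, Environment, Experiment, ScriptRunConfig
from azureml.core.runconfig import MpiConfiguration

# All names here (workspace config, "gpu-cluster", "train.py", the environment,
# and the node/GPU counts) are illustrative placeholders.
ws = Workspace.from_config()

# One launcher process per node: set only node_count and leave
# process_count_per_node at its default, since Lightning spawns the
# per-GPU worker processes itself.
distr_config = MpiConfiguration(node_count=2)

run_config = ScriptRunConfig(
    source_directory="./src",
    script="train.py",
    # Assumes train.py passes these through to the Lightning Trainer,
    # which uses them to compute the world size.
    arguments=["--gpus", "4", "--num_nodes", "2"],
    compute_target="gpu-cluster",
    environment=Environment.get(ws, name="my-pytorch-gpu-env"),  # registered or curated env
    distributed_job_config=distr_config,
)

run = Experiment(ws, "lightning-multi-node").submit(run_config)
print(run.get_portal_url())
```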