Commit d53dea5

Merge pull request #85333 from fabiofumarola/patch-1
Update how-to-train-distributed-gpu.md
2 parents b1908e9 + fd7c7e4 commit d53dea5

File tree

1 file changed: +55 -25 lines changed

articles/machine-learning/how-to-train-distributed-gpu.md

Lines changed: 55 additions & 25 deletions
@@ -258,31 +258,61 @@ For single-node training (including single-node multi-GPU), you can run your cod
- MASTER_PORT
- NODE_RANK

Removed (original lines 261-285):

To run multi-node Lightning training on Azure ML, you can largely follow the [per-node-launch guide](#per-node-launch):

- Define the `PyTorchConfiguration` and specify the `node_count`. Don't specify `process_count`, as Lightning internally handles launching the worker processes for each node.
- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables required by Lightning.
- Lightning will handle computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manage rank and local rank internally.

```python
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import PyTorchConfiguration

nnodes = 2
args = ['--max_epochs', 50, '--gpus', 2, '--accelerator', 'ddp', '--num_nodes', nnodes]
distr_config = PyTorchConfiguration(node_count=nnodes)

run_config = ScriptRunConfig(
    source_directory='./src',
    script='train.py',
    arguments=args,
    compute_target=compute_target,
    environment=pytorch_env,
    distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)
```

Added (new lines 261-315):

To run multi-node Lightning training on Azure ML, follow the [per-node-launch](#per-node-launch) guidance, but note that currently the `ddp` strategy works only when you run an experiment on multiple nodes with one GPU per node.

To run an experiment using multiple nodes with multiple GPUs:

- Define `MpiConfiguration` and specify `node_count`. Don't specify `process_count` because Lightning internally handles launching the worker processes for each node.
- For PyTorch jobs, Azure ML handles setting the MASTER_ADDR, MASTER_PORT, and NODE_RANK environment variables that Lightning requires:

```python
import os

def set_environment_variables_for_nccl_backend(single_node=False, master_port=6105):
    if not single_node:
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
        os.environ["MASTER_ADDR"] = master_node_params[0]

        # Do not overwrite master port with that defined in AZ_BATCH_MASTER_NODE
        if "MASTER_PORT" not in os.environ:
            os.environ["MASTER_PORT"] = str(master_port)
    else:
        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        os.environ["MASTER_PORT"] = "54965"

    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    try:
        os.environ["NODE_RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
        # additional variables
        os.environ["MASTER_ADDRESS"] = os.environ["MASTER_ADDR"]
        os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    except KeyError:
        # The OMPI_* variables aren't set when the job uses PyTorchConfiguration instead of MPI.
        pass
```
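
As a purely illustrative sketch (not part of this commit), such a helper would typically be called at the top of `train.py` before any Lightning objects are created, so the rendezvous variables are in place when the worker processes start. The `single_node` value shown is an assumption for the example:

```python
# Hypothetical usage sketch -- not part of the diff above. Assumes the
# set_environment_variables_for_nccl_backend function above is defined in train.py
# and that the job was submitted with MpiConfiguration across multiple nodes.
if __name__ == "__main__":
    set_environment_variables_for_nccl_backend(single_node=False)
    # MASTER_ADDR, MASTER_PORT, NODE_RANK, LOCAL_RANK, and WORLD_SIZE are now set,
    # so the LightningModule and Trainer can be constructed after this point.
```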

- Lightning handles computing the world size from the Trainer flags `--gpus` and `--num_nodes` and manages rank and local rank internally:

```python
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import MpiConfiguration

nnodes = 2
args = ['--max_epochs', 50, '--gpus', 2, '--accelerator', 'ddp_spawn', '--num_nodes', nnodes]
distr_config = MpiConfiguration(node_count=nnodes)

run_config = ScriptRunConfig(
    source_directory='./src',
    script='train.py',
    arguments=args,
    compute_target=compute_target,
    environment=pytorch_env,
    distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)
```
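
As a further illustration (again, not part of this commit), a minimal `train.py` along these lines would parse the submitted arguments and hand them to the Lightning `Trainer`, which derives the world size from `gpus * num_nodes`. The `TinyModel` module, the random dataset, and the argument defaults are assumptions made only for this sketch; the `Trainer(gpus=..., num_nodes=..., accelerator=...)` call uses the PyTorch Lightning 1.x-style flags that the `args` list above targets:

```python
# Hypothetical train.py sketch -- not part of the diff above.
import argparse

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    """Minimal stand-in for a real model (assumption for this example)."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--gpus", type=int, default=1)            # GPUs per node
    parser.add_argument("--num_nodes", type=int, default=1)
    parser.add_argument("--accelerator", type=str, default=None)  # e.g. 'ddp_spawn'
    args = parser.parse_args()

    # Random data keeps the sketch self-contained; a real script would load its own dataset.
    dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
    loader = DataLoader(dataset, batch_size=32)

    # Lightning computes the world size from gpus * num_nodes and reads
    # MASTER_ADDR / MASTER_PORT / NODE_RANK from the environment set by Azure ML.
    trainer = pl.Trainer(
        max_epochs=args.max_epochs,
        gpus=args.gpus,
        num_nodes=args.num_nodes,
        accelerator=args.accelerator,
    )
    trainer.fit(TinyModel(), loader)


if __name__ == "__main__":
    main()
```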

### Hugging Face Transformers
