Commit daff6ed

Merge pull request #95788 from thongonary/patch-1
Update instruction on distributed GPU training with PyTorch Lightning
2 parents 4de5811 + 1904b5e commit daff6ed

File tree: 1 file changed (+34, −23 lines)

articles/machine-learning/how-to-train-distributed-gpu.md

[PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/stable/) is a lightweight open-source library that provides a high-level interface for PyTorch. Lightning abstracts away many of the lower-level distributed training configurations required for vanilla PyTorch, and lets you run your training scripts in single-GPU, single-node multi-GPU, and multi-node multi-GPU settings. Behind the scenes, it launches multiple processes for you, similar to `torch.distributed.launch`.

For single-node training (including single-node multi-GPU), you can run your code on Azure ML without needing to specify a `distributed_job_config`.

To run an experiment using multiple nodes with multiple GPUs, there are two options:

- Using PyTorch configuration (recommended): Define `PyTorchConfiguration` and specify `communication_backend="Nccl"`, `node_count`, and `process_count`. Note that `process_count` is the total number of processes, that is, `num_nodes * process_count_per_node`. In the Lightning Trainer module, specify `num_nodes` and `gpus` to be consistent with `PyTorchConfiguration`: `num_nodes = node_count` and `gpus = process_count_per_node`.
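
  The consistency requirement between `PyTorchConfiguration` and the Trainer flags can be sketched as a small helper. This function is purely illustrative (it is not part of the Azure ML SDK or Lightning); it derives all four values from one pair of inputs so they can't drift apart:

  ```python
  def distributed_settings(node_count: int, process_count_per_node: int) -> dict:
      """Derive consistent PyTorchConfiguration and Lightning Trainer arguments.

      PyTorchConfiguration takes the *total* process count, while the Trainer
      takes the per-node GPU count; mixing these up is an easy mistake.
      """
      return {
          # pass to PyTorchConfiguration(communication_backend="Nccl", ...)
          "node_count": node_count,
          "process_count": node_count * process_count_per_node,  # total processes
          # pass to Trainer(...)
          "num_nodes": node_count,
          "gpus": process_count_per_node,
      }


  print(distributed_settings(2, 4))
  # {'node_count': 2, 'process_count': 8, 'num_nodes': 2, 'gpus': 4}
  ```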

- Using MPI configuration:

  - Define `MpiConfiguration` and specify both `node_count` and `process_count_per_node`. In the Lightning Trainer, set `num_nodes` and `gpus` to the same values as `node_count` and `process_count_per_node` from `MpiConfiguration`, respectively.

  - For multi-node training with MPI, Lightning requires the following environment variables to be set on each node of your training cluster:

    - MASTER_ADDR
    - MASTER_PORT
    - NODE_RANK
    - LOCAL_RANK

Manually set these environment variables that Lightning requires in the main training script:

```python
import os
from argparse import ArgumentParser

from pytorch_lightning import Trainer


def set_environment_variables_for_mpi(num_nodes, gpus_per_node, master_port=54965):
    if num_nodes > 1:
        # AZ_BATCH_MASTER_NODE has the format "<master address>:<master port>"
        os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
    else:
        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        os.environ["MASTER_PORT"] = str(master_port)

    try:
        # one process per GPU, so the node rank is the global rank divided by GPUs per node
        os.environ["NODE_RANK"] = str(int(os.environ["OMPI_COMM_WORLD_RANK"]) // gpus_per_node)
        # additional variables
        os.environ["MASTER_ADDRESS"] = os.environ["MASTER_ADDR"]
        os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    except KeyError:
        # the OMPI_* variables are absent when the job uses PyTorchConfiguration instead of MPI
        pass


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--num_nodes", type=int, required=True)
    parser.add_argument("--gpus_per_node", type=int, required=True)
    args = parser.parse_args()
    set_environment_variables_for_mpi(args.num_nodes, args.gpus_per_node)

    trainer = Trainer(
        num_nodes=args.num_nodes,
        gpus=args.gpus_per_node
    )
```

Lightning handles computing the world size from the Trainer flags `--gpus` and `--num_nodes`.
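
That computation amounts to one training process per GPU per node; a minimal sketch (illustrative only, not Lightning's internal code):

```python
def world_size(num_nodes: int, gpus_per_node: int) -> int:
    # One distributed training process is launched per GPU on every node.
    return num_nodes * gpus_per_node


print(world_size(2, 4))  # 8 processes in total
```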

```python
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import MpiConfiguration

nnodes = 2
gpus_per_node = 4
args = ['--max_epochs', 50, '--gpus_per_node', gpus_per_node, '--accelerator', 'ddp', '--num_nodes', nnodes]
distr_config = MpiConfiguration(node_count=nnodes, process_count_per_node=gpus_per_node)

run_config = ScriptRunConfig(
    source_directory='./src',
    # ... remaining arguments truncated in this diff
```
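
The diff cuts off inside the `ScriptRunConfig` call. For context, a typical completion looks like the fragment below; the `script`, `compute_target`, and `pytorch_env` values are placeholders for illustration, not taken from this diff:

```python
# fragment continuing the block above; compute_target and pytorch_env
# are assumed to be an Azure ML ComputeTarget and a curated/custom
# Environment with PyTorch and Lightning installed
run_config = ScriptRunConfig(
    source_directory='./src',
    script='train.py',                    # placeholder entry script
    arguments=args,
    compute_target=compute_target,        # placeholder compute target
    environment=pytorch_env,              # placeholder environment
    distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)
```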
