
Commit e461e90

Update the Multi-GPU docs (#19525)
1 parent a89ea11 commit e461e90

6 files changed: +105 −152 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ docs/source-pytorch/_static/images/course_UvA-DL
 docs/source-pytorch/_static/images/lightning_examples
 docs/source-pytorch/_static/fetched-s3-assets
 docs/source-pytorch/integrations/hpu
+docs/source-pytorch/integrations/strategies/Hivemind.rst
 
 docs/source-fabric/*/generated
 

docs/source-pytorch/accelerators/gpu_intermediate.rst

Lines changed: 41 additions & 127 deletions
@@ -8,18 +8,19 @@ GPU training (Intermediate)
 
 ----
 
-Distributed Training strategies
+
+Distributed training strategies
 -------------------------------
 Lightning supports multiple ways of doing distributed training.
 
+- Regular (``strategy='ddp'``)
+- Spawn (``strategy='ddp_spawn'``)
+- Notebook/Fork (``strategy='ddp_notebook'``)
+
 .. video:: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/yt/Trainer+flags+4-+multi+node+training_3.mp4
     :poster: https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/pl_docs/trainer_flags/yt_thumbs/thumb_multi_gpus.png
     :width: 400
 
-- DistributedDataParallel (multiple-gpus across many machines)
-    - Regular (``strategy='ddp'``)
-    - Spawn (``strategy='ddp_spawn'``)
-    - Notebook/Fork (``strategy='ddp_notebook'``)
 
 .. note::
     If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.
@@ -28,22 +29,22 @@ For a deeper understanding of what Lightning is doing, feel free to read this
 `guide <https://medium.com/@_willfalcon/9-tips-for-training-lightning-fast-neural-networks-in-pytorch-8e63a502f565>`_.
 
 
+----
+
+
 Distributed Data Parallel
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 :class:`~torch.nn.parallel.DistributedDataParallel` (DDP) works as follows:
 
 1. Each GPU across each node gets its own process.
-
 2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.
-
 3. Each process inits the model.
-
 4. Each process performs a full forward and backward pass in parallel.
-
 5. The gradients are synced and averaged across all processes.
-
 6. Each process updates its optimizer.
 
+|
+
 .. code-block:: python
 
     # train on 8 GPUs (same machine (ie: node))
@@ -59,34 +60,31 @@ variables:
 
     # example for 3 GPUs DDP
     MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
-    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
-    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
+    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=1 python my_file.py --accelerator 'gpu' --devices 3 --etc
+    MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=2 python my_file.py --accelerator 'gpu' --devices 3 --etc
 
-We use DDP this way because `ddp_spawn` has a few limitations (due to Python and PyTorch):
+Using DDP this way has a few disadvantages over ``torch.multiprocessing.spawn()``:
 
-1. Since `.spawn()` trains the model in subprocesses, the model on the main process does not get updated.
-2. Dataloader(num_workers=N), where N is large, bottlenecks training with DDP... ie: it will be VERY slow or won't work at all. This is a PyTorch limitation.
-3. Forces everything to be picklable.
+1. All processes (including the main process) participate in training and have the updated state of the model and Trainer state.
+2. No multiprocessing pickle errors
+3. Easily scales to multi-node training
 
-There are cases in which it is NOT possible to use DDP. Examples are:
+|
 
-- Jupyter Notebook, Google COLAB, Kaggle, etc.
-- You have a nested script without a root package
+It is NOT possible to use DDP in interactive environments like Jupyter Notebook, Google COLAB, Kaggle, etc.
+In these situations you should use `ddp_notebook`.
 
-In these situations you should use `ddp_notebook` or `dp` instead.
 
-Distributed Data Parallel Spawn
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-`ddp_spawn` is exactly like `ddp` except that it uses .spawn to start the training processes.
+----
 
-.. warning:: It is STRONGLY recommended to use `DDP` for speed and performance.
 
-.. code-block:: python
+Distributed Data Parallel Spawn
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
+.. warning:: It is STRONGLY recommended to use DDP for speed and performance.
 
-If your script does not support being called from the command line (ie: it is nested without a root
-project module) you can use the following method:
+The `ddp_spawn` strategy is similar to `ddp` except that it uses ``torch.multiprocessing.spawn()`` to start the training processes.
+Use this for debugging only, or if you are converting a code base to Lightning that relies on spawn.
 
 .. code-block:: python
 
@@ -95,54 +93,12 @@ project module) you can use the following method:
 
 We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):
 
-1. The model you pass in will not update. Please save a checkpoint and restore from there.
-2. Set Dataloader(num_workers=0) or it will bottleneck training.
+1. After ``.fit()``, only the model's weights get restored to the main process, but no other state of the Trainer.
+2. Does not support multi-node training.
+3. It is generally slower than DDP.
 
-`ddp` is MUCH faster than `ddp_spawn`. We recommend you
-
-1. Install a top-level module for your project using setup.py
-
-.. code-block:: python
-
-    # setup.py
-    #!/usr/bin/env python
-
-    from setuptools import setup, find_packages
-
-    setup(
-        name="src",
-        version="0.0.1",
-        description="Describe Your Cool Project",
-        author="",
-        author_email="",
-        url="https://github.com/YourSeed",  # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
-        install_requires=["lightning"],
-        packages=find_packages(),
-    )
-
-2. Setup your project like so:
-
-.. code-block:: bash
 
-    /project
-        /src
-            some_file.py
-        /or_a_folder
-    setup.py
-
-3. Install as a root-level package
-
-.. code-block:: bash
-
-    cd /project
-    pip install -e .
-
-You can then call your scripts anywhere
-
-.. code-block:: bash
-
-    cd /project/src
-    python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'
+----
 
 
 Distributed Data Parallel in Notebooks
@@ -165,8 +121,11 @@ The Trainer enables it by default when such environments are detected.
 Among the native distributed strategies, regular DDP (``strategy="ddp"``) is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability but it can only be used with scripts.
 
 
+----
+
+
 Comparison of DDP variants and tradeoffs
-****************************************
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. list-table:: DDP variants and their tradeoffs
     :widths: 40 20 20 20
@@ -202,68 +161,23 @@ Comparison of DDP variants and tradeoffs
       - Fast
 
 
-Distributed and 16-bit precision
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Below are the possible configurations we support.
-
-+-------+---------+-----+--------+-----------------------------------------------------------------------+
-| 1 GPU | 1+ GPUs | DDP | 16-bit | command                                                               |
-+=======+=========+=====+========+=======================================================================+
-| Y     |         |     |        | `Trainer(accelerator="gpu", devices=1)`                               |
-+-------+---------+-----+--------+-----------------------------------------------------------------------+
-| Y     |         |     | Y      | `Trainer(accelerator="gpu", devices=1, precision=16)`                 |
-+-------+---------+-----+--------+-----------------------------------------------------------------------+
-|       | Y       | Y   |        | `Trainer(accelerator="gpu", devices=k, strategy='ddp')`               |
-+-------+---------+-----+--------+-----------------------------------------------------------------------+
-|       | Y       | Y   | Y      | `Trainer(accelerator="gpu", devices=k, strategy='ddp', precision=16)` |
-+-------+---------+-----+--------+-----------------------------------------------------------------------+
-
-DDP can also be used with 1 GPU, but there's no reason to do so other than debugging distributed-related issues.
-
-
-Implement Your Own Distributed (DDP) training
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you need your own way to init PyTorch DDP you can override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.setup_distributed`.
-
-If you also need to use your own DDP implementation, override :meth:`lightning.pytorch.strategies.ddp.DDPStrategy.configure_ddp`.
+----
 
-----------
 
-Torch Distributed Elastic
--------------------------
-Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the 'ddp' backend and the number of GPUs you want to use in the trainer.
+TorchRun (TorchElastic)
+-----------------------
+Lightning supports the use of TorchRun (previously known as TorchElastic) to enable fault-tolerant and elastic distributed job scheduling.
+To use it, specify the DDP strategy and the number of GPUs you want to use in the Trainer.
 
 .. code-block:: python
 
     Trainer(accelerator="gpu", devices=8, strategy="ddp")
 
-To launch a fault-tolerant job, run the following on all nodes.
+Then simply launch your script with the :doc:`torchrun <../clouds/cluster_intermediate_2>` command.
 
-.. code-block:: bash
-
-    python -m torch.distributed.run
-        --nnodes=NUM_NODES
-        --nproc_per_node=TRAINERS_PER_NODE
-        --rdzv_id=JOB_ID
-        --rdzv_backend=c10d
-        --rdzv_endpoint=HOST_NODE_ADDR
-        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
-
-To launch an elastic job, run the following on at least ``MIN_SIZE`` nodes and at most ``MAX_SIZE`` nodes.
 
-.. code-block:: bash
-
-    python -m torch.distributed.run
-        --nnodes=MIN_SIZE:MAX_SIZE
-        --nproc_per_node=TRAINERS_PER_NODE
-        --rdzv_id=JOB_ID
-        --rdzv_backend=c10d
-        --rdzv_endpoint=HOST_NODE_ADDR
-        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)
+----
 
-See the official `Torch Distributed Elastic documentation <https://pytorch.org/docs/stable/distributed.elastic.html>`_ for details
-on installation and more use cases.
 
 Optimize multi-machine communication
 ------------------------------------
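
For readers skimming this diff: the updated page boils down to choosing a strategy through the ``strategy`` argument of the ``Trainer``. Below is a minimal sketch, not part of the commit; the ``RandomDataset``/``DemoModel`` names are illustrative placeholders, and it assumes Lightning 2.x and a machine with 8 GPUs.

    import torch
    from torch.utils.data import DataLoader, Dataset
    import lightning as L


    class RandomDataset(Dataset):
        """Toy dataset of random vectors (placeholder, not from the docs)."""

        def __init__(self, size: int = 64, length: int = 256):
            self.data = torch.randn(length, size)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, index):
            return self.data[index]


    class DemoModel(L.LightningModule):
        """Tiny placeholder LightningModule used only to exercise the strategies."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(64, 2)

        def training_step(self, batch, batch_idx):
            # Return a scalar loss so DDP has something to synchronize.
            return self.layer(batch).sum()

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        model = DemoModel()
        train_loader = DataLoader(RandomDataset(), batch_size=32)
        # "ddp" for scripts; "ddp_spawn" for spawn-based code bases or debugging;
        # "ddp_notebook" inside interactive environments such as Jupyter.
        trainer = L.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=1)
        trainer.fit(model, train_loader)

Swapping ``strategy="ddp"`` for ``"ddp_spawn"`` or ``"ddp_notebook"`` selects the spawn and notebook variants compared in the tradeoff table referenced above.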

docs/source-pytorch/clouds/cluster_advanced.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ schedules the resources and time for which the job is allowed to run.
 
 ----
 
+
 ***************************
 Design your training script
 ***************************

docs/source-pytorch/clouds/cluster_intermediate_1.rst

Lines changed: 14 additions & 6 deletions
@@ -5,13 +5,15 @@ Run on an on-prem cluster (intermediate)
 ########################################
 **Audience**: Users who need to run on an academic or enterprise private cluster.
 
+
 ----
 
+
 .. _non-slurm:
 
-*****************
-Setup the cluster
-*****************
+******************
+Set up the cluster
+******************
 This guide shows how to run a training job on a general purpose cluster. We recommend beginners to try this method
 first because it requires the least amount of configuration and changes to the code.
 To setup a multi-node computing cluster you need:
@@ -29,11 +31,13 @@ PyTorch Lightning follows the design of `PyTorch distributed communication packa
 
 .. _training_script_setup:
 
+
 ----
 
-*************************
-Setup the training script
-*************************
+
+**************************
+Set up the training script
+**************************
 To train a model using multiple nodes, do the following:
 
 1. Design your :ref:`lightning_module` (no need to add anything specific here).
@@ -45,8 +49,10 @@ To train a model using multiple nodes, do the following:
     # train on 32 GPUs across 4 nodes
    trainer = Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
 
+
 ----
 
+
 ***************************
 Submit a job to the cluster
 ***************************
@@ -57,8 +63,10 @@ This means that you need to:
 2. Copy all your import dependencies and the script itself to each node.
 3. Run the script on each node.
 
+
 ----
 
+
 ******************
 Debug on a cluster
 ******************
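
As a companion to this file's diff, here is a rough sketch, not part of the commit, of what the multi-node setup described above expects on every node: the cluster (or you) exports MASTER_ADDR, MASTER_PORT, WORLD_SIZE and NODE_RANK, and the same script runs on each node. The 8 GPUs per node and 4 nodes mirror the example in the diff; the commented-out ``fit`` call depends on your own module and dataloader.

    import os

    import lightning as L

    if __name__ == "__main__":
        # Rendezvous variables are expected from the cluster or exported manually
        # before launching this script on each node; print them for visibility.
        for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "NODE_RANK"):
            print(f"{var}={os.environ.get(var, '<unset>')}")

        # 8 GPUs per node x 4 nodes = 32 processes, matching the diff's example.
        trainer = L.Trainer(accelerator="gpu", devices=8, num_nodes=4, strategy="ddp")
        # trainer.fit(model, train_dataloader)  # supply your project's module/dataloader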
