Commit 2eb6214

Update the DDP optimizations page (#18344)
1 parent 2ca1571 commit 2eb6214

2 files changed: 60 additions and 46 deletions

docs/source-pytorch/advanced/ddp_optimizations.rst

Lines changed: 58 additions & 44 deletions
@@ -2,39 +2,70 @@
22

33
.. _ddp-optimizations:
44

5-
*****************
5+
#################
66
DDP Optimizations
7-
*****************
7+
#################
88

9+
Tune settings specific to DDP training for increased speed and memory efficiency.
10+
11+
12+
----
913

10-
DDP Static Graph
11-
================
1214

13-
`DDP static graph <https://pytorch.org/blog/pytorch-1.11-released/#stable-ddp-static-graph>`__ assumes that your model
14-
employs the same set of used/unused parameters in every iteration, so that it can deterministically know the flow of
15-
training and apply special optimizations during runtime.
15+
***********************
16+
Gradient as Bucket View
17+
***********************
18+
19+
Enabling ``gradient_as_bucket_view=True`` in the ``DDPStrategy`` makes gradients views that point to different offsets of the ``allreduce`` communication buckets.
20+
See :class:`~torch.nn.parallel.DistributedDataParallel` for more information.
21+
This can reduce peak memory usage and improve throughput: the memory saved equals the total gradient memory, and gradients no longer need to be copied to the ``allreduce`` communication buckets.
22+
23+
.. code-block:: python
24+
25+
import lightning as L
26+
from lightning.pytorch.strategies import DDPStrategy
27+
28+
model = MyModel()
29+
trainer = L.Trainer(devices=4, strategy=DDPStrategy(gradient_as_bucket_view=True))
30+
trainer.fit(model)
1631
1732
.. note::
18-
DDP static graph support requires PyTorch>=1.11.0
33+
When ``gradient_as_bucket_view=True`` you cannot call ``detach_()`` on gradients.
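
To picture what "gradients as views into a bucket" means, the following is a plain-Python analogy built on ``memoryview``. It is only a sketch and does not reflect how PyTorch implements its buckets; the names ``bucket``, ``grad_weight`` and ``grad_bias`` are hypothetical.

.. code-block:: python

    from array import array

    # One flat "communication bucket" holding the gradients of two parameters
    bucket = array("f", [0.0] * 6)
    flat = memoryview(bucket)

    # Each "gradient" is a view at a different offset of the bucket, not a copy
    grad_weight = flat[0:4]
    grad_bias = flat[4:6]

    # Writing through a view updates the bucket in place, so no copy is needed
    grad_weight[0] = 1.5
    grad_bias[1] = -2.0

    print(bucket.tolist())  # [1.5, 0.0, 0.0, 0.0, 0.0, -2.0]

Because the views share storage with the bucket, no separate gradient buffers are allocated and nothing has to be copied before communication, which is the saving described above.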
34+
35+
36+
----
37+
38+
39+
****************
40+
DDP Static Graph
41+
****************
42+
43+
`DDP static graph <https://pytorch.org/blog/pytorch-1.11-released/#stable-ddp-static-graph>`__ assumes that your model employs the same set of used/unused parameters in every iteration, so that it can deterministically know the flow of training and apply special optimizations during runtime.
1944

2045
.. code-block:: python
2146
22-
from lightning.pytorch import Trainer
47+
import lightning as L
2348
from lightning.pytorch.strategies import DDPStrategy
2449
25-
trainer = Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
50+
trainer = L.Trainer(devices=4, strategy=DDPStrategy(static_graph=True))
51+
2652
53+
----
2754

28-
When Using DDP on a Multi-node Cluster, Set NCCL Parameters
29-
===========================================================
3055

31-
`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library that is used by PyTorch to handle communication across nodes and GPUs. Adjusting NCCL parameters has been reported to speed up training, as seen in this `issue <https://github.com/Lightning-AI/lightning/issues/7179>`__. In the issue, we see a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement in training with Detectron2.
56+
********************************************
57+
On a Multi-Node Cluster, Set NCCL Parameters
58+
********************************************
3259

60+
`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library that is used by PyTorch to handle communication across nodes and GPUs.
61+
Adjusting NCCL parameters has been reported to speed up training, as seen in this `issue <https://github.com/Lightning-AI/lightning/issues/7179>`__.
62+
In the issue, we see a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement in training with Detectron2.
3363
NCCL parameters can be adjusted via environment variables.
3464

3565
.. note::
3666

37-
AWS and GCP already set default values for these on their clusters. This is typically useful for custom cluster setups.
67+
AWS and GCP already set default values for these on their clusters.
68+
This is typically useful for custom cluster setups.
3869

3970
* `NCCL_NSOCKS_PERTHREAD <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nsocks-perthread>`__
4071
* `NCCL_SOCKET_NTHREADS <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-nthreads>`__
@@ -46,42 +77,25 @@ NCCL parameters can be adjusted via environment variables.
4677
export NCCL_SOCKET_NTHREADS=2
4778
4879
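When exporting variables in the shell is inconvenient (for example, when launching from a wrapper script), the same settings can be applied from Python at the top of the training script, before the Trainer is created. The sketch below is only an illustration: ``set_nccl_env`` is a hypothetical helper and not part of Lightning or NCCL, the value ``4`` for ``NCCL_NSOCKS_PERTHREAD`` is an assumed example, and the product check mirrors the limit of 64 stated in the NCCL documentation.

.. code-block:: python

    import os


    def set_nccl_env(nsocks_perthread=4, socket_nthreads=2, overwrite=False):
        """Set NCCL socket tuning variables for this process (hypothetical helper)."""
        if nsocks_perthread * socket_nthreads > 64:
            raise ValueError("NCCL_NSOCKS_PERTHREAD * NCCL_SOCKET_NTHREADS must not exceed 64")
        values = {
            "NCCL_NSOCKS_PERTHREAD": str(nsocks_perthread),
            "NCCL_SOCKET_NTHREADS": str(socket_nthreads),
        }
        for name, value in values.items():
            # Keep values already provided by the cluster unless told otherwise
            if overwrite or name not in os.environ:
                os.environ[name] = value
        return {name: os.environ[name] for name in values}


    # Call this on every rank before any distributed communication starts
    applied = set_nccl_env()
    print(applied)

By default the helper leaves existing values untouched, so defaults that a cloud provider has already set are preserved.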
49-
Gradients as Bucket View
50-
========================
51-
52-
Enabling ``gradient_as_bucket_view=True`` in the ``DDPStrategy`` makes gradients views that point to different offsets of the ``allreduce`` communication buckets. See :class:`~torch.nn.parallel.DistributedDataParallel` for more information.
53-
54-
This can reduce peak memory usage and improve throughput: the memory saved equals the total gradient memory, and gradients no longer need to be copied to the ``allreduce`` communication buckets.
55-
56-
.. note::
57-
58-
When ``gradient_as_bucket_view=True`` you cannot call ``detach_()`` on gradients. If hitting such errors, please fix it by referring to the :meth:`~torch.optim.Optimizer.zero_grad` function in ``torch/optim/optimizer.py`` as a solution (`source <https://pytorch.org/docs/master/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel>`__).
59-
60-
.. code-block:: python
61-
62-
from lightning.pytorch import Trainer
63-
from lightning.pytorch.strategies import DDPStrategy
64-
65-
model = MyModel()
66-
trainer = Trainer(accelerator="gpu", devices=4, strategy=DDPStrategy(gradient_as_bucket_view=True))
67-
trainer.fit(model)
80+
----
6881

6982

83+
***********************
7084
DDP Communication Hooks
71-
=======================
72-
73-
DDP communication hooks are an interface to control how gradients are communicated across workers, overriding the standard allreduce in DistributedDataParallel. This allows you to enable performance-improving communication hooks when using multiple nodes.
85+
***********************
7486

87+
DDP communication hooks are an interface to control how gradients are communicated across workers, overriding the standard allreduce in :class:`~torch.nn.parallel.DistributedDataParallel`.
88+
This allows you to enable performance-improving communication hooks when using multiple nodes.
7589
Enable `FP16 Compress Hook for multi-node throughput improvement <https://pytorch.org/docs/stable/ddp_comm_hooks.html#torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook>`__:
7690

7791
.. code-block:: python
7892
79-
from lightning.pytorch import Trainer
93+
import lightning as L
8094
from lightning.pytorch.strategies import DDPStrategy
8195
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
8296
8397
model = MyModel()
84-
trainer = Trainer(accelerator="gpu", devices=4, strategy=DDPStrategy(ddp_comm_hook=default.fp16_compress_hook))
98+
trainer = L.Trainer(accelerator="gpu", devices=4, strategy=DDPStrategy(ddp_comm_hook=default.fp16_compress_hook))
8599
trainer.fit(model)
86100
87101
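The FP16 compress hook casts each gradient bucket to half precision before the allreduce and casts the result back afterwards, which halves the bytes sent over the network at the cost of some precision. The standard-library sketch below is only an analogy for that trade-off and does not call PyTorch; ``compress`` and ``decompress`` are hypothetical names.

.. code-block:: python

    import struct


    def compress(values):
        """Pack floats as IEEE 754 half precision (2 bytes each)."""
        return struct.pack(f"<{len(values)}e", *values)


    def decompress(payload):
        """Unpack half-precision bytes back into Python floats."""
        return list(struct.unpack(f"<{len(payload) // 2}e", payload))


    grads = [0.5, -1.25, 3.0, 0.1]
    full = struct.pack(f"<{len(grads)}f", *grads)  # float32 payload
    half = compress(grads)

    print(len(full), len(half))  # 16 8
    print(decompress(half))  # 0.1 comes back slightly changed

Values such as ``0.5`` survive the round trip exactly, while ``0.1`` does not, which illustrates why compression hooks trade a small amount of gradient precision for bandwidth.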
Enable `PowerSGD for multi-node throughput improvement <https://pytorch.org/docs/stable/ddp_comm_hooks.html#powersgd-communication-hook>`__:
@@ -92,12 +106,12 @@ Enable `PowerSGD for multi-node throughput improvement <https://pytorch.org/docs
92106

93107
.. code-block:: python
94108
95-
from lightning.pytorch import Trainer
109+
import lightning as L
96110
from lightning.pytorch.strategies import DDPStrategy
97111
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD
98112
99113
model = MyModel()
100-
trainer = Trainer(
114+
trainer = L.Trainer(
101115
accelerator="gpu",
102116
devices=4,
103117
strategy=DDPStrategy(
@@ -116,15 +130,15 @@ Combine hooks for accumulated benefit:
116130

117131
.. code-block:: python
118132
119-
from lightning.pytorch import Trainer
133+
import lightning as L
120134
from lightning.pytorch.strategies import DDPStrategy
121135
from torch.distributed.algorithms.ddp_comm_hooks import (
122136
default_hooks as default,
123137
powerSGD_hook as powerSGD,
124138
)
125139
126140
model = MyModel()
127-
trainer = Trainer(
141+
trainer = L.Trainer(
128142
accelerator="gpu",
129143
devices=4,
130144
strategy=DDPStrategy(
@@ -144,12 +158,12 @@ When using Post-localSGD, you must also pass ``model_averaging_period`` to allow
144158

145159
.. code-block:: python
146160
147-
from lightning.pytorch import Trainer
161+
import lightning as L
148162
from lightning.pytorch.strategies import DDPStrategy
149163
from torch.distributed.algorithms.ddp_comm_hooks import post_localSGD_hook as post_localSGD
150164
151165
model = MyModel()
152-
trainer = Trainer(
166+
trainer = L.Trainer(
153167
accelerator="gpu",
154168
devices=4,
155169
strategy=DDPStrategy(

docs/source-pytorch/advanced/model_init.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,9 @@
22

33
.. _model_init:
44

5-
************************
5+
########################
66
Efficient initialization
7-
************************
7+
########################
88

99
Instantiating a ``nn.Module`` in PyTorch creates all parameters on CPU in float32 precision by default.
1010
To speed up initialization, you can force PyTorch to create the model directly on the target device and with the desired precision without changing your model code.
