docs/source-pytorch/advanced/ddp_optimizations.rst (58 additions, 44 deletions)
@@ -2,39 +2,70 @@

 .. _ddp-optimizations:

-*****************
+#################
 DDP Optimizations
-*****************
+#################

+Tune settings specific to DDP training for increased speed and memory efficiency.
+
+
+----

-DDP Static Graph
-================

-`DDP static graph <https://pytorch.org/blog/pytorch-1.11-released/#stable-ddp-static-graph>`__ assumes that your model
-employs the same set of used/unused parameters in every iteration, so that it can deterministically know the flow of
-training and apply special optimizations during runtime.
+***********************
+Gradient as Bucket View
+***********************
+
+Enabling ``gradient_as_bucket_view=True`` in the ``DDPStrategy`` will make gradient views point to different offsets of the ``allreduce`` communication buckets.
+See :class:`~torch.nn.parallel.DistributedDataParallel` for more information.
+This can reduce peak memory usage (the memory saved equals the total gradient memory) and improve throughput, since it removes the need to copy gradients into the ``allreduce`` communication buckets.
+
+.. code-block:: python
+
+    import lightning as L
+    from lightning.pytorch.strategies import DDPStrategy
+When ``gradient_as_bucket_view=True`` you cannot call ``detach_()`` on gradients.
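
For illustration, here is a minimal sketch of passing the option described above to the Trainer; the accelerator and device count are placeholder assumptions, not values taken from the documentation.

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import DDPStrategy

    # Let gradient views share storage with the allreduce buckets, saving the
    # extra gradient copy. Assumes a machine with at least two GPUs.
    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,
        strategy=DDPStrategy(gradient_as_bucket_view=True),
    )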
+
+
+----
+
+
+****************
+DDP Static Graph
+****************
+
+`DDP static graph <https://pytorch.org/blog/pytorch-1.11-released/#stable-ddp-static-graph>`__ assumes that your model employs the same set of used/unused parameters in every iteration, so that it can deterministically know the flow of training and apply special optimizations during runtime.

 .. code-block:: python

-    from lightning.pytorch import Trainer
+    import lightning as L
     from lightning.pytorch.strategies import DDPStrategy
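
As a brief sketch of how the static-graph optimization is enabled with the strategy shown above (the device count is an illustrative assumption):

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import DDPStrategy

    # Tell DDP that the same parameters are used/unused in every iteration so it
    # can apply the runtime optimizations described above.
    trainer = L.Trainer(devices=4, strategy=DDPStrategy(static_graph=True))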
-`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library that is used by PyTorch to handle communication across nodes and GPUs. There are reported benefits in terms of speedups when adjusting NCCL parameters as seen in this `issue <https://github.com/Lightning-AI/lightning/issues/7179>`__. In the issue, we see a 30% speed improvement when training the Transformer XLM-RoBERTa and a 15% improvement in training with Detectron2.
+********************************************
+On a Multi-Node Cluster, Set NCCL Parameters
+********************************************

+`NCCL <https://developer.nvidia.com/nccl>`__ is the NVIDIA Collective Communications Library that is used by PyTorch to handle communication across nodes and GPUs.
+There are reported speedups from adjusting NCCL parameters, as seen in this `issue <https://github.com/Lightning-AI/lightning/issues/7179>`__.
+In the issue, a 30% speed improvement is reported for training the Transformer XLM-RoBERTa and a 15% improvement for training with Detectron2.
 NCCL parameters can be adjusted via environment variables.

 .. note::

-    AWS and GCP already set default values for these on their clusters. This is typically useful for custom cluster setups.
+    AWS and GCP already set default values for these on their clusters.
+    This is typically useful for custom cluster setups.
@@ -46,42 +77,25 @@ NCCL parameters can be adjusted via environment variables.
     export NCCL_SOCKET_NTHREADS=2


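As an alternative to exporting them in the shell, these variables can also be set from Python before the Trainer initializes the process group; a hedged sketch with illustrative values:

.. code-block:: python

    import os

    # NCCL reads these when the communicator is created, so set them before
    # instantiating the Trainer. The values below are illustrative only.
    os.environ["NCCL_NSOCKS_PERTHREAD"] = "4"
    os.environ["NCCL_SOCKET_NTHREADS"] = "2"
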
-Gradients as Bucket View
-========================
-
-Enabling ``gradient_as_bucket_view=True`` in the ``DDPStrategy`` will make gradients views point to different offsets of the ``allreduce`` communication buckets. See :class:`~torch.nn.parallel.DistributedDataParallel` for more information.
-
-This can reduce peak memory usage and throughput as saved memory will be equal to the total gradient memory + removes the need to copy gradients to the ``allreduce`` communication buckets.
-
-.. note::
-
-    When ``gradient_as_bucket_view=True`` you cannot call ``detach_()`` on gradients. If hitting such errors, please fix it by referring to the :meth:`~torch.optim.Optimizer.zero_grad` function in ``torch/optim/optimizer.py`` as a solution (`source <https://pytorch.org/docs/master/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel>`__).
-
-.. code-block:: python
-
-    from lightning.pytorch import Trainer
-    from lightning.pytorch.strategies import DDPStrategy
-DDP Communication hooks is an interface to control how gradients are communicated across workers, overriding the standard allreduce in DistributedDataParallel. This allows you to enable performance improving communication hooks when using multiple nodes.
+***********************

+DDP communication hooks are an interface to control how gradients are communicated across workers, overriding the standard allreduce in :class:`~torch.nn.parallel.DistributedDataParallel`.
+This allows you to enable performance-improving communication hooks when using multiple nodes.
 Enable `FP16 Compress Hook for multi-node throughput improvement <https://pytorch.org/docs/stable/ddp_comm_hooks.html#torch.distributed.algorithms.ddp_comm_hooks.default_hooks.fp16_compress_hook>`__:

 .. code-block:: python

-    from lightning.pytorch import Trainer
+    import lightning as L
     from lightning.pytorch.strategies import DDPStrategy
     from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default
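
A hedged sketch of registering this hook through the ``DDPStrategy`` (the accelerator and device count are illustrative assumptions):

.. code-block:: python

    import lightning as L
    from lightning.pytorch.strategies import DDPStrategy
    from torch.distributed.algorithms.ddp_comm_hooks import default_hooks as default

    # Compress gradients to FP16 before the allreduce to reduce communication
    # volume; most beneficial across slower inter-node links.
    trainer = L.Trainer(
        accelerator="gpu",
        devices=4,
        strategy=DDPStrategy(ddp_comm_hook=default.fp16_compress_hook),
    )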
docs/source-pytorch/advanced/model_init.rst (2 additions, 2 deletions)
@@ -2,9 +2,9 @@

 .. _model_init:

-************************
+########################
 Efficient initialization
-************************
+########################

 Instantiating a ``nn.Module`` in PyTorch creates all parameters on CPU in float32 precision by default.
 To speed up initialization, you can force PyTorch to create the model directly on the target device and with the desired precision without changing your model code.
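
As a sketch of what this looks like with Lightning, one option is the ``Trainer.init_module`` context manager; the model below is a placeholder and the precision and device choices are illustrative assumptions:

.. code-block:: python

    import lightning as L
    import torch.nn as nn


    class MyModel(L.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(1024, 1024)


    trainer = L.Trainer(accelerator="gpu", devices=1, precision="16-true")

    # Parameters created inside this context are materialized directly on the
    # target device and in the selected (true) precision, skipping the default
    # CPU/float32 initialization.
    with trainer.init_module():
        model = MyModel()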