docs/source/advanced/advanced_gpu.rst (41 additions, 6 deletions)
@@ -170,12 +170,16 @@ Below is an example of using both ``wrap`` and ``auto_wrap`` to create your mode
 FairScale Activation Checkpointing
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. They are then re-computed for the backwards pass as needed.
+Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass. They are then re-computed for the backwards pass as needed. Activation checkpointing is very useful when you have intermediate layers that produce large activations.

 FairScale's checkpointing wrapper also handles batch norm layers correctly, unlike the PyTorch implementation, ensuring stats are tracked correctly despite the multiple forward passes.

 This saves memory when training larger models, however it requires wrapping the modules you'd like to use activation checkpointing on. See `here <https://fairscale.readthedocs.io/en/latest/api/nn/misc/checkpoint_activations.html>`__ for more information.

+.. warning::
+
+    Do not wrap the entire model with activation checkpointing. This is not the intended use of activation checkpointing, and it will lead to failures as seen in `this discussion <https://github.com/PyTorchLightning/pytorch-lightning/discussions/9144>`__.
+
 .. code-block:: python

     from pytorch_lightning import Trainer
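The snippet above is truncated by the hunk boundary, so here is only a rough sketch of the wrapping this paragraph describes (the layer names and sizes are illustrative, not taken from the patch): a large intermediate block is passed through FairScale's ``checkpoint_wrapper``, never the whole LightningModule.

.. code-block:: python

    # Sketch only: wrap the large intermediate block, not the entire model.
    import torch
    import torch.nn as nn
    import pytorch_lightning as pl
    from fairscale.nn import checkpoint_wrapper


    class MyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            # block_1's activations are freed after the forward pass and
            # recomputed when the backward pass needs them.
            self.block_1 = checkpoint_wrapper(nn.Sequential(nn.Linear(32, 32), nn.ReLU()))
            self.block_2 = nn.Linear(32, 2)  # small layer, left unwrapped

        def forward(self, x):
            return self.block_2(self.block_1(x))


    model = MyModel()
    out = model(torch.randn(4, 32, requires_grad=True))
    out.sum().backward()  # block_1's activations are recomputed here

Wrapping ``MyModel`` itself instead of ``self.block_1`` is exactly the pattern the warning above rules out.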
@@ -185,7 +189,8 @@ This saves memory when training larger models however requires wrapping modules
@@ -515 +520 @@
 Activation checkpointing frees activations from memory as soon as they are not needed during the forward pass.
 They are then re-computed for the backwards pass as needed.

-This saves memory when training larger models however requires using a checkpoint function to run the module as shown below.
+Activation checkpointing is very useful when you have intermediate layers that produce large activations.
+
+This saves memory when training larger models, however it requires using a checkpoint function to run modules as shown below.
+
+.. warning::
+
+    Do not wrap the entire model with activation checkpointing. This is not the intended use of activation checkpointing, and it will lead to failures as seen in `this discussion <https://github.com/PyTorchLightning/pytorch-lightning/discussions/9144>`__.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+    from pytorch_lightning.plugins import DeepSpeedPlugin
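The added code block is cut off after the imports. A hedged sketch of the checkpoint-function usage the new paragraph refers to (block names and Trainer flags below are placeholders, not part of the patch) routes only the heavy block through ``deepspeed.checkpointing.checkpoint`` inside ``forward``.

.. code-block:: python

    # Sketch only: run the heavy block through DeepSpeed's checkpoint function.
    import torch.nn as nn
    import pytorch_lightning as pl
    import deepspeed

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin


    class MyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.block_1 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
            self.block_2 = nn.Linear(32, 2)

        def forward(self, x):
            # block_1's activations are dropped after use and recomputed
            # during the backward pass.
            x = deepspeed.checkpointing.checkpoint(self.block_1, x)
            return self.block_2(x)


    # Placeholder Trainer arguments; pick the DeepSpeed configuration that
    # matches the rest of this guide.
    model = MyModel()
    trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin())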