
Commit 6df4368

Revamp model parallel docs (FSDP) (3/n) (#18326)
1 parent cf8f8ab commit 6df4368

File tree

3 files changed: +392 −83 lines changed


docs/source-fabric/advanced/model_parallel/fsdp.rst

Lines changed: 11 additions & 8 deletions
@@ -2,19 +2,19 @@
 Training models with billions of parameters
 ###########################################
 
-Use Fully Sharded Data Parallel (FSDP) to train large models with billions or trillions of parameters efficiently on multiple GPUs and across multiple machines.
+Use Fully Sharded Data Parallel (FSDP) to train large models with billions of parameters efficiently on multiple GPUs and across multiple machines.
 
 .. note:: This is an experimental feature.
 
 
 Today, large models with billions of parameters are trained with many GPUs across several machines in parallel.
-Even a single A100 GPU with 80 GB of VRAM (the biggest today) is not enough to train just a 30B parameter model (even with batch size 1 and 16-bit precision).
+Even a single H100 GPU with 80 GB of VRAM (the biggest today) is not enough to train just a 30B parameter model (even with batch size 1 and 16-bit precision).
 The memory consumption for training is generally made up of
 
 1. the model parameters,
-2. the optimizer states (e.g., Adam has two additional exponential averages per parameter),
-3. the layer activations (forward) and
-4. the gradients (backward).
+2. the layer activations (forward),
+3. the gradients (backward) and
+4. the optimizer states (e.g., Adam has two additional exponential averages per parameter).
 
 |
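For context, the page this hunk opens is about enabling FSDP through Fabric's ``strategy`` argument. A minimal sketch of that setup (assuming Lightning 2.x; ``MyTransformer`` is a placeholder model, not part of the docs):

    import torch
    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # Shard parameters, gradients and optimizer states across all 8 GPUs.
    fabric = L.Fabric(accelerator="cuda", devices=8, strategy=FSDPStrategy())
    fabric.launch()

    model = MyTransformer()             # placeholder for any large nn.Module
    model = fabric.setup_module(model)  # wraps the model with FSDP

    # Create the optimizer after wrapping so it references the sharded parameters.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    optimizer = fabric.setup_optimizers(optimizer)
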
@@ -149,7 +149,7 @@ We can specify a list of layer classes in the **wrapping policy** to inform FSDP
 Verify that FSDP works with your model by comparing the peak memory usage printed in the CUDA memory summary (see example above) with regular DDP training.
 You should see a decrease in allocated memory and a slight increase in iteration time:
 
-.. list-table::
+.. list-table:: Numbers were produced with A100 40GB GPUs, Lightning 2.1 and PyTorch 2.1.
    :widths: 25 25 25
    :header-rows: 1
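The "CUDA memory summary" this hunk refers to can be printed with plain PyTorch; a small sketch (placing it after a few training steps is my assumption, not part of the diff):

    import torch

    # After a few training steps, print the allocator statistics for this device.
    print(torch.cuda.memory_summary())
    # Or just the peak allocation, in GiB.
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")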

@@ -221,7 +221,7 @@ You can configure the following options to trade-off memory for speed:
 
 Here is the memory and speed impact for each option when configured in our example code:
 
-.. list-table::
+.. list-table:: Numbers were produced with A100 40GB GPUs, Lightning 2.1 and PyTorch 2.1.
    :widths: 25 25 25 25 25
    :header-rows: 1
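As a rough illustration of the kind of option that table covers, here is a sketch assuming the ``sharding_strategy`` argument of ``FSDPStrategy`` in Lightning 2.1 (check the rendered page for the options it actually lists):

    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # "SHARD_GRAD_OP" shards gradients and optimizer states but keeps parameters
    # gathered after the forward pass: more memory than "FULL_SHARD", less
    # communication per step.
    strategy = FSDPStrategy(sharding_strategy="SHARD_GRAD_OP")
    fabric = L.Fabric(devices=8, strategy=strategy)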

@@ -275,6 +275,9 @@ This is typically your transformer block (including attention + feed-forward):
 fabric = L.Fabric(..., strategy=strategy)
 
 
+As in our example, it is typical to set ``activation_checkpointing_policy`` to the same value as ``auto_wrap_policy``.
+
+
 Offload parameters to CPU
 =========================
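A sketch of what the added sentence describes, reusing one policy set for both arguments (``nn.TransformerEncoderLayer`` is an illustrative layer class, not necessarily the one used in the docs example):

    import torch.nn as nn
    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # One set of layer classes drives both sharding (wrapping) and
    # activation checkpointing of exactly those layers.
    policy = {nn.TransformerEncoderLayer}
    strategy = FSDPStrategy(
        auto_wrap_policy=policy,
        activation_checkpointing_policy=policy,
    )
    fabric = L.Fabric(devices=2, strategy=strategy)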

@@ -290,7 +293,7 @@ The drawback is a much slower training speed due to the added communication betw
 You should use this only if you have enough CPU memory and other scaling methods don’t give you enough memory savings.
 In our example, we see a 4x memory saving, but a 10x increase in iteration time:
 
-.. list-table::
+.. list-table:: Numbers were produced with A100 40GB GPUs, Lightning 2.1 and PyTorch 2.1.
    :widths: 25 25 25 25
    :header-rows: 1
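The CPU offloading discussed in this hunk is a single flag on the strategy; a minimal sketch (assuming the ``cpu_offload`` argument of ``FSDPStrategy`` in Lightning 2.1):

    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # Parameters and gradients are kept on CPU when not needed on the GPU,
    # saving GPU memory at the cost of much slower iterations.
    strategy = FSDPStrategy(cpu_offload=True)
    fabric = L.Fabric(devices=2, strategy=strategy)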
