docs/source-fabric/advanced/model_parallel/fsdp.rst (11 additions, 8 deletions)
@@ -2,19 +2,19 @@
 Training models with billions of parameters
 ###########################################

-Use Fully Sharded Data Parallel (FSDP) to train large models with billions or trillions of parameters efficiently on multiple GPUs and across multiple machines.
+Use Fully Sharded Data Parallel (FSDP) to train large models with billions of parameters efficiently on multiple GPUs and across multiple machines.

 .. note:: This is an experimental feature.


 Today, large models with billions of parameters are trained with many GPUs across several machines in parallel.
-Even a single A100 GPU with 80 GB of VRAM (the biggest today) is not enough to train just a 30B parameter model (even with batch size 1 and 16-bit precision).
+Even a single H100 GPU with 80 GB of VRAM (the biggest today) is not enough to train just a 30B parameter model (even with batch size 1 and 16-bit precision).
 The memory consumption for training is generally made up of

 1. the model parameters,
-2. the optimizer states (e.g., Adam has two additional exponential averages per parameter),
-3. the layer activations (forward) and
-4. the gradients (backward).
+2. the layer activations (forward) and
+3. the gradients (backward).
+4. the optimizer states (e.g., Adam has two additional exponential averages per parameter),

 |

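To ground the hunk above, here is a minimal sketch of enabling FSDP through Fabric so that the parameters, gradients, and optimizer states listed in points 1-4 are sharded across GPUs rather than replicated. The layer class, model size, precision, and device count are illustrative placeholders, assuming Lightning 2.1 and PyTorch 2.1 as referenced later in this file.

.. code-block:: python

    import torch
    import torch.nn as nn
    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # Shard at the granularity of each transformer layer instead of replicating
    # the whole model on every GPU (layer class and sizes are placeholders).
    strategy = FSDPStrategy(auto_wrap_policy={nn.TransformerEncoderLayer})
    fabric = L.Fabric(accelerator="cuda", devices=2, precision="bf16-mixed", strategy=strategy)
    fabric.launch()

    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=12,
    )
    model = fabric.setup_module(model)                         # wraps and shards with FSDP
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer states get sharded too
    optimizer = fabric.setup_optimizers(optimizer)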
@@ -149,7 +149,7 @@ We can specify a list of layer classes in the **wrapping policy** to inform FSDP
 Verify that FSDP works with your model by comparing the peak memory usage printed in the CUDA memory summary (see example above) with regular DDP training.
 You should see a decrease in allocated memory and a slight increase in iteration time:

-.. list-table::
+.. list-table:: Numbers were produced with A100 40GB GPUs, Lightning 2.1 and PyTorch 2.1.
    :widths: 25 25 25
    :header-rows: 1

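A short sketch of the comparison described in this hunk, assuming a ``fabric`` object and training loop like the ones in the surrounding example; ``torch.cuda.memory_summary()`` is the standard PyTorch call for the CUDA memory summary mentioned here.

.. code-block:: python

    import torch

    # After a few training iterations, inspect peak memory on rank 0 and compare
    # the numbers between a DDP run and an FSDP run of the same model.
    # `fabric` comes from the surrounding example.
    fabric.print(torch.cuda.memory_summary())
    fabric.print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")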
@@ -221,7 +221,7 @@ You can configure the following options to trade-off memory for speed:

 Here is the memory and speed impact for each option when configured in our example code:

-.. list-table::
+.. list-table:: Numbers were produced with A100 40GB GPUs, Lightning 2.1 and PyTorch 2.1.
    :widths: 25 25 25 25 25
    :header-rows: 1

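One concrete example of such a memory-for-speed trade-off, as a sketch only: PyTorch FSDP's sharding strategy can be relaxed so that only gradients and optimizer states are sharded while parameters stay replicated. The option values below are standard PyTorch FSDP modes passed through ``FSDPStrategy`` and are not necessarily the exact set of options the docs section enumerates.

.. code-block:: python

    import torch.nn as nn
    from lightning.fabric.strategies import FSDPStrategy

    # "FULL_SHARD" (default) shards parameters, gradients, and optimizer states;
    # "SHARD_GRAD_OP" keeps parameters replicated and shards only gradients and
    # optimizer states; "NO_SHARD" behaves like plain DDP.
    strategy = FSDPStrategy(
        sharding_strategy="SHARD_GRAD_OP",
        auto_wrap_policy={nn.TransformerEncoderLayer},  # placeholder layer class
    )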
@@ -275,6 +275,9 @@ This is typically your transformer block (including attention + feed-forward):
     fabric = L.Fabric(..., strategy=strategy)


+As in our example, it is typical to set the ``activation_checkpointing_policy`` the same as ``auto_wrap_policy``.
+
+
 Offload parameters to CPU
 =========================

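A sketch of the recommendation added in this hunk, reusing the same (placeholder) layer class for both policies so that every sharded block also recomputes its activations during the backward pass instead of storing them.

.. code-block:: python

    import torch.nn as nn
    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # Use one policy for both sharding and activation checkpointing, as suggested above.
    policy = {nn.TransformerEncoderLayer}  # placeholder for your transformer block class
    strategy = FSDPStrategy(
        auto_wrap_policy=policy,
        activation_checkpointing_policy=policy,
    )
    fabric = L.Fabric(accelerator="cuda", devices=2, strategy=strategy)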
@@ -290,7 +293,7 @@ The drawback is a much slower training speed due to the added communication betw
 You should use this only if you have enough CPU memory and other scaling methods don’t give you enough memory savings.
 In our example, we see a 4x memory saving, but a 10x increase in iteration time:

-.. list-table::
+.. list-table:: Numbers were produced with A100 40GB GPUs, Lightning 2.1 and PyTorch 2.1.
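A sketch of enabling the CPU offloading described in this hunk; the device count and layer class are placeholders, and the 4x memory / 10x time figures above come from the docs' own example, not from this snippet.

.. code-block:: python

    import torch.nn as nn
    import lightning as L
    from lightning.fabric.strategies import FSDPStrategy

    # Offload sharded parameters (and with them the optimizer states) to CPU memory.
    # Expect a large memory saving but much slower iterations due to CPU-GPU transfers.
    strategy = FSDPStrategy(
        auto_wrap_policy={nn.TransformerEncoderLayer},  # placeholder layer class
        cpu_offload=True,
    )
    fabric = L.Fabric(accelerator="cuda", devices=2, strategy=strategy)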