docs/source/en/training/distributed_inference.md
17 additions & 10 deletions
@@ -240,18 +240,21 @@ By selectively loading and unloading the models you need at a given stage and sh
## Context parallelism
-[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) reduces memory by splitting input sequences across multiple GPUs. Each GPU processes its own slice of the sequence.
+[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) splits input sequences across multiple GPUs to reduce memory usage. Each GPU processes its own slice of the sequence.
-The key (K) and value (V) representations are communicated between devices with [Ring Attention](https://huggingface.co/papers/2310.01889) to ensure each split can see every other token's K/V. In Ring Attention, each GPU computes attention for it's local K/V and passes it to the next GPU in the ring. This way, no single GPU has to hold the full sequence and reduces communication latency.
+Key (K) and value (V) representations are communicated between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
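The schedule is easier to see in a toy, single-process simulation. The sketch below is illustrative only and is not diffusers code: each "rank" keeps its own query shard, the K/V blocks rotate around the ring, and softmax statistics are accumulated online so the result matches full attention.

```py
# Toy single-process illustration of the Ring Attention schedule (not diffusers
# internals): K/V blocks rotate around a ring of "ranks" while each rank keeps
# only its own query shard, and softmax statistics are accumulated online.
import torch

world_size, seq_len, dim = 4, 16, 8
shard = seq_len // world_size
q = torch.randn(world_size, shard, dim)  # query shard held by each rank
k = torch.randn(world_size, shard, dim)  # key shard held by each rank
v = torch.randn(world_size, shard, dim)  # value shard held by each rank

k_blocks, v_blocks = list(k), list(v)               # block currently held by each rank
m = torch.full((world_size, shard), float("-inf"))  # running max of attention logits
l = torch.zeros(world_size, shard)                  # running softmax denominator
acc = torch.zeros(world_size, shard, dim)           # running weighted sum of values

for step in range(world_size):
    for rank in range(world_size):
        # attend to whichever K/V block this rank currently holds
        s = q[rank] @ k_blocks[rank].transpose(-1, -2) / dim**0.5
        m_new = torch.maximum(m[rank], s.max(dim=-1).values)
        p = torch.exp(s - m_new[:, None])
        scale = torch.exp(m[rank] - m_new)
        l[rank] = l[rank] * scale + p.sum(dim=-1)
        acc[rank] = acc[rank] * scale[:, None] + p @ v_blocks[rank]
        m[rank] = m_new
    # "send" each K/V block to the next rank in the ring
    k_blocks = [k_blocks[(r - 1) % world_size] for r in range(world_size)]
    v_blocks = [v_blocks[(r - 1) % world_size] for r in range(world_size)]

out = acc / l[..., None]  # each rank's output for its own sequence slice

# sanity check against full, non-distributed attention
full = torch.softmax(
    q.reshape(seq_len, dim) @ k.reshape(seq_len, dim).T / dim**0.5, dim=-1
) @ v.reshape(seq_len, dim)
assert torch.allclose(out.reshape(seq_len, dim), full, atol=1e-5)
```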
-Call [`parallelize`] on the model and pass a [`ContextParallelConfig`]. This config supports the `ring_degree` argument which determines the number of devices to use for Ring Attention.
+Call [`parallelize`] on the model and pass a [`ContextParallelConfig`]. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
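As a rough sketch of what that looks like in practice (the import path, the `config` keyword, the `AutoModel` loading pattern, and the example checkpoint are assumptions here, not something this diff specifies):

```py
# Hedged sketch: assumes ContextParallelConfig is exported from the top-level
# diffusers namespace and that parallelize() accepts it via a `config` keyword.
# Launch with one process per GPU, e.g. `torchrun --nproc-per-node=2 script.py`.
import torch
import torch.distributed as dist
from diffusers import AutoModel, ContextParallelConfig

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

transformer = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",  # example checkpoint; any supported transformer works
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
# split each input sequence across 2 devices with Ring Attention
transformer.parallelize(config=ContextParallelConfig(ring_degree=2))
```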
-Use the [`~ModelMixin.set_attention_backend`] method to use a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
+Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
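Continuing the sketch above, the switch is a single call; the `"flash"` string is an assumption for the FlashAttention backend, so check the attention backends table for the exact supported names:

```py
# Hedged: "flash" is assumed to be the FlashAttention backend identifier.
transformer.set_attention_backend("flash")
```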
-Pass your pipelines to [`~ModelMixin.enable_parallelism`] as a context manager to activate and coordinate context parallelism.
+Refer to the table below for the supported attention backends enabled by [`~ModelMixin.set_attention_backend`].
-> [!TIP]
-> Context parallelism currently supports the cuDNN, FlashAttention-2, and SageAttention backends.
+Pass your pipeline to [`~ModelMixin.enable_parallelism`] as a context manager to activate and coordinate context parallelism.
```py
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
@@ -293,9 +300,9 @@ finally:
### Ulysses Attention
-[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* (every device sends/receives data to every other device) so that each GPU ends up with all the tokens for only a subset of the attention heads. Each GPU computes attention locally on all tokens for its head and then performs another all-to-all to regroup the results by tokens, making it ready for the next layer.
+[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* communication (every device sends/receives data to every other device). Each GPU ends up with all tokens for only a subset of attention heads. Each GPU computes attention locally on all tokens for its heads, then performs another all-to-all to regroup results by tokens for the next layer.
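A toy, single-process view of that exchange (purely illustrative, not diffusers code): the first all-to-all is a reshuffle from "all heads for my tokens" to "all tokens for my heads", and the second all-to-all undoes it.

```py
# Toy illustration of the Ulysses all-to-all exchanges, simulated in one process.
import torch

world, tokens_per_rank, heads, head_dim = 4, 4, 4, 8
# before: each "rank" holds its token shard with every attention head
x = torch.randn(world, tokens_per_rank, heads, head_dim)

# first all-to-all: each rank ends up with every token for one head
per_head = x.permute(2, 0, 1, 3).reshape(heads, world * tokens_per_rank, head_dim)

# ... attention would run here on per_head[rank] ...

# second all-to-all: regroup results back by tokens for the next layer
back = per_head.reshape(heads, world, tokens_per_rank, head_dim).permute(1, 2, 0, 3)
assert torch.equal(back, x)
```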
-[`ContextParallelConfig`] also supports Ulysses Attention through the `ulysses_degree` argument. This determines the number of devices to use for Ulysses Attention.
+[`ContextParallelConfig`] supports Ulysses Attention through the `ulysses_degree` argument. This determines how many devices to use for Ulysses Attention.
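A hedged sketch of the Ulysses variant, reusing the setup from the earlier sketch (again, the keyword plumbing is an assumption, not taken from this diff):

```py
# Hedged sketch: use 2 devices for Ulysses Attention instead of Ring Attention;
# attention heads are sharded across devices during the attention computation.
transformer.parallelize(config=ContextParallelConfig(ulysses_degree=2))
```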