Commit d7f2e88

feedback
1 parent 9b1b40a commit d7f2e88

2 files changed: +19, -12 lines

docs/source/en/_toctree.yml

Lines changed: 2 additions & 2 deletions
@@ -70,8 +70,6 @@
     title: Reduce memory usage
   - local: optimization/speed-memory-optims
     title: Compiling and offloading quantized models
-  - local: api/parallel
-    title: Parallel inference
   - title: Community optimizations
     sections:
     - local: optimization/pruna
@@ -282,6 +280,8 @@
     title: Outputs
   - local: api/quantization
     title: Quantization
+  - local: api/parallel
+    title: Parallel inference
   - title: Modular
     sections:
     - local: api/modular_diffusers/pipeline

docs/source/en/training/distributed_inference.md

Lines changed: 17 additions & 10 deletions
@@ -240,18 +240,21 @@ By selectively loading and unloading the models you need at a given stage and sh
 
 ## Context parallelism
 
-[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) reduces memory by splitting input sequences across multiple GPUs. Each GPU processes its own slice of the sequence.
+[Context parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) splits input sequences across multiple GPUs to reduce memory usage. Each GPU processes its own slice of the sequence.
 
-The key (K) and value (V) representations are communicated between devices with [Ring Attention](https://huggingface.co/papers/2310.01889) to ensure each split can see every other token's K/V. In Ring Attention, each GPU computes attention for it's local K/V and passes it to the next GPU in the ring. This way, no single GPU has to hold the full sequence and reduces communication latency.
+Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
 
-Call [`parallelize`] on the model and pass a [`ContextParallelConfig`]. This config supports the `ring_degree` argument which determines the number of devices to use for Ring Attention.
+Call [`parallelize`] on the model and pass a [`ContextParallelConfig`]. The config supports the `ring_degree` argument that determines how many devices to use for Ring Attention.
 
-Use the [`~ModelMixin.set_attention_backend`] method to use a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
+Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized [attention backend](../optimization/attention_backends). The example below uses the FlashAttention backend.
 
-Pass your pipelines to [`~ModelMixin.enable_parallelism`] as a context manager to activate and coordinate context parallelism.
+Refer to the table below for the supported attention backends enabled by [`~ModelMixin.set_attention_backend`].
 
-> [!TIP]
-> Context parallelism currently supports the cuDNN, FlashAttention-2, and SageAttention backends.
+| attention family | support type |
+|---|---|
+| native cuDNN | inference and training |
+| FlashAttention-2/3 | inference and training |
+| SageAttention | inference |
 
 ```py
 import torch
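
To make the ring pattern described in this hunk concrete, here is a toy single-process sketch of the K/V rotation only; the shapes and `world_size` are arbitrary assumptions, no real attention math is computed, and this is not the library's multi-GPU kernel:

```py
# Toy illustration of the Ring Attention communication pattern (single process, no GPUs).
import torch

world_size, seq_len, dim = 4, 16, 8
kv_blocks = list(torch.randn(seq_len, dim).chunk(world_size, dim=0))  # one K/V shard per "GPU"

seen = [set() for _ in range(world_size)]
for step in range(world_size):
    for rank in range(world_size):
        # After `step` passes around the ring, rank holds the shard that originated
        # on rank (rank - step) % world_size.
        src = (rank - step) % world_size
        local_kv = kv_blocks[src]  # partial attention of local queries against this shard would go here
        seen[rank].add(src)

# Every rank has seen every K/V shard without ever holding the full sequence at once.
assert all(s == set(range(world_size)) for s in seen)
```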
@@ -267,7 +270,11 @@ try:
 
 pipeline.transformer.parallelize(config=ContextParallelConfig(ring_degree=2))
 pipeline.transformer.set_attention_backend("flash")
-
+```
+
+Pass your pipeline to [`~ModelMixin.enable_parallelism`] as a context manager to activate and coordinate context parallelism.
+
+```py
 prompt = """
 cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
 highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
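
Put together, the edited section describes the following flow. A minimal sketch, assuming two GPUs launched with `torchrun --nproc_per_node=2 demo.py`; the `"<model-id>"` checkpoint, the generic `DiffusionPipeline` class, the image-style output, and the import path for `enable_parallelism` are assumptions rather than taken from the diff:

```py
# Minimal context-parallel inference sketch. Launch with: torchrun --nproc_per_node=2 demo.py
import torch
import torch.distributed as dist
from diffusers import ContextParallelConfig, DiffusionPipeline, enable_parallelism  # import path assumed

dist.init_process_group("nccl")
rank = dist.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())

try:
    # "<model-id>" is a placeholder checkpoint, not one named in the diff.
    pipeline = DiffusionPipeline.from_pretrained("<model-id>", torch_dtype=torch.bfloat16).to(device)

    # Ring Attention over 2 devices, using the FlashAttention backend.
    pipeline.transformer.parallelize(config=ContextParallelConfig(ring_degree=2))
    pipeline.transformer.set_attention_backend("flash")

    # Coordinate context parallelism for the duration of the generation call.
    with enable_parallelism(pipeline):
        image = pipeline("cinematic film still of a cat sipping a margarita in a pool").images[0]

    if rank == 0:
        image.save("output.png")
finally:
    dist.destroy_process_group()
```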
@@ -293,9 +300,9 @@ finally:
 
 ### Ulysses Attention
 
-[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* (every device sends/receives data to every other device) so that each GPU ends up with all the tokens for only a subset of the attention heads. Each GPU computes attention locally on all tokens for its head and then performs another all-to-all to regroup the results by tokens, making it ready for the next layer.
+[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* communication (every device sends/receives data to every other device). Each GPU ends up with all tokens for only a subset of attention heads. Each GPU computes attention locally on all tokens for its head, then performs another all-to-all to regroup results by tokens for the next layer.
 
-[`ContextParallelConfig`] also supports Ulysses Attention through the `ulysses_degree` argument. This determines the number of devices to use for Ulysses Attention.
+[`ContextParallelConfig`] supports Ulysses Attention through the `ulysses_degree` argument. This determines how many devices to use for Ulysses Attention.
 
 ```py
 pipeline.transformer.parallelize(config=ContextParallelConfig(ulysses_degree=2))
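
A single-process sketch of the all-to-all regrouping that the Ulysses Attention paragraph describes (tokens-split layout to heads-split layout); the shapes and `world_size` are arbitrary assumptions, and this is not the library's multi-GPU implementation:

```py
# Toy illustration of the Ulysses all-to-all: each "GPU" starts with a slice of the
# sequence over all heads, and ends with the full sequence over a subset of heads.
import torch

world_size, seq_len, num_heads, head_dim = 2, 8, 4, 16

full = torch.randn(seq_len, num_heads, head_dim)
per_gpu_tokens = list(full.chunk(world_size, dim=0))   # world_size x [seq/ws, heads, dim]

per_gpu_heads = [
    torch.cat([shard.chunk(world_size, dim=1)[g] for shard in per_gpu_tokens], dim=0)
    for g in range(world_size)
]                                                       # world_size x [seq, heads/ws, dim]

assert per_gpu_heads[0].shape == (seq_len, num_heads // world_size, head_dim)
```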
