* improve distributed inference cp docs.
* Apply suggestions from code review
Co-authored-by: Steven Liu <[email protected]>
---------
Co-authored-by: Steven Liu <[email protected]>
docs/source/en/training/distributed_inference.md (67 additions, 24 deletions)
@@ -237,6 +237,8 @@ By selectively loading and unloading the models you need at a given stage and sh

Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized attention backend. Refer to this [table](../optimization/attention_backends#available-backends) for a complete list of available backends.

Most attention backends are compatible with context parallelism. Open an [issue](https://github.com/huggingface/diffusers/issues/new) if a backend is not compatible.
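For illustration, a minimal sketch of switching the backend on a loaded transformer might look like the following; the checkpoint and the `"flash"` backend name are assumptions, so check the linked table for what your installation supports.

```py
import torch
from diffusers import AutoModel

# Load only the transformer; the checkpoint below is just an example.
transformer = AutoModel.from_pretrained(
    "Qwen/Qwen-Image", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Switch the attention backend ("flash" is an assumed example; see the table linked above).
transformer.set_attention_backend("flash")
```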
### Ring Attention

Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention over its local K/V block and then passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.

@@ -245,40 +247,60 @@ Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transf
```py
import torch
from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig
```
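The rest of the example is not shown in this hunk. As a rough, hedged sketch, a complete script built around these imports could look like the code below; the `ring_degree` argument, the checkpoint name, the `"flash"` backend string, and the prompt/seed handling are assumptions rather than the exact script from the docs.

```py
import torch
from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig

# torchrun sets the rank/world-size environment variables read here.
torch.distributed.init_process_group("nccl")
rank = torch.distributed.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

try:
    # Assumption: ContextParallelConfig takes a ring_degree argument for Ring Attention
    # and is passed through the transformer's parallel_config at load time.
    transformer = AutoModel.from_pretrained(
        "Qwen/Qwen-Image",
        subfolder="transformer",
        torch_dtype=torch.bfloat16,
        parallel_config=ContextParallelConfig(ring_degree=2),
    )
    pipeline = QwenImagePipeline.from_pretrained(
        "Qwen/Qwen-Image", transformer=transformer, torch_dtype=torch.bfloat16
    ).to(device)
    pipeline.transformer.set_attention_backend("flash")  # assumed backend name

    # Seed every rank identically so all ranks start from the same latents.
    generator = torch.Generator().manual_seed(0)
    image = pipeline(
        "a photo of a cat sitting by a window", num_inference_steps=30, generator=generator
    ).images[0]

    if rank == 0:
        image.save("output.png")
finally:
    if torch.distributed.is_initialized():
        torch.distributed.destroy_process_group()
```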
The script above needs to be run with a PyTorch-compatible distributed launcher, such as [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html). Set `--nproc-per-node` to the number of available GPUs.

```shell
torchrun --nproc-per-node 2 above_script.py
```
### Ulysses Attention

[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* communication (every device sends data to and receives data from every other device). Each GPU ends up with all tokens for only a subset of attention heads. Each GPU computes attention locally over all tokens for its subset of heads, then performs another all-to-all to regroup the results by tokens for the next layer.

@@ -288,5 +310,26 @@ finally:
Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].
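As a minimal sketch of that call (the `config` keyword and the `ulysses_degree` argument are assumptions here, and the rest of the setup mirrors the Ring Attention example above):

```py
import torch
from diffusers import AutoModel, ContextParallelConfig

# Launch with torchrun; the process group must exist before enabling parallelism.
torch.distributed.init_process_group("nccl")
torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

transformer = AutoModel.from_pretrained(
    "Qwen/Qwen-Image", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Assumption: ulysses_degree selects Ulysses Attention and the config is passed
# via a `config` keyword; check the API reference for the exact signature.
transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
```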