
Commit 3c04cb8

chengzeyi and stevhliu authored
Update docs/source/en/optimization/para_attn.md
Co-authored-by: Steven Liu <[email protected]>
1 parent 03abeda commit 3c04cb8

File tree

1 file changed: +8 -9 lines changed


docs/source/en/optimization/para_attn.md

Lines changed: 8 additions & 9 deletions
@@ -285,15 +285,14 @@ However, models like `FLUX.1-dev` can benefit a lot from quantization and compil
 </hfoption>
 </hfoptions>
 
-### Context Parallelism
-
-A lot faster than before, right? But we are not satisfied with the speedup we have achieved so far.
-If we want to accelerate the inference further, we can use context parallelism to parallelize the inference.
-Libraries like [xDit](https://github.com/xdit-project/xDiT) and our [ParaAttention](https://github.com/chengzeyi/ParaAttention) provide ways to scale up the inference with multiple GPUs.
-In ParaAttention, we design our API in a compositional way so that we can combine context parallelism with first block cache and dynamic quantization all together.
-We provide very detailed instructions and examples of how to scale up the inference with multiple GPUs in our ParaAttention repository.
-Users can easily launch the inference with multiple GPUs by calling `torchrun`.
-If there is a need to make the inference process persistent and serviceable, it is suggested to use `torch.multiprocessing` to write your own inference processor, which can eliminate the overhead of launching the process and loading and recompiling the model.
+## Context Parallelism
+
+Context Parallelism parallelizes inference and scales with multiple GPUs. The ParaAttention compositional design allows you to combine Context Parallelism with First Block Cache and dynamic quantization.
+
+> [!TIP]
+> Refer to the [ParaAttention](https://github.com/chengzeyi/ParaAttention/tree/main) repository for detailed instructions and examples of how to scale inference with multiple GPUs.
+
+If the inference process needs to be persistent and serviceable, it is suggested to use [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html) to write your own inference processor. This can eliminate the overhead of launching the process and loading and recompiling the model.
 
 <hfoptions id="context-parallelism">
 <hfoption id="FLUX-1.dev">
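
As an illustration of the persistent inference processor described in the added text, here is a minimal sketch using [torch.multiprocessing](https://pytorch.org/docs/stable/multiprocessing.html). It assumes a plain diffusers `FluxPipeline` on a single GPU; the ParaAttention context parallel setup and the `torchrun` launch are omitted, and the model ID, queue protocol, and compile settings are illustrative assumptions rather than part of this commit.

```python
import torch
import torch.multiprocessing as mp
from diffusers import FluxPipeline


def worker(requests, results):
    # Load and (optionally) compile the model once; every request after the
    # first reuses the warm pipeline, avoiding reload/recompile overhead.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

    while True:
        prompt = requests.get()
        if prompt is None:  # shutdown signal
            break
        image = pipe(prompt, num_inference_steps=28).images[0]
        results.put(image)


if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA in child processes
    requests, results = mp.Queue(), mp.Queue()
    proc = mp.Process(target=worker, args=(requests, results))
    proc.start()

    # The worker stays alive between requests, so only the first call pays
    # the model loading and compilation cost.
    requests.put("A cat holding a sign that says hello world")
    results.get().save("flux.png")

    requests.put(None)  # stop the worker
    proc.join()
```

A worker like this loads and compiles the model once and then serves requests from a queue, which is what removes the per-request launch, load, and recompile overhead. With context parallelism, a similar loop would typically run in every process launched by `torchrun --nproc_per_node=<num_gpus>`.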

0 commit comments
