
Commit f2d1133

init
1 parent 5796735 commit f2d1133

File tree

1 file changed: +31, -152 lines changed

docs/source/en/training/distributed_inference.md

Lines changed: 31 additions & 152 deletions
@@ -12,51 +12,55 @@ specific language governing permissions and limitations under the License.

# Distributed inference

-On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.
+Distributed inference splits the workload across multiple GPUs. It is a useful technique for fitting larger models in memory, and it can process multiple prompts in parallel for higher throughput.

-This guide will show you how to use 🤗 Accelerate and PyTorch Distributed for distributed inference.
+This guide will show you how to use [Accelerate](https://huggingface.co/docs/accelerate/index) and [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) for distributed inference.

-## 🤗 Accelerate
+## Accelerate

-🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.
+Accelerate is a library designed to simplify inference and training on multiple accelerators by handling the setup, allowing users to focus on their PyTorch code.

-To begin, create a Python file and initialize an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `distributed_state.device` to assign a GPU to each process.
+Install Accelerate with the following command.

-Now use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes.
+```bash
+uv pip install accelerate
+```
+
+Initialize an [`accelerate.PartialState`] class in a Python file to create a distributed environment. The [`accelerate.PartialState`] class handles process management, device control and distribution, and process coordination.
+
+Move the [`DiffusionPipeline`] to [`accelerate.PartialState.device`] to assign a GPU to each process.

```py
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
-    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+    "Qwen/Qwen-Image", torch_dtype=torch.float16
)
distributed_state = PartialState()
pipeline.to(distributed_state.device)
+```
+
+Use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes.

+```py
with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt:
    result = pipeline(prompt).images[0]
    result.save(f"result_{distributed_state.process_index}.png")
```

-Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script:
+Call `accelerate launch` to run the script and use the `--num_processes` argument to set the number of GPUs to use.

```bash
accelerate launch run_distributed.py --num_processes=2
```

-<Tip>
-
-Refer to this minimal example [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for running inference across multiple GPUs. To learn more, take a look at the [Distributed Inference with 🤗 Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
-
-</Tip>
-
## PyTorch Distributed

-PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism.
+PyTorch [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) enables [data parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=data_parallelism), which replicates the same model on each device to process different batches of data in parallel.

-To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a [`DiffusionPipeline`]:
+Import `torch.distributed` and `torch.multiprocessing` into a Python file to set up the distributed process group and to spawn the processes for inference on each GPU.

```py
import torch
@@ -65,20 +69,20 @@ import torch.multiprocessing as mp

from diffusers import DiffusionPipeline

-sd = DiffusionPipeline.from_pretrained(
-    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
+pipeline = DiffusionPipeline.from_pretrained(
+    "Qwen/Qwen-Image", torch_dtype=torch.float16,
)
```

-You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` is 2.
+Create a function for inference with [init_process_group](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group). This function creates a distributed environment with the backend type, the `rank` of the current process, and the `world_size` or number of processes participating (for example, 2 GPUs would be `world_size=2`).

-Move the [`DiffusionPipeline`] to `rank` and use `get_rank` to assign a GPU to each process, where each process handles a different prompt:
+Move the pipeline to `rank` and use `get_rank` to assign a GPU to each process. Each process handles a different prompt.

```py
def run_inference(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

-    sd.to(rank)
+    pipeline.to(rank)

    if torch.distributed.get_rank() == 0:
        prompt = "a dog"
@@ -89,7 +93,7 @@ def run_inference(rank, world_size):
    image.save(f"./{'_'.join(prompt)}.png")
```

-To run the distributed inference, call [`mp.spawn`](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn) to run the `run_inference` function on the number of GPUs defined in `world_size`:
+Use [mp.spawn](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn) to create the number of processes defined in `world_size`.

```py
def main():
@@ -101,139 +105,14 @@ if __name__ == "__main__":
    main()
```

-Once you've completed the inference script, use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script:
+Call `torchrun` to run the inference script and use the `--nproc_per_node` argument to set the number of GPUs to use.

```bash
torchrun run_distributed.py --nproc_per_node=2
```

-> [!TIP]
-> You can use `device_map` within a [`DiffusionPipeline`] to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more.
-
-## Model sharding
-
-Modern diffusion systems such as [Flux](../api/pipelines/flux) are very large and have multiple models. For example, [Flux.1-Dev](https://hf.co/black-forest-labs/FLUX.1-dev) is made up of two text encoders - [T5-XXL](https://hf.co/google/t5-v1_1-xxl) and [CLIP-L](https://hf.co/openai/clip-vit-large-patch14) - a [diffusion transformer](../api/models/flux_transformer), and a [VAE](../api/models/autoencoderkl). With a model this size, it can be challenging to run inference on consumer GPUs.
-
-Model sharding is a technique that distributes models across GPUs when the models don't fit on a single GPU. The example below assumes two 16GB GPUs are available for inference.
-
-Start by computing the text embeddings with the text encoders. Keep the text encoders on two GPUs by setting `device_map="balanced"`. The `balanced` strategy evenly distributes the model on all available GPUs. Use the `max_memory` parameter to allocate the maximum amount of memory for each text encoder on each GPU.
-
-> [!TIP]
-> **Only** load the text encoders for this step! The diffusion transformer and VAE are loaded in a later step to preserve memory.
-
-```py
-from diffusers import FluxPipeline
-import torch
-
-prompt = "a photo of a dog with cat-like look"
-
-pipeline = FluxPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    transformer=None,
-    vae=None,
-    device_map="balanced",
-    max_memory={0: "16GB", 1: "16GB"},
-    torch_dtype=torch.bfloat16
-)
-with torch.no_grad():
-    print("Encoding prompts.")
-    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
-        prompt=prompt, prompt_2=None, max_sequence_length=512
-    )
-```
-
-Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer.
-
-```py
-import gc
-
-def flush():
-    gc.collect()
-    torch.cuda.empty_cache()
-    torch.cuda.reset_max_memory_allocated()
-    torch.cuda.reset_peak_memory_stats()
-
-del pipeline.text_encoder
-del pipeline.text_encoder_2
-del pipeline.tokenizer
-del pipeline.tokenizer_2
-del pipeline
-
-flush()
-```
-
-Load the diffusion transformer next which has 12.5B parameters. This time, set `device_map="auto"` to automatically distribute the model across two 16GB GPUs. The `auto` strategy is backed by [Accelerate](https://hf.co/docs/accelerate/index) and available as a part of the [Big Model Inference](https://hf.co/docs/accelerate/concept_guides/big_model_inference) feature. It starts by distributing a model across the fastest device first (GPU) before moving to slower devices like the CPU and hard drive if needed. The trade-off of storing model parameters on slower devices is slower inference latency.
-
-```py
-from diffusers import AutoModel
-import torch
-
-transformer = AutoModel.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    subfolder="transformer",
-    device_map="auto",
-    torch_dtype=torch.bfloat16
-)
-```
-
-> [!TIP]
-> At any point, you can try `print(pipeline.hf_device_map)` to see how the various models are distributed across devices. This is useful for tracking the device placement of the models. You can also try `print(transformer.hf_device_map)` to see how the transformer model is sharded across devices.
-
-Add the transformer model to the pipeline for denoising, but set the other model-level components like the text encoders and VAE to `None` because you don't need them yet.
-
-```py
-pipeline = FluxPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
-    text_encoder=None,
-    text_encoder_2=None,
-    tokenizer=None,
-    tokenizer_2=None,
-    vae=None,
-    transformer=transformer,
-    torch_dtype=torch.bfloat16
-)
-
-print("Running denoising.")
-height, width = 768, 1360
-latents = pipeline(
-    prompt_embeds=prompt_embeds,
-    pooled_prompt_embeds=pooled_prompt_embeds,
-    num_inference_steps=50,
-    guidance_scale=3.5,
-    height=height,
-    width=width,
-    output_type="latent",
-).images
-```
-
-Remove the pipeline and transformer from memory as they're no longer needed.
-
-```py
-del pipeline.transformer
-del pipeline
-
-flush()
-```
-
-Finally, decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU.
-
-```py
-from diffusers import AutoencoderKL
-from diffusers.image_processor import VaeImageProcessor
-import torch
-
-vae = AutoencoderKL.from_pretrained(ckpt_id, subfolder="vae", torch_dtype=torch.bfloat16).to("cuda")
-vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)
-image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)
-
-with torch.no_grad():
-    print("Running decoding.")
-    latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
-    latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor
-
-    image = vae.decode(latents, return_dict=False)[0]
-    image = image_processor.postprocess(image, output_type="pil")
-    image[0].save("split_transformer.png")
-```
+## Resources

-By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs.
+- Take a look at this [script](https://gist.github.com/sayakpaul/cfaebd221820d7b43fae638b4dfa01ba) for a minimal example of distributed inference with Accelerate.
+- For more details, check out Accelerate's [Distributed inference](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.
+- The `device_map` argument assigns models or an entire pipeline to devices. Refer to the [device placement](../using-diffusers/loading#device-placement) docs for more information.
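
The `split_between_processes` example in the updated guide gives exactly one prompt per process. The same utility also accepts more prompts than processes, handing each rank a sublist, which is how the intro's higher-throughput claim plays out in practice. Below is a minimal sketch of that pattern, assuming two GPUs; the four-prompt list and output filenames are illustrative, not from the commit.

```py
# Sketch: more prompts than processes. split_between_processes gives each rank a
# sublist of the prompts, so every GPU works through its own share.
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.float16
)
distributed_state = PartialState()
pipeline.to(distributed_state.device)

prompts = ["a dog", "a cat", "a bird", "a fish"]  # illustrative prompt list

with distributed_state.split_between_processes(prompts) as subset:
    for i, prompt in enumerate(subset):
        image = pipeline(prompt).images[0]
        image.save(f"result_{distributed_state.process_index}_{i}.png")
```

When launched with `accelerate launch` and `--num_processes=2`, each of the two processes receives two prompts and saves its images under its own process index.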
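
The diff's context lines elide the body of `main()`, where `mp.spawn` is actually called. For reference, here is a minimal end-to-end sketch of the PyTorch Distributed flow, assuming two GPUs; the `world_size` value, rendezvous settings, prompts, and output filenames are illustrative, and the save logic is simplified relative to the documented snippet.

```py
# Sketch of the full PyTorch Distributed flow: load the pipeline, spawn one process
# per GPU, and have each rank generate its own prompt.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.float16
)

def run_inference(rank, world_size):
    # Join the process group; each spawned process has a unique rank.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    pipeline.to(rank)

    # Give each rank a different prompt.
    prompt = "a dog" if rank == 0 else "a cat"
    image = pipeline(prompt).images[0]
    image.save(f"result_rank{rank}.png")

    dist.destroy_process_group()

def main():
    world_size = 2  # number of GPUs to use
    # Rendezvous settings so init_process_group works when starting with plain `python`.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # mp.spawn launches world_size copies of run_inference(rank, world_size).
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
```

Because `mp.spawn` creates the worker processes itself, a sketch like this can also be started with plain `python`; `torchrun` is the launcher the guide documents.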
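
The final Resources bullet points to `device_map` without an example. A brief sketch follows, assuming the pipeline-level `"balanced"` strategy that appears elsewhere in this diff; the checkpoint and output filename are illustrative.

```py
# Sketch: let diffusers place the pipeline's components (text encoders, transformer,
# VAE) across the visible GPUs instead of moving the whole pipeline to one device.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.float16,
    device_map="balanced",  # evenly distribute components over available GPUs
)

print(pipeline.hf_device_map)  # inspect where each component was placed

image = pipeline("a dog").images[0]
image.save("device_map_result.png")
```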
