Commit 2b0a7f0

Merge branch 'main' into lora-tests-cleanup
2 parents: 0697afb + 066ea37

File tree

49 files changed, +861 −239 lines


docs/source/en/_toctree.yml

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@
 - local: using-diffusers/overview_techniques
   title: Overview
 - local: training/distributed_inference
-  title: Distributed inference with multiple GPUs
+  title: Distributed inference
 - local: using-diffusers/merge_loras
   title: Merge LoRAs
 - local: using-diffusers/scheduler_features

docs/source/en/api/schedulers/overview.md

Lines changed: 6 additions & 5 deletions
@@ -46,11 +46,12 @@ Many schedulers are implemented from the [k-diffusion](https://github.com/crowso
 | N/A | [`UniPCMultistepScheduler`] | |

 ## Noise schedules and schedule types
-| A1111/k-diffusion | 🤗 Diffusers                             |
-|-------------------|------------------------------------------|
-| Karras            | init with `use_karras_sigmas=True`       |
-| sgm_uniform       | init with `timestep_spacing="trailing"`  |
-| simple            | init with `timestep_spacing="trailing"`  |
+| A1111/k-diffusion | 🤗 Diffusers                                                           |
+|-------------------|------------------------------------------------------------------------|
+| Karras            | init with `use_karras_sigmas=True`                                     |
+| sgm_uniform       | init with `timestep_spacing="trailing"`                                |
+| simple            | init with `timestep_spacing="trailing"`                                |
+| exponential       | init with `timestep_spacing="linspace"`, `use_exponential_sigmas=True` |

 All schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers.
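
As a quick illustration of the table above (not part of this diff), the schedule types map onto scheduler init flags via `from_config`. A minimal sketch, assuming a Stable Diffusion checkpoint and `EulerDiscreteScheduler`; any compatible scheduler class works the same way:

```py
import torch
from diffusers import DiffusionPipeline, EulerDiscreteScheduler

# Illustrative checkpoint; any pipeline with a swappable scheduler works.
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# "Karras" schedule: rebuild the scheduler from the existing config with use_karras_sigmas=True.
pipeline.scheduler = EulerDiscreteScheduler.from_config(
    pipeline.scheduler.config, use_karras_sigmas=True
)

# "sgm_uniform" / "simple": use trailing timestep spacing instead.
# pipeline.scheduler = EulerDiscreteScheduler.from_config(
#     pipeline.scheduler.config, timestep_spacing="trailing"
# )
```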

docs/source/en/community_projects.md

Lines changed: 4 additions & 0 deletions
@@ -75,4 +75,8 @@ Happy exploring, and thank you for being part of the Diffusers community!
   <td><a href="https://github.com/cumulo-autumn/StreamDiffusion"> StreamDiffusion </a></td>
   <td>A Pipeline-Level Solution for Real-Time Interactive Generation</td>
 </tr>
+<tr style="border-top: 2px solid black">
+  <td><a href="https://github.com/Netwrck/stable-diffusion-server"> Stable Diffusion Server </a></td>
+  <td>A server configured for Inpainting/Generation/img2img with one stable diffusion model</td>
+</tr>
 </table>

docs/source/en/training/distributed_inference.md

Lines changed: 129 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Distributed inference with multiple GPUs
+# Distributed inference

 On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.

@@ -109,3 +109,131 @@ torchrun run_distributed.py --nproc_per_node=2

> [!TIP]
> You can use `device_map` within a [`DiffusionPipeline`] to distribute its model-level components on multiple devices. Refer to the [Device placement](../tutorials/inference_with_big_models#device-placement) guide to learn more.

## Model sharding

Modern diffusion systems such as [Flux](../api/pipelines/flux) are very large and have multiple models. For example, [Flux.1-Dev](https://hf.co/black-forest-labs/FLUX.1-dev) is made up of two text encoders - [T5-XXL](https://hf.co/google/t5-v1_1-xxl) and [CLIP-L](https://hf.co/openai/clip-vit-large-patch14) - a [diffusion transformer](../api/models/flux_transformer), and a [VAE](../api/models/autoencoderkl). With a model this size, it can be challenging to run inference on consumer GPUs.

Model sharding is a technique that distributes models across GPUs when the models don't fit on a single GPU. The example below assumes two 16GB GPUs are available for inference.

Start by computing the text embeddings with the text encoders. Keep the text encoders on two GPUs by setting `device_map="balanced"`. The `balanced` strategy evenly distributes the model on all available GPUs. Use the `max_memory` parameter to allocate the maximum amount of memory for each text encoder on each GPU.

> [!TIP]
> **Only** load the text encoders for this step! The diffusion transformer and VAE are loaded in a later step to preserve memory.

```py
from diffusers import FluxPipeline
import torch

prompt = "a photo of a dog with cat-like look"

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=None,
    vae=None,
    device_map="balanced",
    max_memory={0: "16GB", 1: "16GB"},
    torch_dtype=torch.bfloat16
)
with torch.no_grad():
    print("Encoding prompts.")
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt, prompt_2=None, max_sequence_length=512
    )
```

Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer.

```py
import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

del pipeline.text_encoder
del pipeline.text_encoder_2
del pipeline.tokenizer
del pipeline.tokenizer_2
del pipeline

flush()
```

Load the diffusion transformer next which has 12.5B parameters. This time, set `device_map="auto"` to automatically distribute the model across two 16GB GPUs. The `auto` strategy is backed by [Accelerate](https://hf.co/docs/accelerate/index) and available as a part of the [Big Model Inference](https://hf.co/docs/accelerate/concept_guides/big_model_inference) feature. It starts by distributing a model across the fastest device first (GPU) before moving to slower devices like the CPU and hard drive if needed. The trade-off of storing model parameters on slower devices is slower inference latency.

```py
from diffusers import FluxTransformer2DModel
import torch

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```

> [!TIP]
> At any point, you can try `print(pipeline.hf_device_map)` to see how the various models are distributed across devices. This is useful for tracking the device placement of the models.

Add the transformer model to the pipeline for denoising, but set the other model-level components like the text encoders and VAE to `None` because you don't need them yet.

```py
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    vae=None,
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

print("Running denoising.")
height, width = 768, 1360
latents = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=50,
    guidance_scale=3.5,
    height=height,
    width=width,
    output_type="latent",
).images
```

Remove the pipeline and transformer from memory as they're no longer needed.

```py
del pipeline.transformer
del pipeline

flush()
```

Finally, decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU.

```py
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
import torch

vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae", torch_dtype=torch.bfloat16
).to("cuda")
vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

with torch.no_grad():
    print("Running decoding.")
    latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
    latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor

    image = vae.decode(latents, return_dict=False)[0]
    image = image_processor.postprocess(image, output_type="pil")
    image[0].save("split_transformer.png")
```

By selectively loading and unloading the models you need at a given stage and sharding the largest models across multiple GPUs, it is possible to run inference with large models on consumer GPUs.
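
As a companion to the tip above about using `device_map` within a [`DiffusionPipeline`], here is a minimal sketch (not part of this diff) of letting Accelerate place a whole pipeline across the available GPUs instead of sharding the components manually:

```py
import torch
from diffusers import FluxPipeline

# Balance all model-level components (text encoders, transformer, VAE)
# across the visible GPUs.
pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    device_map="balanced",
    torch_dtype=torch.bfloat16,
)

# Inspect where each component was placed.
print(pipeline.hf_device_map)
```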

docs/source/en/using-diffusers/callback.md

Lines changed: 3 additions & 4 deletions
@@ -171,14 +171,13 @@ def latents_to_rgb(latents):
     weights = (
         (60, -60, 25, -70),
         (60, -5, 15, -50),
-        (60, 10, -5, -35)
+        (60, 10, -5, -35),
     )

     weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
     biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
     rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
-    image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
-    image_array = image_array.transpose(1, 2, 0)
+    image_array = rgb_tensor.clamp(0, 255).byte().cpu().numpy().transpose(1, 2, 0)

     return Image.fromarray(image_array)

@@ -189,7 +188,7 @@ def latents_to_rgb(latents):
 def decode_tensors(pipe, step, timestep, callback_kwargs):
     latents = callback_kwargs["latents"]

-    image = latents_to_rgb(latents)
+    image = latents_to_rgb(latents[0])
     image.save(f"{step}.png")

     return callback_kwargs
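
For context (not part of this diff), `decode_tensors` is meant to be passed to a pipeline call through `callback_on_step_end`. A minimal usage sketch, assuming an illustrative SDXL checkpoint and the `decode_tensors` function shown above:

```py
import torch
from diffusers import AutoPipelineForText2Image

# Illustrative checkpoint; any pipeline that accepts callback_on_step_end works.
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# decode_tensors receives the intermediate latents at every step
# and saves an RGB preview as {step}.png.
image = pipeline(
    prompt="A croissant shaped like a cute bear.",
    callback_on_step_end=decode_tensors,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]
```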

examples/community/README.md

Lines changed: 31 additions & 0 deletions
@@ -10,6 +10,7 @@ Please also check out our [Community Scripts](https://github.com/huggingface/dif

 | Example | Description | Code Example | Colab | Author |
 |:--------|:------------|:-------------|:------|-------:|
+|Flux with CFG|[Flux with CFG](https://github.com/ToTheBeginning/PuLID/blob/main/docs/pulid_for_flux.md) provides an implementation of using CFG in [Flux](https://blackforestlabs.ai/announcing-black-forest-labs/).|[Flux with CFG](#flux-with-cfg)|NA|[Linoy Tsaban](https://github.com/linoytsaban), [Apolinário](https://github.com/apolinario), and [Sayak Paul](https://github.com/sayakpaul)|
 |Differential Diffusion|[Differential Diffusion](https://github.com/exx8/differential-diffusion) modifies an image according to a text prompt, and according to a map that specifies the amount of change in each region.|[Differential Diffusion](#differential-diffusion)|[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/exx8/differential-diffusion) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/exx8/differential-diffusion/blob/main/examples/SD2.ipynb)|[Eran Levin](https://github.com/exx8) and [Ohad Fried](https://www.ohadf.com/)|
 | HD-Painter | [HD-Painter](https://github.com/Picsart-AI-Research/HD-Painter) enables prompt-faithful and high resolution (up to 2k) image inpainting upon any diffusion-based image inpainting method. | [HD-Painter](#hd-painter) | [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/PAIR/HD-Painter) | [Manukyan Hayk](https://github.com/haikmanukyan) and [Sargsyan Andranik](https://github.com/AndranikSargsyan) |
 | Marigold Monocular Depth Estimation | A universal monocular depth estimator, utilizing Stable Diffusion, delivering sharp predictions in the wild. (See the [project page](https://marigoldmonodepth.github.io) and [full codebase](https://github.com/prs-eth/marigold) for more details.) | [Marigold Depth Estimation](#marigold-depth-estimation) | [![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-yellow)](https://huggingface.co/spaces/toshas/marigold) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12G8reD13DdpMie5ZQlaFNo2WCGeNUH-u?usp=sharing) | [Bingxin Ke](https://github.com/markkua) and [Anton Obukhov](https://github.com/toshas) |
@@ -82,6 +83,36 @@ pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion

## Example usages

### Flux with CFG

Learn more about Flux [here](https://blackforestlabs.ai/announcing-black-forest-labs/). Since Flux doesn't use CFG, this implementation provides one, inspired by the [PuLID Flux adaptation](https://github.com/ToTheBeginning/PuLID/blob/main/docs/pulid_for_flux.md).

Example usage:

```py
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    custom_pipeline="pipeline_flux_with_cfg"
)
pipeline.enable_model_cpu_offload()
prompt = "a watercolor painting of a unicorn"
negative_prompt = "pink"

img = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    true_cfg=1.5,
    guidance_scale=3.5,
    num_images_per_prompt=1,
    generator=torch.manual_seed(0)
).images[0]
img.save("cfg_flux.png")
```

### Differential Diffusion

**Eran Levin, Ohad Fried**
