
Commit a298915

Merge branch 'mochi-quality' of https://github.com/huggingface/diffusers into mochi-quality
2 parents 77f9d19 + b904325

File tree: 178 files changed (+11500 / -777 lines)


docker/diffusers-onnxruntime-cuda/Dockerfile
Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         "onnxruntime-gpu>=1.13.1" \

docker/diffusers-pytorch-compile-cuda/Dockerfile
Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark && \

docker/diffusers-pytorch-cpu/Dockerfile
Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark \

docker/diffusers-pytorch-cuda/Dockerfile
Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark && \

docker/diffusers-pytorch-xformers-cuda/Dockerfile
Lines changed: 1 addition & 1 deletion

@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark && \

docs/source/en/_toctree.yml
Lines changed: 2 additions & 0 deletions

@@ -55,6 +55,8 @@
 - sections:
   - local: using-diffusers/overview_techniques
     title: Overview
+  - local: using-diffusers/create_a_server
+    title: Create a server
   - local: training/distributed_inference
     title: Distributed inference
   - local: using-diffusers/merge_loras

docs/source/en/api/pipelines/cogvideox.md
Lines changed: 24 additions & 8 deletions

@@ -29,16 +29,32 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m

 This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

-There are two models available that can be used with the text-to-video and video-to-video CogVideoX pipelines:
-- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b): The recommended dtype for running this model is `fp16`.
-- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b): The recommended dtype for running this model is `bf16`.
+There are three official CogVideoX checkpoints for text-to-video and video-to-video.

-There is one model available that can be used with the image-to-video CogVideoX pipeline:
-- [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V): The recommended dtype for running this model is `bf16`.
+| checkpoints | recommended inference dtype |
+|:---:|:---:|
+| [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b) | torch.float16 |
+| [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b) | torch.bfloat16 |
+| [`THUDM/CogVideoX1.5-5b`](https://huggingface.co/THUDM/CogVideoX1.5-5b) | torch.bfloat16 |

-There are two models that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team):
-- [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose): The recommended dtype for running this model is `bf16`.
-- [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose): The recommended dtype for running this model is `bf16`.
+There are two official CogVideoX checkpoints available for image-to-video.
+
+| checkpoints | recommended inference dtype |
+|:---:|:---:|
+| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | torch.bfloat16 |
+| [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V) | torch.bfloat16 |
+
+For the CogVideoX 1.5 series:
+- Text-to-video (T2V) works best at a resolution of 1360x768 because it was trained with that specific resolution.
+- Image-to-video (I2V) works for multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. The height/width must be divisible by 16.
+- Both T2V and I2V models support generation with 81 and 161 frames and work best at these values. Exporting videos at 16 FPS is recommended.
+
+There are two official CogVideoX checkpoints that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team).
+
+| checkpoints | recommended inference dtype |
+|:---:|:---:|
+| [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose) | torch.bfloat16 |
+| [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose) | torch.bfloat16 |

 ## Inference
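(Editorial aside, not part of this commit: a minimal sketch of how the recommendations above might be applied — bf16, a 1360x768 resolution, 81 frames, and a 16 FPS export — assuming the `CogVideoXPipeline` API and the `THUDM/CogVideoX1.5-5b` checkpoint from the table.)

```python
# Editorial sketch (assumptions noted above): load a CogVideoX 1.5 checkpoint with the
# recommended dtype, generate at the recommended resolution/frame count, export at 16 FPS.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5b", torch_dtype=torch.bfloat16).to("cuda")

video = pipe(
    prompt="A panda strumming a tiny guitar in a bamboo forest",
    height=768,
    width=1360,
    num_frames=81,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```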

docs/source/en/api/pipelines/controlnet_sd3.md
Lines changed: 1 addition & 0 deletions

@@ -28,6 +28,7 @@ This controlnet code is mainly implemented by [The InstantX Team](https://huggin
 | ControlNet type | Developer | Link |
 | -------- | ---------- | ---- |
 | Canny | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Canny) |
+| Depth | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Depth) |
 | Pose | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Pose) |
 | Tile | [The InstantX Team](https://huggingface.co/InstantX) | [Link](https://huggingface.co/InstantX/SD3-Controlnet-Tile) |
 | Inpainting | [The AlimamaCreative Team](https://huggingface.co/alimama-creative) | [link](https://huggingface.co/alimama-creative/SD3-Controlnet-Inpainting) |
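(Editorial aside, not part of this commit: a minimal sketch of how a row of this table could be used, assuming the `SD3ControlNetModel`/`StableDiffusion3ControlNetPipeline` API; the base model id and the local `depth.png` control image are illustrative assumptions.)

```python
# Editorial sketch: load the newly listed Depth ControlNet and condition SD3 on a depth map.
# "depth.png" is a hypothetical, already-computed depth map.
import torch
from diffusers import SD3ControlNetModel, StableDiffusion3ControlNetPipeline
from diffusers.utils import load_image

controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Depth", torch_dtype=torch.float16)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

control_image = load_image("depth.png")
image = pipe(
    prompt="a cozy reading nook with warm lighting",
    control_image=control_image,
    controlnet_conditioning_scale=0.5,
).images[0]
image.save("sd3-controlnet-depth.png")
```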

docs/source/en/api/pipelines/flux.md
Lines changed: 161 additions & 4 deletions

@@ -22,12 +22,20 @@ Flux can be quite expensive to run on consumer hardware devices. However, you ca

 </Tip>

-Flux comes in two variants:
+Flux comes in the following variants:

-* Timestep-distilled (`black-forest-labs/FLUX.1-schnell`)
-* Guidance-distilled (`black-forest-labs/FLUX.1-dev`)
+| model type | model id |
+|:----------:|:--------:|
+| Timestep-distilled | [`black-forest-labs/FLUX.1-schnell`](https://huggingface.co/black-forest-labs/FLUX.1-schnell) |
+| Guidance-distilled | [`black-forest-labs/FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev) |
+| Fill Inpainting/Outpainting (Guidance-distilled) | [`black-forest-labs/FLUX.1-Fill-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev) |
+| Canny Control (Guidance-distilled) | [`black-forest-labs/FLUX.1-Canny-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev) |
+| Depth Control (Guidance-distilled) | [`black-forest-labs/FLUX.1-Depth-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev) |
+| Canny Control (LoRA) | [`black-forest-labs/FLUX.1-Canny-dev-lora`](https://huggingface.co/black-forest-labs/FLUX.1-Canny-dev-lora) |
+| Depth Control (LoRA) | [`black-forest-labs/FLUX.1-Depth-dev-lora`](https://huggingface.co/black-forest-labs/FLUX.1-Depth-dev-lora) |
+| Redux (Adapter) | [`black-forest-labs/FLUX.1-Redux-dev`](https://huggingface.co/black-forest-labs/FLUX.1-Redux-dev) |

-Both checkpoints have slightly difference usage which we detail below.
+All checkpoints have different usage which we detail below.

 ### Timestep-distilled

@@ -77,7 +85,132 @@ out = pipe(
 out.save("image.png")
 ```

+### Fill Inpainting/Outpainting
+
+* Flux Fill pipeline does not require `strength` as an input like regular inpainting pipelines.
+* It supports both inpainting and outpainting.
+
+```python
+import torch
+from diffusers import FluxFillPipeline
+from diffusers.utils import load_image
+
+image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/cup.png")
+mask = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/cup_mask.png")
+
+repo_id = "black-forest-labs/FLUX.1-Fill-dev"
+pipe = FluxFillPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16).to("cuda")
+
+image = pipe(
+    prompt="a white paper cup",
+    image=image,
+    mask_image=mask,
+    height=1632,
+    width=1232,
+    max_sequence_length=512,
+    generator=torch.Generator("cpu").manual_seed(0)
+).images[0]
+image.save("output.png")
+```
+
+### Canny Control
+
+**Note:** `black-forest-labs/Flux.1-Canny-dev` is _not_ a [`ControlNetModel`] model. ControlNet models are a separate component from the UNet/Transformer whose residuals are added to the actual underlying model. Canny Control is an alternate architecture that achieves effectively the same results as a ControlNet model would, by using channel-wise concatenation with input control condition and ensuring the transformer learns structure control by following the condition as closely as possible.
+
+```python
+# !pip install -U controlnet-aux
+import torch
+from controlnet_aux import CannyDetector
+from diffusers import FluxControlPipeline
+from diffusers.utils import load_image
+
+pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", torch_dtype=torch.bfloat16).to("cuda")
+
+prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
+control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
+
+processor = CannyDetector()
+control_image = processor(control_image, low_threshold=50, high_threshold=200, detect_resolution=1024, image_resolution=1024)
+
+image = pipe(
+    prompt=prompt,
+    control_image=control_image,
+    height=1024,
+    width=1024,
+    num_inference_steps=50,
+    guidance_scale=30.0,
+).images[0]
+image.save("output.png")
+```
+
+### Depth Control
+
+**Note:** `black-forest-labs/Flux.1-Depth-dev` is _not_ a ControlNet model. [`ControlNetModel`] models are a separate component from the UNet/Transformer whose residuals are added to the actual underlying model. Depth Control is an alternate architecture that achieves effectively the same results as a ControlNet model would, by using channel-wise concatenation with input control condition and ensuring the transformer learns structure control by following the condition as closely as possible.
+
+```python
+# !pip install git+https://github.com/asomoza/image_gen_aux.git
+import torch
+from diffusers import FluxControlPipeline, FluxTransformer2DModel
+from diffusers.utils import load_image
+from image_gen_aux import DepthPreprocessor
+
+pipe = FluxControlPipeline.from_pretrained("black-forest-labs/FLUX.1-Depth-dev", torch_dtype=torch.bfloat16).to("cuda")
+
+prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
+control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")
+
+processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
+control_image = processor(control_image)[0].convert("RGB")
+
+image = pipe(
+    prompt=prompt,
+    control_image=control_image,
+    height=1024,
+    width=1024,
+    num_inference_steps=30,
+    guidance_scale=10.0,
+    generator=torch.Generator().manual_seed(42),
+).images[0]
+image.save("output.png")
+```
+
+### Redux
+
+* Flux Redux pipeline is an adapter for FLUX.1 base models. It can be used with both flux-dev and flux-schnell, for image-to-image generation.
+* You can first use the `FluxPriorReduxPipeline` to get the `prompt_embeds` and `pooled_prompt_embeds`, and then feed them into the `FluxPipeline` for image-to-image generation.
+* When using `FluxPriorReduxPipeline` with a base pipeline, you can set `text_encoder=None` and `text_encoder_2=None` in the base pipeline to save VRAM.
+
+```python
+import torch
+from diffusers import FluxPriorReduxPipeline, FluxPipeline
+from diffusers.utils import load_image
+device = "cuda"
+dtype = torch.bfloat16
+
+
+repo_redux = "black-forest-labs/FLUX.1-Redux-dev"
+repo_base = "black-forest-labs/FLUX.1-dev"
+pipe_prior_redux = FluxPriorReduxPipeline.from_pretrained(repo_redux, torch_dtype=dtype).to(device)
+pipe = FluxPipeline.from_pretrained(
+    repo_base,
+    text_encoder=None,
+    text_encoder_2=None,
+    torch_dtype=torch.bfloat16
+).to(device)
+
+image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy/img5.png")
+pipe_prior_output = pipe_prior_redux(image)
+images = pipe(
+    guidance_scale=2.5,
+    num_inference_steps=50,
+    generator=torch.Generator("cpu").manual_seed(0),
+    **pipe_prior_output,
+).images
+images[0].save("flux-redux.png")
+```
+
 ## Running FP16 inference
+
 Flux can generate high-quality images with FP16 (i.e. to accelerate inference on Turing/Volta GPUs) but produces different outputs compared to FP32/BF16. The issue is that some activations in the text encoders have to be clipped when running in FP16, which affects the overall image. Forcing text encoders to run with FP32 inference thus removes this output difference. See [here](https://github.com/huggingface/diffusers/pull/9097#issuecomment-2272292516) for details.

 FP16 inference code:
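(Editorial aside: the document's actual FP16 example sits outside this hunk. A minimal sketch of the idea described above — run the text encoders in FP32 and the rest of the pipeline in FP16 — assuming `FluxPipeline.encode_prompt` returns `(prompt_embeds, pooled_prompt_embeds, text_ids)`.)

```python
# Editorial sketch (assumptions noted above): keep both text encoders in FP32 to avoid
# activation clipping, then hand FP16 embeddings to the FP16 transformer/VAE.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16).to("cuda")
pipe.text_encoder.to(torch.float32)    # CLIP text encoder in full precision
pipe.text_encoder_2.to(torch.float32)  # T5 text encoder in full precision

with torch.no_grad():
    prompt_embeds, pooled_prompt_embeds, _ = pipe.encode_prompt(
        prompt="a tiny astronaut hatching from an egg on the moon",
        prompt_2=None,
        max_sequence_length=512,
    )

image = pipe(
    prompt_embeds=prompt_embeds.to(torch.float16),
    pooled_prompt_embeds=pooled_prompt_embeds.to(torch.float16),
    guidance_scale=3.5,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-fp16.png")
```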
@@ -188,3 +321,27 @@ image.save("flux-fp8-dev.png")
 [[autodoc]] FluxControlNetImg2ImgPipeline
 - all
 - __call__
+
+## FluxControlPipeline
+
+[[autodoc]] FluxControlPipeline
+- all
+- __call__
+
+## FluxControlImg2ImgPipeline
+
+[[autodoc]] FluxControlImg2ImgPipeline
+- all
+- __call__
+
+## FluxPriorReduxPipeline
+
+[[autodoc]] FluxPriorReduxPipeline
+- all
+- __call__
+
+## FluxFillPipeline
+
+[[autodoc]] FluxFillPipeline
+- all
+- __call__
docs/source/en/using-diffusers/create_a_server.md (new file)
Lines changed: 61 additions & 0 deletions

@@ -0,0 +1,61 @@
+
+# Create a server
+
+Diffusers' pipelines can be used as an inference engine for a server. This setup supports concurrent and multithreaded requests to generate images that may be requested by multiple users at the same time.
+
+This guide will show you how to use the [`StableDiffusion3Pipeline`] in a server, but feel free to use any pipeline you want.
+
+
+Start by navigating to the `examples/server` folder and installing all of the dependencies.
+
+```sh
+pip install .
+pip install -r requirements.txt
+```
+
+Launch the server with the following command.
+
+```sh
+python server.py
+```
+
+The server is accessed at http://localhost:8000. You can curl this model with the following command.
+```
+curl -X POST -H "Content-Type: application/json" --data '{"model": "something", "prompt": "a kitten in front of a fireplace"}' http://localhost:8000/v1/images/generations
+```
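(Editorial aside, not part of the file added in this diff: the same request issued from Python with the `requests` library, assuming the server above is running locally.)

```python
# Editorial sketch: equivalent of the curl command above, sent from Python.
import requests

payload = {"model": "something", "prompt": "a kitten in front of a fireplace"}
response = requests.post("http://localhost:8000/v1/images/generations", json=payload, timeout=300)
response.raise_for_status()
print(response.json()["data"][0]["url"])  # URL of the generated image returned by the server
```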
+
+If you need to upgrade some dependencies, you can use either [pip-tools](https://github.com/jazzband/pip-tools) or [uv](https://github.com/astral-sh/uv). For example, upgrade the dependencies with `uv` using the following command.
+
+```
+uv pip compile requirements.in -o requirements.txt
+```
+
+
+The server is built with [FastAPI](https://fastapi.tiangolo.com/async/). The endpoint for `v1/images/generations` is shown below.
+```py
+@app.post("/v1/images/generations")
+async def generate_image(image_input: TextToImageInput):
+    try:
+        loop = asyncio.get_event_loop()
+        scheduler = shared_pipeline.pipeline.scheduler.from_config(shared_pipeline.pipeline.scheduler.config)
+        pipeline = StableDiffusion3Pipeline.from_pipe(shared_pipeline.pipeline, scheduler=scheduler)
+        generator = torch.Generator(device="cuda")
+        generator.manual_seed(random.randint(0, 10000000))
+        output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, generator=generator))
+        logger.info(f"output: {output}")
+        image_url = save_image(output.images[0])
+        return {"data": [{"url": image_url}]}
+    except Exception as e:
+        if isinstance(e, HTTPException):
+            raise e
+        elif hasattr(e, 'message'):
+            raise HTTPException(status_code=500, detail=e.message + traceback.format_exc())
+        raise HTTPException(status_code=500, detail=str(e) + traceback.format_exc())
+```
+The `generate_image` function is defined as asynchronous with the [async](https://fastapi.tiangolo.com/async/) keyword so that FastAPI knows that whatever is happening in this function won't necessarily return a result right away. Once the function reaches a point where it needs to await some other [Task](https://docs.python.org/3/library/asyncio-task.html#asyncio.Task), the main thread goes back to answering other HTTP requests. This is shown in the code below with the [await](https://fastapi.tiangolo.com/async/#async-and-await) keyword.
+```py
+output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, generator=generator))
+```
+At this point, the execution of the pipeline function is placed onto a [new thread](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor), and the main thread performs other things until a result is returned from the `pipeline`.
+
+Another important aspect of this implementation is creating a `pipeline` from `shared_pipeline`. The goal behind this is to avoid loading the underlying model more than once onto the GPU while still allowing each new request, running on its own thread, to have its own generator and scheduler. The scheduler, in particular, is not thread-safe, and it will cause errors like `IndexError: index 21 is out of bounds for dimension 0 with size 21` if you try to use the same scheduler across multiple threads.
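(Editorial aside, not part of this commit: `server.py` itself is not shown in this diff, so the following is only a hypothetical sketch of how `shared_pipeline` could be set up so that the model is loaded onto the GPU once at startup; the class name and model id are assumptions.)

```python
# Editorial sketch: load the model once at startup; per-request copies are then created with
# from_pipe() plus a fresh scheduler and generator, as in the endpoint above.
import torch
from diffusers import StableDiffusion3Pipeline

class SharedPipeline:  # hypothetical holder, stands in for the object referenced as `shared_pipeline`
    def __init__(self, model_id: str = "stabilityai/stable-diffusion-3-medium-diffusers"):
        self.pipeline = StableDiffusion3Pipeline.from_pretrained(model_id, torch_dtype=torch.float16)
        self.pipeline.to("cuda")

shared_pipeline = SharedPipeline()
```

Because `from_pipe` reuses the already-loaded weights, each request gets its own scheduler and generator without the model being loaded onto the GPU a second time.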
