
Commit 364659b

Merge branch 'main' into vae-tests
2 parents bfe23ff + f6f7afa commit 364659b


78 files changed (+3590, -304 lines)


docker/diffusers-onnxruntime-cuda/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         "onnxruntime-gpu>=1.13.1" \

docker/diffusers-pytorch-compile-cuda/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark && \

docker/diffusers-pytorch-cpu/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark \

docker/diffusers-pytorch-cuda/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m uv pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark && \

docker/diffusers-pytorch-xformers-cuda/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ ENV PATH="/opt/venv/bin:$PATH"
 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
 RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
     python3.10 -m pip install --no-cache-dir \
-        "torch<2.5.0" \
+        torch \
         torchvision \
         torchaudio \
         invisible_watermark && \
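With the `<2.5.0` pin dropped, these images now pick up whichever torch release the resolver selects. The following is a minimal sketch, assuming it is run inside one of the rebuilt CUDA images (it is not part of this commit), to confirm what was actually installed.

```py
# Quick sanity check (assumption: run inside one of the rebuilt CUDA images)
# to confirm which torch release the now-unpinned install resolved to.
import torch

print("torch:", torch.__version__)               # no longer constrained to <2.5.0
print("CUDA available:", torch.cuda.is_available())
```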

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -55,6 +55,8 @@
 - sections:
   - local: using-diffusers/overview_techniques
     title: Overview
+  - local: using-diffusers/create_a_server
+    title: Create a server
   - local: training/distributed_inference
     title: Distributed inference
   - local: using-diffusers/merge_loras

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 23 additions & 10 deletions
@@ -29,16 +29,29 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m
 
 This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).
 
-There are two models available that can be used with the text-to-video and video-to-video CogVideoX pipelines:
-- [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b): The recommended dtype for running this model is `fp16`.
-- [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b): The recommended dtype for running this model is `bf16`.
-
-There is one model available that can be used with the image-to-video CogVideoX pipeline:
-- [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V): The recommended dtype for running this model is `bf16`.
-
-There are two models that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team):
-- [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose): The recommended dtype for running this model is `bf16`.
-- [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose): The recommended dtype for running this model is `bf16`.
+There are three official CogVideoX checkpoints for text-to-video and video-to-video.
+| checkpoints | recommended inference dtype |
+|---|---|
+| [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b) | torch.float16 |
+| [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b) | torch.bfloat16 |
+| [`THUDM/CogVideoX1.5-5b`](https://huggingface.co/THUDM/CogVideoX1.5-5b) | torch.bfloat16 |
+
+There are two official CogVideoX checkpoints available for image-to-video.
+| checkpoints | recommended inference dtype |
+|---|---|
+| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | torch.bfloat16 |
+| [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V) | torch.bfloat16 |
+
+For the CogVideoX 1.5 series:
+- Text-to-video (T2V) works best at a resolution of 1360x768 because it was trained with that specific resolution.
+- Image-to-video (I2V) works for multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. The height/width must be divisible by 16.
+- Both T2V and I2V models support generation with 81 and 161 frames and work best at this value. Exporting videos at 16 FPS is recommended.
+
+There are two official CogVideoX checkpoints that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team).
+| checkpoints | recommended inference dtype |
+|---|---|
+| [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose) | torch.bfloat16 |
+| [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose) | torch.bfloat16 |
 
 ## Inference
 
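The recommended dtypes in the tables above map directly onto the `torch_dtype` argument when loading a pipeline. The snippet below is a minimal sketch (not part of this commit), assuming the `THUDM/CogVideoX-5b` text-to-video checkpoint and the existing `CogVideoXPipeline` and `export_to_video` APIs; the prompt and frame rate are illustrative.

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5b text-to-video checkpoint in its recommended dtype (torch.bfloat16).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Illustrative prompt; export the generated frames to a video file
# (the 1.5-series notes above recommend 16 fps exports).
video = pipe(prompt="a panda playing a guitar in a bamboo forest").frames[0]
export_to_video(video, "output.mp4", fps=16)
```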

docs/source/en/using-diffusers/create_a_server.md

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
+
+# Create a server
+
+Diffusers' pipelines can be used as an inference engine for a server. The server supports concurrent and multithreaded requests to generate images that may be requested by multiple users at the same time.
+
+This guide will show you how to use the [`StableDiffusion3Pipeline`] in a server, but feel free to use any pipeline you want.
+
+
+Start by navigating to the `examples/server` folder and installing all of the dependencies.
+
+```
+pip install .
+pip install -r requirements.txt
+```
+
+Launch the server with the following command.
+
+```
+python server.py
+```
+
+The server is accessed at http://localhost:8000. You can query the model with the following `curl` command.
+```
+curl -X POST -H "Content-Type: application/json" --data '{"model": "something", "prompt": "a kitten in front of a fireplace"}' http://localhost:8000/v1/images/generations
+```
+
+If you need to upgrade some dependencies, you can use either [pip-tools](https://github.com/jazzband/pip-tools) or [uv](https://github.com/astral-sh/uv). For example, upgrade the dependencies with `uv` using the following command.
+
+```
+uv pip compile requirements.in -o requirements.txt
+```
+
+
+The server is built with [FastAPI](https://fastapi.tiangolo.com/async/). The endpoint for `v1/images/generations` is shown below.
+```py
+@app.post("/v1/images/generations")
+async def generate_image(image_input: TextToImageInput):
+    try:
+        loop = asyncio.get_event_loop()
+        scheduler = shared_pipeline.pipeline.scheduler.from_config(shared_pipeline.pipeline.scheduler.config)
+        pipeline = StableDiffusion3Pipeline.from_pipe(shared_pipeline.pipeline, scheduler=scheduler)
+        generator = torch.Generator(device="cuda")
+        generator.manual_seed(random.randint(0, 10000000))
+        output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, generator=generator))
+        logger.info(f"output: {output}")
+        image_url = save_image(output.images[0])
+        return {"data": [{"url": image_url}]}
+    except Exception as e:
+        if isinstance(e, HTTPException):
+            raise e
+        elif hasattr(e, 'message'):
+            raise HTTPException(status_code=500, detail=e.message + traceback.format_exc())
+        raise HTTPException(status_code=500, detail=str(e) + traceback.format_exc())
+```
+The `generate_image` function is defined as asynchronous with the [async](https://fastapi.tiangolo.com/async/) keyword so that FastAPI knows that whatever is happening in this function won't necessarily return a result right away. Once the function reaches a point where it has to await some other [Task](https://docs.python.org/3/library/asyncio-task.html#asyncio.Task), the main thread goes back to answering other HTTP requests. This is shown in the code below with the [await](https://fastapi.tiangolo.com/async/#async-and-await) keyword.
+```py
+output = await loop.run_in_executor(None, lambda: pipeline(image_input.prompt, generator=generator))
+```
+At this point, the execution of the pipeline function is placed onto a [new thread](https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.run_in_executor), and the main thread performs other tasks until a result is returned from the `pipeline`.
+
+Another important aspect of this implementation is creating a `pipeline` from `shared_pipeline`. The goal behind this is to avoid loading the underlying model more than once onto the GPU while still allowing each new request running on a separate thread to have its own generator and scheduler. The scheduler, in particular, is not thread-safe, and it will cause errors like `IndexError: index 21 is out of bounds for dimension 0 with size 21` if you try to use the same scheduler across multiple threads.
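The curl call in the new guide also has a straightforward Python equivalent. The snippet below is a hypothetical client (not part of this commit), assuming the server is running locally on port 8000 and that the `requests` package is installed; the request body and response shape mirror the handler shown above.

```py
# Hypothetical client for the example server above (assumes `pip install requests`).
import requests

response = requests.post(
    "http://localhost:8000/v1/images/generations",
    json={"model": "something", "prompt": "a kitten in front of a fireplace"},
    timeout=300,  # generation can take a while on the first request
)
response.raise_for_status()

# The handler returns {"data": [{"url": ...}]}; fetch the image it points to.
image_url = response.json()["data"][0]["url"]
with open("kitten.png", "wb") as f:
    f.write(requests.get(image_url).content)
```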

examples/advanced_diffusion_training/train_dreambooth_lora_flux_advanced.py

Lines changed: 12 additions & 2 deletions
@@ -2154,6 +2154,7 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
 
 # encode batch prompts when custom prompts are provided for each image -
 if train_dataset.custom_instance_prompts:
+    elems_to_repeat = 1
     if freeze_text_encoder:
         prompt_embeds, pooled_prompt_embeds, text_ids = compute_text_embeddings(
             prompts, text_encoders, tokenizers
@@ -2168,17 +2169,21 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
         max_sequence_length=args.max_sequence_length,
         add_special_tokens=add_special_tokens_t5,
     )
+else:
+    elems_to_repeat = len(prompts)
 
 if not freeze_text_encoder:
     prompt_embeds, pooled_prompt_embeds, text_ids = encode_prompt(
         text_encoders=[text_encoder_one, text_encoder_two],
         tokenizers=[None, None],
-        text_input_ids_list=[tokens_one, tokens_two],
+        text_input_ids_list=[
+            tokens_one.repeat(elems_to_repeat, 1),
+            tokens_two.repeat(elems_to_repeat, 1),
+        ],
         max_sequence_length=args.max_sequence_length,
         device=accelerator.device,
         prompt=prompts,
     )
-
 # Convert images to latent space
 if args.cache_latents:
     model_input = latents_cache[step].sample()
@@ -2371,6 +2376,9 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
     epoch=epoch,
     torch_dtype=weight_dtype,
 )
+images = None
+del pipeline
+
 if freeze_text_encoder:
     del text_encoder_one, text_encoder_two
     free_memory()
@@ -2448,6 +2456,8 @@ def get_sigmas(timesteps, n_dim=4, dtype=torch.float32):
     commit_message="End of training",
     ignore_patterns=["step_*", "epoch_*"],
 )
+images = None
+del pipeline
 
 accelerator.end_training()
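The `elems_to_repeat` additions above tile a single shared prompt's token ids across the batch when per-image custom prompts are not provided. The snippet below is a standalone sketch of the tensor shapes involved, assuming a batch size of 4 and a 77-token prompt; it is illustrative and not code from the training script.

```py
import torch

# Without custom per-image prompts, one shared prompt of shape [1, seq_len]
# is repeated to [batch_size, seq_len] before being passed to encode_prompt.
batch_size = 4                                      # stand-in for len(prompts)
tokens_one = torch.randint(0, 1000, (1, 77))        # stand-in for a tokenized shared prompt
elems_to_repeat = batch_size
print(tokens_one.repeat(elems_to_repeat, 1).shape)  # torch.Size([4, 77])
```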
