
Commit 3f44fa1

Merge branch 'main' into remote-vae-wan-decode
2 parents a1eacb3 + 3be6706 commit 3f44fa1

File tree

59 files changed (+4696 additions, −254 deletions)


.github/workflows/benchmark.yml

Lines changed: 1 addition & 0 deletions

@@ -38,6 +38,7 @@ jobs:
           python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
           python -m uv pip install -e [quality,test]
           python -m uv pip install pandas peft
+          python -m uv pip uninstall transformers && python -m uv pip install transformers==4.48.0
       - name: Environment
         run: |
           python utils/print_env.py

.github/workflows/nightly_tests.yml

Lines changed: 7 additions & 0 deletions

@@ -414,12 +414,16 @@ jobs:
         config:
           - backend: "bitsandbytes"
             test_location: "bnb"
+            additional_deps: ["peft"]
           - backend: "gguf"
             test_location: "gguf"
+            additional_deps: []
           - backend: "torchao"
             test_location: "torchao"
+            additional_deps: []
           - backend: "optimum_quanto"
             test_location: "quanto"
+            additional_deps: []
     runs-on:
       group: aws-g6e-xlarge-plus
     container:
@@ -437,6 +441,9 @@ jobs:
           python -m venv /opt/venv && export PATH="/opt/venv/bin:$PATH"
           python -m uv pip install -e [quality,test]
           python -m uv pip install -U ${{ matrix.config.backend }}
+          if [ "${{ join(matrix.config.additional_deps, ' ') }}" != "" ]; then
+            python -m uv pip install ${{ join(matrix.config.additional_deps, ' ') }}
+          fi
           python -m uv pip install pytest-reportlog
       - name: Environment
         run: |

.github/workflows/pr_tests_gpu.yml

Lines changed: 44 additions & 0 deletions

@@ -28,7 +28,51 @@ env:
   PIPELINE_USAGE_CUTOFF: 1000000000 # set high cutoff so that only always-test pipelines run
 
 jobs:
+  check_code_quality:
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.8"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install .[quality]
+      - name: Check quality
+        run: make quality
+      - name: Check if failure
+        if: ${{ failure() }}
+        run: |
+          echo "Quality check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make style && make quality'" >> $GITHUB_STEP_SUMMARY
+
+  check_repository_consistency:
+    needs: check_code_quality
+    runs-on: ubuntu-22.04
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: "3.8"
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install .[quality]
+      - name: Check repo consistency
+        run: |
+          python utils/check_copies.py
+          python utils/check_dummies.py
+          python utils/check_support_list.py
+          make deps_table_check_updated
+      - name: Check if failure
+        if: ${{ failure() }}
+        run: |
+          echo "Repo consistency check failed. Please ensure the right dependency versions are installed with 'pip install -e .[quality]' and run 'make fix-copies'" >> $GITHUB_STEP_SUMMARY
+
   setup_torch_cuda_pipeline_matrix:
+    needs: [check_code_quality, check_repository_consistency]
     name: Setup Torch Pipelines CUDA Slow Tests Matrix
     runs-on:
       group: aws-general-8-plus

docs/source/en/api/pipelines/ltx_video.md

Lines changed: 6 additions & 0 deletions

@@ -196,6 +196,12 @@ export_to_video(video, "ship.mp4", fps=24)
 - all
 - __call__
 
+## LTXConditionPipeline
+
+[[autodoc]] LTXConditionPipeline
+- all
+- __call__
+
 ## LTXPipelineOutput
 
 [[autodoc]] pipelines.ltx.pipeline_output.LTXPipelineOutput

docs/source/en/api/pipelines/lumina.md

Lines changed: 7 additions & 7 deletions

@@ -58,10 +58,10 @@ Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fa
 First, load the pipeline:
 
 ```python
-from diffusers import LuminaText2ImgPipeline
+from diffusers import LuminaPipeline
 import torch
 
-pipeline = LuminaText2ImgPipeline.from_pretrained(
+pipeline = LuminaPipeline.from_pretrained(
 	"Alpha-VLLM/Lumina-Next-SFT-diffusers", torch_dtype=torch.bfloat16
 ).to("cuda")
 ```
@@ -86,11 +86,11 @@ image = pipeline(prompt="Upper body of a young woman in a Victorian-era outfit w
 
 Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
 
-Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LuminaText2ImgPipeline`] for inference with bitsandbytes.
+Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`LuminaPipeline`] for inference with bitsandbytes.
 
 ```py
 import torch
-from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, Transformer2DModel, LuminaText2ImgPipeline
+from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, Transformer2DModel, LuminaPipeline
 from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
 
 quant_config = BitsAndBytesConfig(load_in_8bit=True)
@@ -109,7 +109,7 @@ transformer_8bit = Transformer2DModel.from_pretrained(
 	torch_dtype=torch.float16,
 )
 
-pipeline = LuminaText2ImgPipeline.from_pretrained(
+pipeline = LuminaPipeline.from_pretrained(
 	"Alpha-VLLM/Lumina-Next-SFT-diffusers",
 	text_encoder=text_encoder_8bit,
 	transformer=transformer_8bit,
@@ -122,9 +122,9 @@ image = pipeline(prompt).images[0]
 image.save("lumina.png")
 ```
 
-## LuminaText2ImgPipeline
+## LuminaPipeline
 
-[[autodoc]] LuminaText2ImgPipeline
+[[autodoc]] LuminaPipeline
 - all
 - __call__

docs/source/en/api/pipelines/lumina2.md

Lines changed: 6 additions & 6 deletions

@@ -36,14 +36,14 @@ Single file loading for Lumina Image 2.0 is available for the `Lumina2Transforme
 ```python
 import torch
-from diffusers import Lumina2Transformer2DModel, Lumina2Text2ImgPipeline
+from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline
 
 ckpt_path = "https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0/blob/main/consolidated.00-of-01.pth"
 transformer = Lumina2Transformer2DModel.from_single_file(
 	ckpt_path, torch_dtype=torch.bfloat16
 )
 
-pipe = Lumina2Text2ImgPipeline.from_pretrained(
+pipe = Lumina2Pipeline.from_pretrained(
 	"Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
 )
 pipe.enable_model_cpu_offload()
@@ -60,7 +60,7 @@ image.save("lumina-single-file.png")
 GGUF Quantized checkpoints for the `Lumina2Transformer2DModel` can be loaded via `from_single_file` with the `GGUFQuantizationConfig`
 
 ```python
-from diffusers import Lumina2Transformer2DModel, Lumina2Text2ImgPipeline, GGUFQuantizationConfig
+from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline, GGUFQuantizationConfig
 
 ckpt_path = "https://huggingface.co/calcuis/lumina-gguf/blob/main/lumina2-q4_0.gguf"
 transformer = Lumina2Transformer2DModel.from_single_file(
@@ -69,7 +69,7 @@ transformer = Lumina2Transformer2DModel.from_single_file(
 	torch_dtype=torch.bfloat16,
 )
 
-pipe = Lumina2Text2ImgPipeline.from_pretrained(
+pipe = Lumina2Pipeline.from_pretrained(
 	"Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
 )
 pipe.enable_model_cpu_offload()
@@ -80,8 +80,8 @@ image = pipe(
 image.save("lumina-gguf.png")
 ```
 
-## Lumina2Text2ImgPipeline
+## Lumina2Pipeline
 
-[[autodoc]] Lumina2Text2ImgPipeline
+[[autodoc]] Lumina2Pipeline
 - all
 - __call__

Lines changed: 201 additions & 0 deletions (new file)

@@ -0,0 +1,201 @@

# Training CogView4 Control

This (experimental) example shows how to train Control LoRAs with [CogView4](https://huggingface.co/THUDM/CogView4-6B) by conditioning it on additional structural controls (such as depth maps or poses). We also provide a script for full fine-tuning; refer to [this section](#full-fine-tuning).

To incorporate the additional condition latents, we expand the input features of CogView4 from 64 to 128. The first 64 channels correspond to the original input latents to be denoised, while the latter 64 channels correspond to the control latents. This expansion happens on the `patch_embed` layer, where the combined latents are projected to the feature dimension expected by the rest of the network. Inference is performed with the `CogView4ControlPipeline`.
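To make the wiring concrete, here is a minimal, hypothetical sketch of the idea (not the actual CogView4 implementation): the control latents are concatenated with the noisy latents along the channel dimension, and the patch embedding's `in_features` is doubled to accept them. The channel count, patch size, and hidden width below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative numbers only -- not CogView4's real configuration.
latent_channels = 16   # assumed VAE latent channels
patch_size = 2         # 2x2 patches -> 16 * 2 * 2 = 64 input features originally
hidden_dim = 2560      # assumed transformer width

# Control variant: patch_embed's in_features doubles from 64 to 128.
patch_embed = nn.Linear(2 * latent_channels * patch_size * patch_size, hidden_dim)

noisy_latents = torch.randn(1, latent_channels, 128, 128)    # latents to denoise
control_latents = torch.randn(1, latent_channels, 128, 128)  # encoded control image

# Concatenate along the channel dimension before patchification.
model_input = torch.cat([noisy_latents, control_latents], dim=1)  # (1, 32, 128, 128)

# Patchify into 2x2 patches; the first 64 features of each token come from the
# noisy latents, the last 64 from the control latents.
patches = model_input.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3).flatten(1, 2)  # (1, 4096, 128)
tokens = patch_embed(patches)                                         # (1, 4096, 2560)
print(tokens.shape)
```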
> [!NOTE]
> **Gated model**
>
> As the model is gated, before using it with diffusers you first need to go to the [CogView4 Hugging Face page](https://huggingface.co/THUDM/CogView4-6B), fill in the form, and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate. Use the command below to log in:

```bash
huggingface-cli login
```

The example command below shows how to launch fine-tuning for pose conditions. The dataset being used here ([`raulc0399/open_pose_controlnet`](https://huggingface.co/datasets/raulc0399/open_pose_controlnet)) already contains the pose conditions of the original images, so we don't have to compute them.

```bash
accelerate launch train_control_lora_cogview4.py \
  --pretrained_model_name_or_path="THUDM/CogView4-6B" \
  --dataset_name="raulc0399/open_pose_controlnet" \
  --output_dir="pose-control-lora" \
  --mixed_precision="bf16" \
  --train_batch_size=1 \
  --rank=64 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=5000 \
  --validation_image="openpose.png" \
  --validation_prompt="A couple, 4k photo, highly detailed" \
  --offload \
  --seed="0" \
  --push_to_hub
```

`openpose.png` comes from [here](https://huggingface.co/Adapter/t2iadapter/resolve/main/openpose.png).

You need to install `diffusers` from the branch of [this PR](https://github.com/huggingface/diffusers/pull/9999). Once it is merged, install `diffusers` from `main` instead.

The training script exposes additional CLI args that might be useful to experiment with (a sketch of how they could map to a LoRA config follows the list):

* `use_lora_bias`: When set, additionally trains the biases of the `lora_B` layers.
* `train_norm_layers`: When set, additionally trains the normalization scales; saving and loading of these parameters is handled automatically.
* `lora_layers`: Specifies the layers to apply LoRA to. If you specify `"all-linear"`, LoRA is attached to all linear layers.
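As a loose illustration of how these options could translate into a `peft` `LoraConfig` (the module names, rank, and exact construction below are assumptions rather than the script's actual code; `lora_bias` requires a recent `peft` release):

```python
from peft import LoraConfig

# Hypothetical mapping of the CLI args above onto a LoRA config.
lora_layers = "attn.to_q,attn.to_k,attn.to_v,attn.to_out.0"  # example --lora_layers value
target_modules = (
    "all-linear"
    if lora_layers == "all-linear"
    else [name.strip() for name in lora_layers.split(",")]
)

lora_config = LoraConfig(
    r=64,                          # --rank
    lora_alpha=64,
    init_lora_weights="gaussian",
    target_modules=target_modules,
    lora_bias=False,               # True when --use_lora_bias is passed (assumed)
)
print(lora_config)
```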
### Training with DeepSpeed

It's possible to train with [DeepSpeed](https://github.com/microsoft/DeepSpeed), specifically leveraging its ZeRO-2 optimization. To use it, save the following config to a YAML file (feel free to modify it as needed):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

And then, while launching training, pass the config file:

```bash
accelerate launch --config_file=CONFIG_FILE.yaml ...
```

### Inference

The pose images in our dataset were computed with the [`controlnet_aux`](https://github.com/huggingface/controlnet_aux) library. Let's install it first:

```bash
pip install controlnet_aux
```

And then we are ready:

```py
from controlnet_aux import OpenposeDetector
from diffusers import CogView4ControlPipeline
from diffusers.utils import load_image
from PIL import Image
import numpy as np
import torch

pipe = CogView4ControlPipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("...")  # change this.

open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

# prepare pose condition.
url = "https://huggingface.co/Adapter/t2iadapter/resolve/main/people.jpg"
image = load_image(url)
image = open_pose(image, detect_resolution=512, image_resolution=1024)
image = np.array(image)[:, :, ::-1]
image = Image.fromarray(np.uint8(image))

prompt = "A couple, 4k photo, highly detailed"

gen_images = pipe(
    prompt=prompt,
    control_image=image,
    num_inference_steps=50,
    joint_attention_kwargs={"scale": 0.9},
    guidance_scale=25.,
).images[0]
gen_images.save("output.png")
```

## Full fine-tuning

We provide a non-LoRA version of the training script, `train_control_cogview4.py`. Here is an example command:

```bash
accelerate launch --config_file=accelerate_ds2.yaml train_control_cogview4.py \
  --pretrained_model_name_or_path="THUDM/CogView4-6B" \
  --dataset_name="raulc0399/open_pose_controlnet" \
  --output_dir="pose-control" \
  --mixed_precision="bf16" \
  --train_batch_size=2 \
  --dataloader_num_workers=4 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --proportion_empty_prompts=0.2 \
  --learning_rate=5e-5 \
  --adam_weight_decay=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=1000 \
  --checkpointing_steps=1000 \
  --max_train_steps=10000 \
  --validation_steps=200 \
  --validation_image "2_pose_1024.jpg" "3_pose_1024.jpg" \
  --validation_prompt "two friends sitting by each other enjoying a day at the park, full hd, cinematic" "person enjoying a day at the park, full hd, cinematic" \
  --offload \
  --seed="0" \
  --push_to_hub
```

Change `validation_image` and `validation_prompt` as needed.

For inference, this time we will run:

```py
from controlnet_aux import OpenposeDetector
from diffusers import CogView4ControlPipeline, CogView4Transformer2DModel
from diffusers.utils import load_image
from PIL import Image
import numpy as np
import torch

transformer = CogView4Transformer2DModel.from_pretrained("...")  # change this.
pipe = CogView4ControlPipeline.from_pretrained(
    "THUDM/CogView4-6B", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

# prepare pose condition.
url = "https://huggingface.co/Adapter/t2iadapter/resolve/main/people.jpg"
image = load_image(url)
image = open_pose(image, detect_resolution=512, image_resolution=1024)
image = np.array(image)[:, :, ::-1]
image = Image.fromarray(np.uint8(image))

prompt = "A couple, 4k photo, highly detailed"

gen_images = pipe(
    prompt=prompt,
    control_image=image,
    num_inference_steps=50,
    guidance_scale=25.,
).images[0]
gen_images.save("output.png")
```

## Things to note

* The scripts provided in this directory are experimental and educational. This means we may have to tweak things to get good results on a given condition. We believe this is best done with the community 🤗
* The scripts are not memory-optimized, but when `--offload` is specified, we offload the VAE and the text encoders to the CPU while they are not in use.
* We can extract LoRAs from the fully fine-tuned model. While we currently don't provide any utilities for that, users are welcome to refer to [this script](https://github.com/Stability-AI/stability-ComfyUI-nodes/blob/master/control_lora_create.py) that provides similar functionality; a rough sketch of the underlying idea follows this list.
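For intuition, here is a toy sketch of that extraction idea, assuming plain linear layers: take the weight difference between the fine-tuned and base weights and compress it with a truncated SVD into `lora_B @ lora_A` factors. This is an illustration of the principle, not a drop-in utility for CogView4 checkpoints.

```python
import torch

def extract_lora(base_weight: torch.Tensor, tuned_weight: torch.Tensor, rank: int = 64):
    """Approximate (tuned - base) with a rank-`rank` product lora_B @ lora_A."""
    delta = (tuned_weight - base_weight).float()           # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    lora_B = U * S.sqrt()                                   # (out_features, rank)
    lora_A = S.sqrt().unsqueeze(1) * Vh                     # (rank, in_features)
    return lora_A, lora_B

# Toy check on random weights.
base = torch.randn(128, 64)
tuned = base + 0.01 * torch.randn(128, 64)
lora_A, lora_B = extract_lora(base, tuned, rank=16)
print(torch.dist(tuned - base, lora_B @ lora_A))            # reconstruction error
```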
