
Commit c063247

Merge branch 'main' into reuse-attn-mixin
2 parents b3dc1b9 + 5a47442 commit c063247


78 files changed: +4195 additions, -551 deletions


.github/workflows/push_tests.yml

Lines changed: 3 additions & 0 deletions
@@ -76,6 +76,7 @@ jobs:
       run: |
         uv pip install -e ".[quality]"
         uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
+        uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
     - name: Environment
       run: |
         python utils/print_env.py
@@ -127,6 +128,7 @@ jobs:
         uv pip install -e ".[quality]"
         uv pip install peft@git+https://github.com/huggingface/peft.git
         uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
+        uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git

     - name: Environment
       run: |
@@ -178,6 +180,7 @@ jobs:
     - name: Install dependencies
       run: |
         uv pip install -e ".[quality,training]"
+        uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
     - name: Environment
       run: |
         python utils/print_env.py
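
The added step pins these CI jobs to transformers built from its main branch (a prerelease version). As a hedged sanity check of what that install produces at runtime (the version string below is illustrative, not taken from this commit):

```py
# After the install step, a git build of transformers reports a dev version.
import transformers

print(transformers.__version__)  # e.g. "4.58.0.dev0" for a main-branch build (illustrative value)
```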

docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -329,6 +329,8 @@
       title: BriaTransformer2DModel
     - local: api/models/chroma_transformer
       title: ChromaTransformer2DModel
+    - local: api/models/chronoedit_transformer_3d
+      title: ChronoEditTransformer3DModel
     - local: api/models/cogvideox_transformer3d
       title: CogVideoXTransformer3DModel
     - local: api/models/cogview3plus_transformer2d
@@ -628,6 +630,8 @@
   - sections:
     - local: api/pipelines/allegro
       title: Allegro
+    - local: api/pipelines/chronoedit
+      title: ChronoEdit
     - local: api/pipelines/cogvideox
       title: CogVideoX
     - local: api/pipelines/consisid
docs/source/en/api/models/chronoedit_transformer_3d.md

Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ChronoEditTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, and Huan Ling from NVIDIA and the University of Toronto.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import ChronoEditTransformer3DModel

transformer = ChronoEditTransformer3DModel.from_pretrained("nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## ChronoEditTransformer3DModel

[[autodoc]] ChronoEditTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
docs/source/en/api/pipelines/chronoedit.md

Lines changed: 156 additions & 0 deletions

@@ -0,0 +1,156 @@
<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

<div style="float: right;">
  <div class="flex flex-wrap space-x-1">
    <a href="https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference" target="_blank" rel="noopener">
      <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
    </a>
  </div>
</div>

# ChronoEdit

[ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, and Huan Ling, from NVIDIA and the University of Toronto.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

*Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: [this https URL](https://research.nvidia.com/labs/toronto-ai/chronoedit).*

The ChronoEdit pipeline is developed by the ChronoEdit Team. The original code is available on [GitHub](https://github.com/nv-tlabs/ChronoEdit), and pretrained models can be found in the [nvidia/ChronoEdit](https://huggingface.co/collections/nvidia/chronoedit) collection on Hugging Face.

### Image Editing

```py
import torch
import numpy as np
from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
from PIL import Image

model_id = "nvidia/ChronoEdit-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
print("width", width, "height", height)
image = image.resize((width, height))
prompt = (
    "The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. "
    "The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood."
)

output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=5,
    num_inference_steps=50,
    guidance_scale=5.0,
    enable_temporal_reasoning=False,
    num_temporal_reasoning_steps=0,
).frames[0]
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```

Optionally, enable **temporal reasoning** for improved physical consistency:
```py
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=29,
    num_inference_steps=50,
    guidance_scale=5.0,
    enable_temporal_reasoning=True,
    num_temporal_reasoning_steps=50,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```
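
The 14B checkpoint with 29 frames at this resolution is demanding on GPU memory. If the pipeline does not fit on a single GPU, a minimal sketch of one option, assuming `ChronoEditPipeline` supports the standard Diffusers offloading hooks, is to swap `pipe.to("cuda")` for model CPU offload:

```py
# Sketch: keep sub-models on the CPU and move each one to the GPU only for its forward pass.
# Assumes ChronoEditPipeline inherits the standard DiffusionPipeline offloading support.
pipe.enable_model_cpu_offload()  # use instead of pipe.to("cuda")
```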

### Inference with the 8-Step Distillation LoRA

```py
import torch
import numpy as np
from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image
from huggingface_hub import hf_hub_download
from transformers import CLIPVisionModel
from PIL import Image

model_id = "nvidia/ChronoEdit-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
lora_path = hf_hub_download(repo_id=model_id, filename="lora/chronoedit_distill_lora.safetensors")
pipe.load_lora_weights(lora_path)
pipe.fuse_lora(lora_scale=1.0)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=2.0)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
print("width", width, "height", height)
image = image.resize((width, height))
prompt = (
    "The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. "
    "The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood."
)

output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=5,
    num_inference_steps=8,
    guidance_scale=1.0,
    enable_temporal_reasoning=False,
    num_temporal_reasoning_steps=0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```

## ChronoEditPipeline

[[autodoc]] ChronoEditPipeline
  - all
  - __call__

## ChronoEditPipelineOutput

[[autodoc]] pipelines.chronoedit.pipeline_output.ChronoEditPipelineOutput

docs/source/en/modular_diffusers/loop_sequential_pipeline_blocks.md

Lines changed: 2 additions & 3 deletions
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.

 # LoopSequentialPipelineBlocks

-[`~modular_pipelines.LoopSequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a loop. Data flows circularly, using `intermediate_inputs` and `intermediate_outputs`, and each block is run iteratively. This is typically used to create a denoising loop which is iterative by default.
+[`~modular_pipelines.LoopSequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a loop. Data flows circularly, using `inputs` and `intermediate_outputs`, and each block is run iteratively. This is typically used to create a denoising loop which is iterative by default.

 This guide shows you how to create [`~modular_pipelines.LoopSequentialPipelineBlocks`].

@@ -21,7 +21,6 @@ This guide shows you how to create [`~modular_pipelines.LoopSequentialPipelineBl
 [`~modular_pipelines.LoopSequentialPipelineBlocks`], is also known as the *loop wrapper* because it defines the loop structure, iteration variables, and configuration. Within the loop wrapper, you need the following variables.

 - `loop_inputs` are user provided values and equivalent to [`~modular_pipelines.ModularPipelineBlocks.inputs`].
-- `loop_intermediate_inputs` are intermediate variables from the [`~modular_pipelines.PipelineState`] and equivalent to [`~modular_pipelines.ModularPipelineBlocks.intermediate_inputs`].
 - `loop_intermediate_outputs` are new intermediate variables created by the block and added to the [`~modular_pipelines.PipelineState`]. It is equivalent to [`~modular_pipelines.ModularPipelineBlocks.intermediate_outputs`].
 - `__call__` method defines the loop structure and iteration logic.

@@ -90,4 +89,4 @@ Add more loop blocks to run within each iteration with [`~modular_pipelines.Loop

 ```py
 loop = LoopWrapper.from_blocks_dict({"block1": LoopBlock(), "block2": LoopBlock})
-```
+```

docs/source/en/modular_diffusers/pipeline_block.md

Lines changed: 5 additions & 15 deletions
@@ -37,17 +37,7 @@ A [`~modular_pipelines.ModularPipelineBlocks`] requires `inputs`, and `intermedi
 ]
 ```

-- `intermediate_inputs` are values typically created from a previous block but it can also be directly provided if no preceding block generates them. Unlike `inputs`, `intermediate_inputs` can be modified.
-
-  Use `InputParam` to define `intermediate_inputs`.
-
-  ```py
-  user_intermediate_inputs = [
-      InputParam(name="processed_image", type_hint="torch.Tensor", description="image that has been preprocessed and normalized"),
-  ]
-  ```
-
-- `intermediate_outputs` are new values created by a block and added to the [`~modular_pipelines.PipelineState`]. The `intermediate_outputs` are available as `intermediate_inputs` for subsequent blocks or available as the final output from running the pipeline.
+- `intermediate_outputs` are new values created by a block and added to the [`~modular_pipelines.PipelineState`]. The `intermediate_outputs` are available as `inputs` for subsequent blocks or available as the final output from running the pipeline.

   Use `OutputParam` to define `intermediate_outputs`.

@@ -65,8 +55,8 @@ The intermediate inputs and outputs share data to connect blocks. They are acces

 The computation a block performs is defined in the `__call__` method and it follows a specific structure.

-1. Retrieve the [`~modular_pipelines.BlockState`] to get a local view of the `inputs` and `intermediate_inputs`.
-2. Implement the computation logic on the `inputs` and `intermediate_inputs`.
+1. Retrieve the [`~modular_pipelines.BlockState`] to get a local view of the `inputs`
+2. Implement the computation logic on the `inputs`.
 3. Update [`~modular_pipelines.PipelineState`] to push changes from the local [`~modular_pipelines.BlockState`] back to the global [`~modular_pipelines.PipelineState`].
 4. Return the components and state which becomes available to the next block.

@@ -76,7 +66,7 @@ def __call__(self, components, state):
     block_state = self.get_block_state(state)

     # Your computation logic here
-    # block_state contains all your inputs and intermediate_inputs
+    # block_state contains all your inputs
     # Access them like: block_state.image, block_state.processed_image

     # Update the pipeline state with your updated block_states
@@ -112,4 +102,4 @@ def __call__(self, components, state):
     unet = components.unet
     vae = components.vae
     scheduler = components.scheduler
-```
+```
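
For orientation while reading the hunks above, here is a rough sketch of a block that follows the four-step `__call__` structure the doc describes. The block name and the normalization logic are invented for illustration, and `set_block_state` is assumed to be the update helper implied by the "Update the pipeline state" step; attribute-vs-property details are not verified against the `modular_pipelines` module:

```py
from diffusers.modular_pipelines import InputParam, ModularPipelineBlocks, OutputParam

class NormalizeImageBlock(ModularPipelineBlocks):  # hypothetical example block
    @property
    def inputs(self):
        return [InputParam(name="image", type_hint="torch.Tensor", description="raw image tensor")]

    @property
    def intermediate_outputs(self):
        return [OutputParam(name="processed_image", type_hint="torch.Tensor", description="image scaled to [-1, 1]")]

    def __call__(self, components, state):
        block_state = self.get_block_state(state)                     # 1. local view of the inputs
        block_state.processed_image = block_state.image * 2.0 - 1.0   # 2. computation logic
        self.set_block_state(state, block_state)                      # 3. push changes back to PipelineState (assumed helper)
        return components, state                                      # 4. hand off to the next block
```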

docs/source/en/modular_diffusers/quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -183,7 +183,7 @@ from diffusers.modular_pipelines import ComponentsManager
 components = ComponentManager()

 dd_pipeline = dd_blocks.init_pipeline("YiYiXu/modular-demo-auto", components_manager=components, collection="diffdiff")
-dd_pipeline.load_default_componenets(torch_dtype=torch.float16)
+dd_pipeline.load_componenets(torch_dtype=torch.float16)
 dd_pipeline.to("cuda")
 ```

docs/source/en/modular_diffusers/sequential_pipeline_blocks.md

Lines changed: 3 additions & 3 deletions
@@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License.

 # SequentialPipelineBlocks

-[`~modular_pipelines.SequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a sequence. Data flows linearly from one block to the next using `intermediate_inputs` and `intermediate_outputs`. Each block in [`~modular_pipelines.SequentialPipelineBlocks`] usually represents a step in the pipeline, and by combining them, you gradually build a pipeline.
+[`~modular_pipelines.SequentialPipelineBlocks`] are a multi-block type that composes other [`~modular_pipelines.ModularPipelineBlocks`] together in a sequence. Data flows linearly from one block to the next using `inputs` and `intermediate_outputs`. Each block in [`~modular_pipelines.SequentialPipelineBlocks`] usually represents a step in the pipeline, and by combining them, you gradually build a pipeline.

 This guide shows you how to connect two blocks into a [`~modular_pipelines.SequentialPipelineBlocks`].

-Create two [`~modular_pipelines.ModularPipelineBlocks`]. The first block, `InputBlock`, outputs a `batch_size` value and the second block, `ImageEncoderBlock` uses `batch_size` as `intermediate_inputs`.
+Create two [`~modular_pipelines.ModularPipelineBlocks`]. The first block, `InputBlock`, outputs a `batch_size` value and the second block, `ImageEncoderBlock` uses `batch_size` as `inputs`.

 <hfoptions id="sequential">
 <hfoption id="InputBlock">
@@ -110,4 +110,4 @@ Inspect the sub-blocks in [`~modular_pipelines.SequentialPipelineBlocks`] by cal
 ```py
 print(blocks)
 print(blocks.doc)
-```
+```
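
To make the linear data flow concrete, a short sketch of composing the two example blocks, assuming `SequentialPipelineBlocks` exposes the same `from_blocks_dict` constructor that the loop guide shows for its wrapper; `batch_size` produced by the first block becomes an input of the second:

```py
from diffusers.modular_pipelines import SequentialPipelineBlocks

# InputBlock and ImageEncoderBlock are the example blocks defined in this guide.
blocks = SequentialPipelineBlocks.from_blocks_dict({
    "input": InputBlock(),                 # produces `batch_size` as an intermediate output
    "image_encoder": ImageEncoderBlock(),  # consumes `batch_size` as an input
})
print(blocks)      # how the blocks connect
print(blocks.doc)  # auto-generated documentation of inputs and outputs
```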

examples/community/img2img_inpainting.py

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ def check_size(image, height, width):
         raise ValueError(f"Image size should be {height}x{width}, but got {h}x{w}")


-def overlay_inner_image(image, inner_image, paste_offset: Tuple[int] = (0, 0)):
+def overlay_inner_image(image, inner_image, paste_offset: Tuple[int, ...] = (0, 0)):
     inner_image = inner_image.convert("RGBA")
     image = image.convert("RGB")
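For context on the one-line change above: `Tuple[int]` is the type of a tuple containing exactly one `int`, so the default value `(0, 0)` never matched the old annotation, while `Tuple[int, ...]` describes a homogeneous tuple of `int` of any length. A minimal illustration with hypothetical function names:

```py
from typing import Tuple

def before(paste_offset: Tuple[int] = (0, 0)) -> None:      # a type checker flags this: (0, 0) is a 2-tuple
    ...

def after(paste_offset: Tuple[int, ...] = (0, 0)) -> None:  # accepted: any number of ints, including two
    ...
```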