
Commit 25aea7d (parent 4c64987)

update


docs/source/en/api/pipelines/mochi.md

Lines changed: 87 additions & 1 deletion
@@ -73,6 +73,12 @@ export_to_video(frames, "mochi.mp4", fps=30)

The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. To run inference in the same way as the original implementation, please refer to the following example.

<Tip>
The original Mochi implementation zeros out empty prompts. However, enabling this option (`force_zeros_for_empty_prompt`) and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder.

When enabling `force_zeros_for_empty_prompt`, it is recommended to run the text encoding step outside the autocast context in full precision.
</Tip>
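As a condensed sketch of that recommendation (the full example below also covers decoding the latents; `MochiPipeline.encode_prompt` and its return values follow the current diffusers API and may change), the prompt can be encoded outside autocast and only the denoising loop placed under `torch.autocast`:

```python
import torch
from diffusers import MochiPipeline

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", force_zeros_for_empty_prompt=True)
pipe.enable_model_cpu_offload()

prompt = "Close-up of a chameleon's eye, with its scaly skin changing color."

# Encode the prompt outside of autocast so the T5 text encoder runs in full
# precision and does not overflow when empty prompts are zeroed out.
with torch.no_grad():
    prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = (
        pipe.encode_prompt(prompt=prompt)
    )

# Only the denoising loop runs under bfloat16 autocast, matching the precision
# the original implementation uses for the DiT.
with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
    frames = pipe(
        prompt_embeds=prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_embeds=negative_prompt_embeds,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        num_frames=85,
    ).frames[0]
```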
<Tip>
Decoding the latents in full precision is very memory intensive. You will need at least 70GB VRAM to generate the 163 frames
in this example. To reduce memory, either reduce the number of frames or run the decoding step in `torch.bfloat16`
@@ -86,7 +92,7 @@ from diffusers import MochiPipeline
from diffusers.utils import export_to_video
from diffusers.video_processor import VideoProcessor

-pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
+pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", force_zeros_for_empty_prompt=True)
pipe.enable_vae_tiling()
pipe.enable_model_cpu_offload()

@@ -135,6 +141,86 @@ video = video_processor.postprocess_video(video)[0]
export_to_video(video, "mochi.mp4", fps=30)
```

## Running inference with multiple GPUs

It is possible to split the large Mochi transformer across multiple GPUs using the `device_map` and `max_memory` options in `from_pretrained`. In the following example, we split the model across two GPUs, each with 24GB of VRAM.

```python
import torch
from diffusers import MochiPipeline, MochiTransformer3DModel
from diffusers.utils import export_to_video

model_id = "genmo/mochi-1-preview"

# Shard the transformer across both GPUs, capping the memory used on each device
transformer = MochiTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "24GB", 1: "24GB"}
)

pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

# Run the denoising loop under bfloat16 autocast
with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False):
    frames = pipe(
        prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
        negative_prompt="",
        height=480,
        width=848,
        num_frames=85,
        num_inference_steps=50,
        guidance_scale=4.5,
        num_videos_per_prompt=1,
        generator=torch.Generator(device="cuda").manual_seed(0),
        max_sequence_length=256,
        output_type="pil",
    ).frames[0]

export_to_video(frames, "output.mp4", fps=30)
```
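To check how the transformer was actually split, you can inspect the device placement chosen by Accelerate after loading. This is a minimal sketch and assumes the model exposes the `hf_device_map` attribute that Accelerate-dispatched models typically carry:

```python
# Inspect which transformer modules were placed on which GPU (assumes the
# `hf_device_map` attribute is populated when loading with `device_map`).
print(transformer.hf_device_map)
```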
## Using single file loading with the Mochi Transformer
You can use `from_single_file` to load the Mochi transformer in its original format.

<Tip>
Diffusers currently doesn't support using the FP8 scaled versions of the Mochi single file checkpoints.
</Tip>

```python
import torch
from diffusers import MochiPipeline, MochiTransformer3DModel
from diffusers.utils import export_to_video

model_id = "genmo/mochi-1-preview"

ckpt_path = "https://huggingface.co/Comfy-Org/mochi_preview_repackaged/blob/main/split_files/diffusion_models/mochi_preview_bf16.safetensors"

# Load the transformer directly from the original single-file checkpoint
transformer = MochiTransformer3DModel.from_single_file(ckpt_path, torch_dtype=torch.bfloat16)

pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False):
    frames = pipe(
        prompt="Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k.",
        negative_prompt="",
        height=480,
        width=848,
        num_frames=85,
        num_inference_steps=50,
        guidance_scale=4.5,
        num_videos_per_prompt=1,
        generator=torch.Generator(device="cuda").manual_seed(0),
        max_sequence_length=256,
        output_type="pil",
    ).frames[0]

export_to_video(frames, "output.mp4", fps=30)
```
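Loading with `from_single_file` converts the checkpoint into the diffusers format at load time. If you reuse these weights often, one option (a sketch, not part of the original example; the local path is only an illustration) is to save the converted transformer once and reload it with `from_pretrained` afterwards:

```python
# Cache the converted transformer in the diffusers layout (example path).
transformer.save_pretrained("./mochi-transformer-bf16")

# Later runs can then load the cached copy without re-parsing the
# single-file checkpoint.
transformer = MochiTransformer3DModel.from_pretrained(
    "./mochi-transformer-bf16", torch_dtype=torch.bfloat16
)
```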
## MochiPipeline

[[autodoc]] MochiPipeline
