
Commit 4c64987

committed
update
1 parent 9519ffc commit 4c64987

1 file changed

+69
-3
lines changed

docs/source/en/api/pipelines/mochi.md

Lines changed: 69 additions & 3 deletions
@@ -27,7 +27,7 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m
2727

2828
## Generating videos with Mochi-1 Preview
2929

30-
The following example will download the full precision `mochi-1-preview` weights and produce the highest quality results but will require at least 42GB VRAM to run.
30+
The following example will download the full precision `mochi-1-preview` weights and produce the highest quality results but will require at least 42GB VRAM to run.
3131

3232
```python
3333
import torch
@@ -43,7 +43,7 @@ pipe.enable_vae_tiling()
4343
prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
4444

4545
with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
46-
    frames = pipe(prompt, num_frames=84).frames[0]
46+
    frames = pipe(prompt, num_frames=85).frames[0]
4747

4848
export_to_video(frames, "mochi.mp4", fps=30)
4949
```
@@ -64,11 +64,77 @@ pipe.enable_model_cpu_offload()
6464
pipe.enable_vae_tiling()
6565

6666
prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
67-
frames = pipe(prompt, num_frames=84).frames[0]
67+
frames = pipe(prompt, num_frames=85).frames[0]
6868

6969
export_to_video(frames, "mochi.mp4", fps=30)
7070
```
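
The change above moves `num_frames` from 84 to 85. The commit message does not say why; one plausible reading, stated here as an assumption, is that the Mochi VAE compresses the time axis by a factor of 6, so frame counts of the form `6k + 1` (85, 163) map exactly onto latent frames. The sketch below uses hypothetical helper names that are not part of diffusers, and the factor of 6 is an assumed constant.

```python
# Assumed temporal compression factor of the Mochi VAE (not stated in this diff).
TEMPORAL_FACTOR = 6


def is_aligned(num_frames: int, temporal_factor: int = TEMPORAL_FACTOR) -> bool:
    """Return True when num_frames has the form temporal_factor * k + 1."""
    return (num_frames - 1) % temporal_factor == 0


def latent_frame_count(num_frames: int, temporal_factor: int = TEMPORAL_FACTOR) -> int:
    """Number of latent frames, assuming the first frame is kept and the rest are compressed."""
    return (num_frames - 1) // temporal_factor + 1


def round_up_to_aligned(num_frames: int, temporal_factor: int = TEMPORAL_FACTOR) -> int:
    """Smallest aligned frame count that is greater than or equal to num_frames."""
    remainder = (num_frames - 1) % temporal_factor
    if remainder == 0:
        return num_frames
    return num_frames + (temporal_factor - remainder)


if __name__ == "__main__":
    for n in (84, 85, 163):
        print(n, is_aligned(n), latent_frame_count(n), round_up_to_aligned(n))
```

Under this assumption, 84 is not aligned while 85 and 163 are, which is consistent with the values used in the examples.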
7171

72+
## Reproducing the results from the Genmo Mochi repo
73+
74+
The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses a different precision for each stage of the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting a different `dtype` for each stage of the pipeline. To run inference the same way as the original implementation, refer to the following example.
75+
76+
<Tip>
77+
Decoding the latents in full precision is very memory intensive. You will need at least 70GB VRAM to generate the 163 frames
78+
in this example. To reduce memory usage, either reduce the number of frames or run the decoding step in `torch.bfloat16`.
79+
</Tip>
80+
81+
```python
82+
import torch
83+
from torch.nn.attention import SDPBackend, sdpa_kernel
84+
85+
from diffusers import MochiPipeline
86+
from diffusers.utils import export_to_video
87+
from diffusers.video_processor import VideoProcessor
88+
89+
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
90+
pipe.enable_vae_tiling()
91+
pipe.enable_model_cpu_offload()
92+
93+
prompt = "An aerial shot of a parade of elephants walking across the African savannah. The camera showcases the herd and the surrounding landscape."
94+
95+
with torch.no_grad():
96+
    prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = (
97+
        pipe.encode_prompt(prompt=prompt)
98+
    )
99+
100+
with torch.autocast("cuda", torch.bfloat16):
101+
    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
102+
        frames = pipe(
103+
            prompt_embeds=prompt_embeds,
104+
            prompt_attention_mask=prompt_attention_mask,
105+
            negative_prompt_embeds=negative_prompt_embeds,
106+
            negative_prompt_attention_mask=negative_prompt_attention_mask,
107+
            guidance_scale=4.5,
108+
            num_inference_steps=64,
109+
            height=480,
110+
            width=848,
111+
            num_frames=163,
112+
            generator=torch.Generator("cuda").manual_seed(0),
113+
            output_type="latent",
114+
            return_dict=False,
115+
        )[0]
116+
117+
video_processor = VideoProcessor(vae_scale_factor=8)
118+
has_latents_mean = hasattr(pipe.vae.config, "latents_mean") and pipe.vae.config.latents_mean is not None
119+
has_latents_std = hasattr(pipe.vae.config, "latents_std") and pipe.vae.config.latents_std is not None
120+
if has_latents_mean and has_latents_std:
121+
    latents_mean = (
122+
        torch.tensor(pipe.vae.config.latents_mean).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype)
123+
    )
124+
    latents_std = (
125+
        torch.tensor(pipe.vae.config.latents_std).view(1, 12, 1, 1, 1).to(frames.device, frames.dtype)
126+
    )
127+
    frames = frames * latents_std / pipe.vae.config.scaling_factor + latents_mean
128+
else:
129+
    frames = frames / pipe.vae.config.scaling_factor
130+
131+
with torch.no_grad():
132+
    video = pipe.vae.decode(frames.to(pipe.vae.dtype), return_dict=False)[0]
133+
134+
video = video_processor.postprocess_video(video)[0]
135+
export_to_video(video, "mochi.mp4", fps=30)
136+
```
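
Before decoding, the example rescales the latents with `frames * latents_std / scaling_factor + latents_mean`. As a minimal sketch of that arithmetic only, the snippet below applies the same formula per channel to plain Python floats and checks it against its algebraic inverse. The function names and the toy statistics are hypothetical and chosen for illustration; they are not values from the Mochi VAE config.

```python
def denormalize(value: float, mean: float, std: float, scaling_factor: float) -> float:
    """Same formula as the example: value * std / scaling_factor + mean."""
    return value * std / scaling_factor + mean


def normalize(value: float, mean: float, std: float, scaling_factor: float) -> float:
    """Algebraic inverse of denormalize, shown only to check the round trip."""
    return (value - mean) * scaling_factor / std


def denormalize_channels(values, means, stds, scaling_factor):
    """Apply denormalize channel by channel, mirroring the (1, C, 1, 1, 1) broadcast."""
    return [denormalize(v, m, s, scaling_factor) for v, m, s in zip(values, means, stds)]


if __name__ == "__main__":
    # Toy per-channel statistics (made up for illustration).
    means = [0.5, -1.0, 0.0]
    stds = [2.0, 0.5, 1.0]
    latents = [2.0, 4.0, -3.0]
    print(denormalize_channels(latents, means, stds, scaling_factor=1.0))
```

When the VAE config carries no `latents_mean` or `latents_std`, the example falls back to dividing by `scaling_factor` alone, which is the same formula with a mean of 0 and a standard deviation of 1.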
137+
72138
## MochiPipeline
73139

74140
[[autodoc]] MochiPipeline
