The [Genmo Mochi implementation](https://github.com/genmoai/mochi/tree/main) uses different precision values for each stage in the inference process. The text encoder and VAE use `torch.float32`, while the DiT uses `torch.bfloat16` with the [attention kernel](https://pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) set to `EFFICIENT_ATTENTION`. Diffusers pipelines currently do not support setting different `dtypes` for different stages of the pipeline. To run inference in the same way as the original implementation, please refer to the following example.

<Tip>

The original Mochi implementation zeros out empty prompts. However, enabling this option (`force_zeros_for_empty_prompt`) and placing the entire pipeline under autocast can lead to numerical overflows with the T5 text encoder.

When enabling `force_zeros_for_empty_prompt`, it is recommended to run the text encoding step outside the autocast context in full precision, as in the sketch after this tip.

</Tip>
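
For example, a minimal sketch of that split might look like the following. The checkpoint id and prompt are illustrative, enabling `force_zeros_for_empty_prompt` through `from_pretrained` is assumed here, and `MochiPipeline.encode_prompt` is used to produce the embeddings and attention masks that are later passed to the pipeline call.

```python
import torch
from diffusers import MochiPipeline

# Weights load in torch.float32 by default, matching the precision the original
# implementation uses for the text encoder.
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", force_zeros_for_empty_prompt=True)
pipe.enable_model_cpu_offload()

prompt = "A slow-motion close-up of ocean waves rolling onto a beach at sunset."  # illustrative prompt

# Encode the prompt with the T5 text encoder in full precision, outside of any autocast context.
with torch.no_grad():
    prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask = (
        pipe.encode_prompt(prompt=prompt)
    )
```

The denoising step can then run under `torch.autocast` with these precomputed embeddings, as sketched further below.
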
<Tip>

Decoding the latents in full precision is very memory intensive. You will need at least 70GB VRAM to generate the 163 frames in this example. To reduce memory, either reduce the number of frames or run the decoding step in `torch.bfloat16`.

</Tip>

```python
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
from diffusers.video_processor import VideoProcessor

# ... (pipeline setup, full-precision prompt encoding, denoising, and latent decoding) ...

video = video_processor.postprocess_video(video)[0]
export_to_video(video, "mochi.mp4", fps=30)
```
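
Building on the prompt embeddings computed in the sketch after the first tip, a rough outline of the remaining stages could look like this. It assumes the `pipe`, `prompt_embeds`, `prompt_attention_mask`, `negative_prompt_embeds`, and `negative_prompt_attention_mask` variables from that sketch; the resolution, frame count, step count, and guidance scale are illustrative. The DiT runs under `torch.bfloat16` autocast with the efficient attention kernel, and because decoding also happens inside the pipeline call, it runs in `torch.bfloat16` as well, trading some fidelity for the lower memory footprint mentioned in the tip above (the original implementation decodes in `torch.float32`).

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel  # available in recent PyTorch releases

from diffusers.utils import export_to_video

pipe.enable_vae_tiling()  # optional: further reduces the memory needed for decoding

# Denoise (and decode) with the DiT under bfloat16 autocast and the efficient attention kernel.
with torch.autocast("cuda", torch.bfloat16), sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    frames = pipe(
        prompt_embeds=prompt_embeds,
        prompt_attention_mask=prompt_attention_mask,
        negative_prompt_embeds=negative_prompt_embeds,
        negative_prompt_attention_mask=negative_prompt_attention_mask,
        height=480,
        width=848,
        num_frames=163,
        num_inference_steps=64,
        guidance_scale=4.5,
    ).frames[0]

export_to_video(frames, "mochi_sketch.mp4", fps=30)
```
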
## Running inference with multiple GPUs

It is possible to split the large Mochi transformer across multiple GPUs using the `device_map` and `max_memory` options in `from_pretrained`. In the following example, we split the model across two GPUs, each with 24GB of VRAM.

```python
import torch
from diffusers import MochiPipeline, MochiTransformer3DModel
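from diffusers.utils import export_to_video

# Everything below is a sketch rather than a verbatim recipe: the checkpoint id, prompt,
# and generation settings are illustrative. The `max_memory` budget shards the transformer
# across two GPUs with 24GB of VRAM each, as described above; adjust it to your hardware.
model_id = "genmo/mochi-1-preview"
transformer = MochiTransformer3DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "24GB", 1: "24GB"},
)

pipe = MochiPipeline.from_pretrained(model_id, transformer=transformer)
pipe.enable_model_cpu_offload()
pipe.enable_vae_tiling()

with torch.autocast(device_type="cuda", dtype=torch.bfloat16, cache_enabled=False):
    frames = pipe(
        prompt="Close-up of a chameleon's eye, with its scaly skin changing color.",
        num_frames=85,
        num_inference_steps=50,
        guidance_scale=4.5,
    ).frames[0]

export_to_video(frames, "mochi.mp4", fps=30)
```

The GPU indices and memory limits in `max_memory` should match the devices actually available on your machine.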