Commit 5f43c6a

docs

1 parent 2798ed4 commit 5f43c6a

4 files changed: +134 −2 lines changed


docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -302,6 +302,8 @@
      title: AutoencoderKL
    - local: api/models/autoencoderkl_cogvideox
      title: AutoencoderKLCogVideoX
+   - local: api/models/autoencoderkl_mochi
+     title: AutoencoderKLMochi
    - local: api/models/asymmetricautoencoderkl
      title: AsymmetricAutoencoderKL
    - local: api/models/consistency_decoder_vae
@@ -394,6 +396,8 @@
      title: Lumina-T2X
    - local: api/pipelines/marigold
      title: Marigold
+   - local: api/pipelines/mochi
+     title: Mochi
    - local: api/pipelines/panorama
      title: MultiDiffusion
    - local: api/pipelines/musicldm
docs/source/en/api/models/autoencoderkl_mochi.md

Lines changed: 33 additions & 0 deletions
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLMochi

The 3D variational autoencoder (VAE) model with KL loss used in [Mochi](https://github.com/genmoai/models) was introduced in [Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) by Genmo.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import AutoencoderKLMochi

vae = AutoencoderKLMochi.from_pretrained("genmo/mochi-1-preview", subfolder="vae", torch_dtype=torch.float32).to("cuda")
```
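
For reference, a minimal encode/decode round trip might look like the sketch below. It assumes the standard diffusers VAE conventions (`encode(...).latent_dist`, `decode(...).sample`); the input tensor shape and frame count are purely illustrative, not Mochi requirements.

```python
import torch

from diffusers import AutoencoderKLMochi

vae = AutoencoderKLMochi.from_pretrained(
    "genmo/mochi-1-preview", subfolder="vae", torch_dtype=torch.float32
).to("cuda")

# Illustrative input: a short clip as a (batch, channels, frames, height, width) tensor.
video = torch.randn(1, 3, 7, 480, 848, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # compress to the latent space
    reconstruction = vae.decode(latents).sample       # map latents back to pixel space
```
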
## AutoencoderKLMochi

[[autodoc]] AutoencoderKLMochi
  - decode
  - encode
  - all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
docs/source/en/api/pipelines/mochi.md

Lines changed: 36 additions & 0 deletions
<!-- Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# Mochi

[Mochi 1 Preview](https://huggingface.co/genmo/mochi-1-preview) is a video generation model from Genmo.

*Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. The model is released under a permissive Apache 2.0 license.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
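
A minimal text-to-video sketch follows the usual diffusers pipeline workflow (`from_pretrained`, CPU offloading, `export_to_video`); the prompt, precision, frame count, and output path shown here are illustrative choices rather than requirements of the pipeline.

```python
import torch

from diffusers import MochiPipeline
from diffusers.utils import export_to_video

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keep peak VRAM manageable by offloading idle components

prompt = "A close-up of a chameleon walking along a mossy branch, shallow depth of field"
video = pipe(prompt, num_frames=84).frames[0]

export_to_video(video, "mochi.mp4", fps=30)
```
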
## MochiPipeline

[[autodoc]] MochiPipeline
  - all
  - __call__

## MochiPipelineOutput

[[autodoc]] pipelines.mochi.pipeline_output.MochiPipelineOutput

src/diffusers/models/transformers/transformer_mochi.py

Lines changed: 61 additions & 2 deletions
@@ -34,6 +34,26 @@

@maybe_allow_in_graph
class MochiTransformerBlock(nn.Module):
+    r"""
+    Transformer block used in [Mochi](https://huggingface.co/genmo/mochi-1-preview).
+
+    Args:
+        dim (`int`):
+            The number of channels in the input and output.
+        num_attention_heads (`int`):
+            The number of heads to use for multi-head attention.
+        attention_head_dim (`int`):
+            The number of channels in each head.
+        qk_norm (`str`, defaults to `"rms_norm"`):
+            The normalization layer to use.
+        activation_fn (`str`, defaults to `"swiglu"`):
+            Activation function to use in feed-forward.
+        context_pre_only (`bool`, defaults to `False`):
+            Whether or not to process context-related conditions with additional layers.
+        eps (`float`, defaults to `1e-6`):
+            Epsilon value for normalization layers.
+    """
+
    def __init__(
        self,
        dim: int,
@@ -42,7 +62,7 @@ def __init__(
        pooled_projection_dim: int,
        qk_norm: str = "rms_norm",
        activation_fn: str = "swiglu",
-        context_pre_only: bool = True,
+        context_pre_only: bool = False,
        eps: float = 1e-6,
    ) -> None:
        super().__init__()
@@ -82,6 +102,7 @@ def __init__(
            elementwise_affine=True,
        )

+        # TODO(aryan): norm_context layers are not needed when `context_pre_only` is True
        self.norm2 = RMSNorm(dim, eps=eps, elementwise_affine=False)
        self.norm2_context = RMSNorm(pooled_projection_dim, eps=eps, elementwise_affine=False)

@@ -145,7 +166,17 @@ def forward(


class MochiRoPE(nn.Module):
-    def __init__(self, base_height: int = 192, base_width: int = 192, theta: float = 10000.0) -> None:
+    r"""
+    RoPE implementation used in [Mochi](https://huggingface.co/genmo/mochi-1-preview).
+
+    Args:
+        base_height (`int`, defaults to `192`):
+            Base height used to compute interpolation scale for rotary positional embeddings.
+        base_width (`int`, defaults to `192`):
+            Base width used to compute interpolation scale for rotary positional embeddings.
+    """
+
+    def __init__(self, base_height: int = 192, base_width: int = 192) -> None:
        super().__init__()

        self.target_area = base_height * base_width
@@ -195,6 +226,34 @@ def forward(

@maybe_allow_in_graph
class MochiTransformer3DModel(ModelMixin, ConfigMixin):
+    r"""
+    A Transformer model for video-like data introduced in [Mochi](https://huggingface.co/genmo/mochi-1-preview).
+
+    Args:
+        patch_size (`int`, defaults to `2`):
+            The size of the patches to use in the patch embedding layer.
+        num_attention_heads (`int`, defaults to `24`):
+            The number of heads to use for multi-head attention.
+        attention_head_dim (`int`, defaults to `128`):
+            The number of channels in each head.
+        num_layers (`int`, defaults to `48`):
+            The number of layers of Transformer blocks to use.
+        in_channels (`int`, defaults to `12`):
+            The number of channels in the input.
+        out_channels (`int`, *optional*, defaults to `None`):
+            The number of channels in the output.
+        qk_norm (`str`, defaults to `"rms_norm"`):
+            The normalization layer to use.
+        text_embed_dim (`int`, defaults to `4096`):
+            Input dimension of text embeddings from the text encoder.
+        time_embed_dim (`int`, defaults to `256`):
+            Output dimension of timestep embeddings.
+        activation_fn (`str`, defaults to `"swiglu"`):
+            Activation function to use in feed-forward.
+        max_sequence_length (`int`, defaults to `256`):
+            The maximum sequence length of text embeddings supported.
+    """
+
    _supports_gradient_checkpointing = True

    @register_to_config
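
As a sketch of how the documented configuration surfaces in practice, the snippet below loads the transformer on its own; the `subfolder="transformer"` layout is an assumption that mirrors the `subfolder="vae"` convention in the VAE snippet earlier in this commit.

```python
import torch

from diffusers import MochiTransformer3DModel

# Assumed layout: the denoiser lives in a `transformer` subfolder of the checkpoint.
transformer = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.bfloat16
)

# `register_to_config` records the constructor arguments, so the documented
# defaults (patch_size=2, 24 heads x 128 dims, 48 layers, ...) are inspectable here.
print(transformer.config.patch_size, transformer.config.num_layers)
```
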
