# Hunyuanvideo15 #12696
---

<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLHunyuanVideo15

The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo1.5](https://github.com/Tencent/HunyuanVideo1-1.5) by Tencent.

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import AutoencoderKLHunyuanVideo15

vae = AutoencoderKLHunyuanVideo15.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="vae", torch_dtype=torch.float32
)

# make sure to enable tiling to avoid OOM
vae.enable_tiling()
```

## AutoencoderKLHunyuanVideo15

[[autodoc]] AutoencoderKLHunyuanVideo15
- decode
- encode
- all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
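Tiling helps because a 3D VAE decodes a latent tensor that is much smaller than the output video, and the decode step's activation memory scales with the tile size rather than the full frame. The sketch below estimates the latent dimensions for a clip; the compression factors (8x spatial, 4x temporal) and the latent channel count are illustrative assumptions, not values read from this model's config.

```python
def latent_shape(num_frames, height, width,
                 temporal_compression=4, spatial_compression=8, latent_channels=16):
    """Estimate the latent tensor shape (C, T, H, W) for a 3D video VAE.

    The compression factors and channel count are assumptions for illustration;
    read the real values from the loaded VAE's config.
    """
    # typical causal video VAEs keep the first frame and compress the rest
    latent_frames = (num_frames - 1) // temporal_compression + 1
    return (
        latent_channels,
        latent_frames,
        height // spatial_compression,
        width // spatial_compression,
    )

# a 61-frame 480x832 clip under these assumed factors
print(latent_shape(61, 480, 832))  # (16, 16, 60, 104)
```

In practice, query the loaded model (e.g. its config attributes) for the true compression factors instead of hard-coding them.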
---

<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# HunyuanVideo15Transformer3DModel

A Diffusion Transformer model for 3D video-like data used in [HunyuanVideo1.5](https://github.com/Tencent/HunyuanVideo1-1.5).

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import HunyuanVideo15Transformer3DModel

transformer = HunyuanVideo15Transformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v", subfolder="transformer", torch_dtype=torch.bfloat16
)
```

## HunyuanVideo15Transformer3DModel

[[autodoc]] HunyuanVideo15Transformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
---

<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# HunyuanVideo-1.5

HunyuanVideo-1.5 is a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models.
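A back-of-the-envelope estimate makes the "consumer-grade GPUs" claim concrete: at 8.3 billion parameters, the transformer's weights alone occupy roughly 15.5 GiB in bfloat16 (2 bytes per parameter). This sketch counts weights only and ignores activations, the VAE, and the text encoders, which is why offloading and tiling still matter below.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone (bfloat16 = 2 bytes/param).

    Ignores activations, optimizer state, and auxiliary models.
    """
    return num_params * bytes_per_param / 1024**3

# 8.3B parameters in bfloat16
print(f"{weight_memory_gib(8.3e9):.1f} GiB")  # ~15.5 GiB
```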
You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
>
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.

The example below demonstrates how to generate a video optimized for memory or inference speed.

<hfoptions id="usage">
<hfoption id="memory">

Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

```py
import torch
from diffusers import HunyuanVideo15Pipeline
from diffusers.utils import export_to_video

pipeline = HunyuanVideo15Pipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v",
    torch_dtype=torch.bfloat16,
)

# model offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```

</hfoption>
</hfoptions>
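The call above renders 61 frames and exports at 15 fps, so a small helper makes the frame-count/duration relationship explicit. The 4k+1 frame constraint used here is an assumption typical of temporally-compressed video VAEs, not a documented requirement of this pipeline; check the pipeline's docstring before relying on it.

```python
def clip_duration_seconds(num_frames, fps):
    """Duration of the exported clip in seconds."""
    return num_frames / fps

def nearest_valid_frame_count(num_frames, temporal_compression=4):
    """Round down to the nearest 4k+1 frame count (assumed VAE constraint)."""
    return ((num_frames - 1) // temporal_compression) * temporal_compression + 1

print(clip_duration_seconds(61, 15))     # about 4 seconds at 15 fps
print(nearest_valid_frame_count(64))     # snaps a requested count to 4k+1
```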

## Notes

- HunyuanVideo-1.5 uses attention masks with variable-length sequences. For best performance, we recommend using an attention backend that handles padding efficiently.

  - **H100/H800:** `_flash_3_hub` or `_flash_varlen_3`
  - **A100/A800/RTX 4090:** `flash` or `flash_varlen`
  - **Other GPUs:** `sage`

  Refer to the [Attention backends](../../optimization/attention_backends) guide for more details about using a different backend.

  ```py
  pipeline.transformer.set_attention_backend("flash_varlen")  # or your preferred backend
  ```
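The per-GPU recommendations above can be wrapped in a small helper that picks a backend string from the detected device name. The mapping mirrors the list above, but the name-matching heuristic is our own illustration; verify the backend identifiers against the attention backends guide before use.

```python
def pick_attention_backend(device_name: str) -> str:
    """Map a GPU name to a suggested attention backend (heuristic sketch).

    Follows the per-GPU recommendations in the notes above; falls back to "sage".
    """
    name = device_name.upper()
    if any(tag in name for tag in ("H100", "H800")):
        return "_flash_varlen_3"
    if any(tag in name for tag in ("A100", "A800", "4090")):
        return "flash_varlen"
    return "sage"

print(pick_attention_backend("NVIDIA H100 80GB HBM3"))    # _flash_varlen_3
print(pick_attention_backend("NVIDIA GeForce RTX 4090"))  # flash_varlen
```

With a CUDA device available, this could feed the pipeline directly, e.g. `pipeline.transformer.set_attention_backend(pick_attention_backend(torch.cuda.get_device_name()))`.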

## HunyuanVideo15Pipeline

[[autodoc]] HunyuanVideo15Pipeline
- all
- __call__

## HunyuanVideo15ImageToVideoPipeline

[[autodoc]] HunyuanVideo15ImageToVideoPipeline
- all
- __call__

## HunyuanVideo15PipelineOutput

[[autodoc]] pipelines.hunyuan_video1_5.pipeline_output.HunyuanVideo15PipelineOutput