Commit 8657340

more fixes
1 parent 8ca260d commit 8657340

4 files changed: +8 additions, -7 deletions

docs/source/en/api/models/autoencoder_kl_hunyuan_video.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ specific language governing permissions and limitations under the License. -->

# AutoencoderKLHunyuanVideo

-The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo](https://github.com/Tencent/HunyuanVideo/), which was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.
+The 3D variational autoencoder (VAE) model with KL loss used in [HunyuanVideo](https://github.com/hunyuanvideo-community/HunyuanVideo/), which was introduced in [HunyuanVideo: A Systematic Framework For Large Video Generative Models](https://huggingface.co/papers/2412.03603) by Tencent.

The model can be loaded with the following code snippet.
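The code snippet referenced in that last context line is not shown in this hunk. A minimal sketch of loading the VAE, assuming the `hunyuanvideo-community/HunyuanVideo` checkpoint layout with a `vae` subfolder, could look like this:

```py
import torch
from diffusers import AutoencoderKLHunyuanVideo

# Load the 3D causal VAE in float16, as the pipeline docs recommend.
# The "vae" subfolder name is an assumption based on the usual checkpoint layout.
vae = AutoencoderKLHunyuanVideo.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", subfolder="vae", torch_dtype=torch.float16
)
```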

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@

[HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.

-*Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/Tencent/HunyuanVideo).*
+*Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/hunyuanvideo-community/HunyuanVideo).*

<Tip>

@@ -30,7 +30,7 @@ Recommendations for inference:

- VAE should be in `torch.float16`.
- `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
- For smaller resolution videos, try lower values of `shift` (between `2.0` to `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution images, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo.
-- For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
+- For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/hunyuanvideo-community/HunyuanVideo/).

## Quantization
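Not part of this commit, but as a hedged illustration of the `shift` recommendation in the hunk above: the scheduler's shift can typically be overridden when rebuilding the scheduler from the pipeline's config. The checkpoint name and the value `3.0` are assumptions for the sketch.

```py
import torch
from diffusers import FlowMatchEulerDiscreteScheduler, HunyuanVideoPipeline

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo", torch_dtype=torch.float16
)
# Lower shift (e.g. 3.0) for smaller-resolution videos; the HunyuanVideo default is 7.0.
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=3.0
)
```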

docs/source/en/using-diffusers/text-img2vid.md

Lines changed: 3 additions & 3 deletions
@@ -70,18 +70,18 @@ export_to_video(video, "output.mp4", fps=8)

> [!TIP]
> HunyuanVideo is a 13B parameter model and requires a lot of memory. Refer to the HunyuanVideo [Quantization](../api/pipelines/hunyuan_video#quantization) guide to learn how to quantize the model. CogVideoX and LTX-Video are more lightweight options that can still generate high-quality videos.

-[HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo) features a dual-stream to single-stream diffusion transformer (DiT) for learning video and text tokens separately, and then subsequently concatenating the video and text tokens to combine their information. A single multimodal large language model (MLLM) serves as the text encoder, and videos are also spatio-temporally compressed with a 3D causal VAE.
+[HunyuanVideo](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) features a dual-stream to single-stream diffusion transformer (DiT) for learning video and text tokens separately, and then subsequently concatenating the video and text tokens to combine their information. A single multimodal large language model (MLLM) serves as the text encoder, and videos are also spatio-temporally compressed with a 3D causal VAE.

```py
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

transformer = HunyuanVideoTransformer3DModel.from_pretrained(
-    "tencent/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
+    "hunyuanvideo-community/HunyuanVideo", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
-    "tencent/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
+    "hunyuanvideo-community/HunyuanVideo", transformer=transformer, torch_dtype=torch.float16
)

# reduce memory requirements
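# (Not part of this commit's hunk, which is truncated here. A hedged sketch of how the
#  guide typically continues: VAE tiling, device placement, and the prompt/frame values
#  below are assumptions, not the guide's exact text.)
pipe.vae.enable_tiling()
pipe.to("cuda")

# num_frames follows the recommended 4 * k + 1 form, e.g. 61
video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    num_frames=61,
    num_inference_steps=30,
).frames[0]
export_to_video(video, "output.mp4", fps=15)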

src/diffusers/models/transformers/transformer_hunyuan_video.py

Lines changed: 2 additions & 1 deletion
@@ -504,7 +504,8 @@ def forward(


class HunyuanVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    r"""
-    A Transformer model for video-like data used in [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo).
+    A Transformer model for video-like data used in
+    [HunyuanVideo](https://huggingface.co/hunyuanvideo-community/HunyuanVideo).

    Args:
        in_channels (`int`, defaults to `16`):
