@stevhliu
Member

@stevhliu stevhliu commented Mar 18, 2025

🚧 WIP 🚧

Based on our discussions about making it easier to run video models by including some minimal code optimized for memory and inference speed, this PR refactors the model card (starting with CogVideoX, but eventually expanding to other models as well) to reflect that. This provides users with easy copy/paste code they can run.

A parallel effort is to improve the generic video generation guide.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@hlky
Contributor

hlky commented Apr 3, 2025

Looks good on first impression; I will review it in depth later today. I wanted to raise #10301 with you, as it will help simplify the examples (in combination with #11130 for the quantization cases). It would also be cool to have the examples be configurable and update with the selected options. To demonstrate, here's an artist's (4o) impression of what it could look like:

[Image: ChatGPT Image Apr 3, 2025, 06_51_57 AM]

Member

@sayakpaul sayakpaul left a comment

This is a very good start. Left some comments, let me know if they make sense.

|:---:|:---:|
| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | torch.bfloat16 |
| [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V) | torch.bfloat16 |
[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.
Member

This is okay, but I would perhaps tackle the removal of the abstract section in a separate PR. Also, this adds the extra overhead of coming up with a description for the paper. I would like to avoid that for now.

Member Author

I think it'd be good to also tackle this now, since for the new pipeline cards we want to have a nice and complete example of what it should look like, no?

Good point that adding a description of the paper adds extra overhead, but I think it's necessary, since we want to give users a version of the abstract that is more accessible (meaning it uses common everyday language) rather than academic (inspired by @asomoza's comment here).

Member

I am spread a bit thin on this one, so I will go with what the team prefers.

@stevhliu stevhliu marked this pull request as ready for review April 23, 2025 18:21
@stevhliu
Member Author

Thanks @hlky, those PRs look super nice for the user experience, and I'll update the code examples once they're merged! The configurable example is also really neat, and maybe we can make a Space out of it and embed it in the docs? I'll probably have to follow up on this one in a separate PR though 😅

@sayakpaul
Member

sayakpaul commented Apr 28, 2025

@stevhliu sorry for the delay on my end. The changes look nice and I responded to some of the questions/comments you had. Perhaps after #11130, we could simplify the quantization examples a bit.

@a-r-r-o-w do we want to touch any other video models in this PR?

@stevhliu
Member Author

@sayakpaul, I simplified the examples with the new PipelineQuantizationConfig! Let me know if there are any other changes you'd like to see, otherwise I think we can merge!
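
For readers following the thread, a pipeline-level quantization config of the kind mentioned above might look like the sketch below. It assumes a diffusers version that ships `PipelineQuantizationConfig`, an installed bitsandbytes backend, and the `THUDM/CogVideoX-5b` checkpoint. The helper names `quant_config_kwargs` and `load_quantized_pipeline` and the specific 4-bit settings are illustrative and are not taken from the PR.

```python
def quant_config_kwargs(backend="bitsandbytes_4bit", components=("transformer",)):
    """Build keyword arguments for PipelineQuantizationConfig (illustrative values)."""
    return {
        "quant_backend": backend,
        # Assumed 4-bit NF4 settings; use whatever your chosen backend supports.
        "quant_kwargs": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
        "components_to_quantize": list(components),
    }


def load_quantized_pipeline(model_id="THUDM/CogVideoX-5b"):
    """Load a pipeline in which only the listed components are quantized."""
    # Heavy imports live here so the helper above stays importable on its own.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    kwargs = quant_config_kwargs()
    kwargs["quant_kwargs"]["bnb_4bit_compute_dtype"] = torch.bfloat16
    quant_config = PipelineQuantizationConfig(**kwargs)
    return CogVideoXPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )
```

Keeping the quantization settings in one place at the pipeline level means the example does not have to load and quantize each component separately before assembling the pipeline.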

Member

@sayakpaul sayakpaul left a comment

Looking much better. Here is another round of feedback.

## Notes

### Memory optimization
- CogVideoX supports LoRAs with [`~loaders.CogVideoXLoraLoaderMixin.load_lora_weights`].
Member

Do we need this separate note besides having the LoRA marker button at the top of the page?

Member Author

I think it'd be nice to have an easy copy/paste example for users who want to use this specific model. I'll fold it under a collapsible section as suggested. I also added a link to the LoRA marker button at the top :)
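
A copy/paste LoRA example of the kind described here could be as small as the following sketch. `load_lora_weights` and `set_adapters` are existing pipeline methods; the helper name `apply_lora`, the adapter name, and the 0.9 scale are hypothetical, and `lora_repo` is a placeholder for whichever CogVideoX LoRA checkpoint you use.

```python
def apply_lora(pipeline, lora_repo, adapter_name="my-lora", scale=0.9):
    """Load a LoRA into an already-loaded pipeline and set its strength.

    `lora_repo` is a placeholder: a Hub repo id or local path to LoRA weights.
    """
    # Register the LoRA weights under a name so they can be referenced later.
    pipeline.load_lora_weights(lora_repo, adapter_name=adapter_name)
    # Activate the adapter and scale its contribution (1.0 is full strength).
    pipeline.set_adapters(adapter_name, scale)
    return pipeline
```

Naming the adapter is optional, but it makes it straightforward to adjust the scale later or to switch between several loaded LoRAs.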

<hfoption id="inference speed">

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`HunyuanVideoPipeline`] for inference with bitsandbytes.
Compilation is slow the first time but subsequent calls to the pipeline are faster.
Member

For compilation, should we also refer the readers to our compilation guide?

Member Author

Added link to the compile section in fp16.md and will combine torch2.0.md with it in a separate PR as discussed!
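
The compilation note under discussion (slow first call, faster subsequent calls) can be illustrated with a short sketch. It assumes PyTorch 2.x and a pipeline object that exposes a `transformer` attribute; the helper names `compile_transformer` and `time_calls`, and the `max-autotune` mode, are illustrative choices rather than what the docs prescribe.

```python
import time


def compile_transformer(pipeline, mode="max-autotune", fullgraph=True):
    """Compile the pipeline's transformer in place and return the pipeline."""
    import torch  # imported lazily; torch.compile requires PyTorch 2.x

    pipeline.transformer.to(memory_format=torch.channels_last)
    pipeline.transformer = torch.compile(
        pipeline.transformer, mode=mode, fullgraph=fullgraph
    )
    return pipeline


def time_calls(fn, repeats=3):
    """Return the wall-clock seconds taken by each of `repeats` calls to `fn`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings
```

Wrapping a pipeline call in `time_calls` after `compile_transformer` would be expected to show the first timing dominated by compilation, with later timings reflecting the compiled speed.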

Member

@sayakpaul sayakpaul left a comment

Works for me. @DN6 @a-r-r-o-w do you want to review this as well? I think that with video models becoming more and more crucial, these docs will be very important!

@stevhliu
Member Author

stevhliu commented Jun 2, 2025

I'll merge for now and we can keep iterating on it if there is additional feedback! 🤗

@stevhliu stevhliu merged commit c934720 into huggingface:main Jun 2, 2025
1 check passed
@stevhliu stevhliu deleted the video branch June 2, 2025 23:55
faaany pushed a commit to faaany/diffusers that referenced this pull request Jun 3, 2025
* initial

* update

* hunyuanvideo

* ltx

* fix

* wan

* gen guide

* feedback

* feedback

* pipeline-level quant config

* feedback

* ltx

This reverts commit 2ebe9ca.

```py
export_to_video(output, "wan-i2v.mp4", fps=16)
```

### First and Last Frame Interpolation
Contributor

@a-r-r-o-w a-r-r-o-w Jun 6, 2025

It seems like this PR deleted the examples for different pipelines. Is this intended? I'm adding the examples back in #11582 since we need them to be documented.

Edit: Ah I see, it's the same as the example docstring, so that makes sense. It's just that the FLF2V documentation disappeared, and I thought some other things had gone missing too.


5 participants