Step-Video-T2V #10812

@tin2tin

New txt2vid project:

Step-Video-T2V is a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters that can generate videos of up to 204 frames. To improve both training and inference efficiency, the authors propose a deep compression VAE for videos that achieves 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further improve the visual quality of the generated videos. The model is evaluated on a new video generation benchmark, Step-Video-T2V-Eval, where it is reported to show SoTA text-to-video quality compared with both open-source and commercial engines.
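To make the compression figures concrete, here is a rough sketch of the latent grid size they imply for the resolutions listed in this issue. It only applies the 16x16 spatial and 8x temporal ratios quoted above; the ceiling division and the `latent_shape` helper are my own assumptions, not taken from the Step-Video-T2V code, and the real VAE may treat frame counts that are not a multiple of 8 differently.

```python
import math

SPATIAL_FACTOR = 16   # 16x16 spatial compression, from the project description
TEMPORAL_FACTOR = 8   # 8x temporal compression, from the project description


def latent_shape(height, width, frames):
    """Return (latent_frames, latent_height, latent_width).

    Assumption: sizes that do not divide evenly are rounded up.
    The actual model may pad or handle the first frame differently.
    """
    return (
        math.ceil(frames / TEMPORAL_FACTOR),
        math.ceil(height / SPATIAL_FACTOR),
        math.ceil(width / SPATIAL_FACTOR),
    )


def positions_per_latent():
    """Number of pixel positions (H x W x T) mapped to one latent position."""
    return SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR


if __name__ == "__main__":
    for frames in (204, 136):
        print(frames, "frames ->", latent_shape(544, 992, frames))
    print("pixel positions per latent position:", positions_per_latent())
```

Note that this counts positions only. The number of latent channels is not stated here, so it says nothing about the overall data-size reduction.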

https://github.com/stepfun-ai/Step-Video-T2V

412358956-9274b351-595d-41fb-aba3-f58e6e91603a.mp4

| Model | height/width/frame | Peak GPU Memory | 50 steps w flash-attn | 50 steps w/o flash-attn |
|---|---|---|---|---|
| Step-Video-T2V | 544px992px204f | 77.64 GB | 743 s | 1232 s |
| Step-Video-T2V | 544px992px136f | 72.48 GB | 408 s | 605 s |

| Models | 🤗Huggingface | 🤖Modelscope |
|---|---|---|
| Step-Video-T2V | download | download |
| Step-Video-T2V-Turbo (Inference Step Distillation) | download | download |
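For a quick read of the benchmark numbers above, this small sketch derives the average time per step and the speedup from flash-attn. The timings are copied from the table; the per-step figure assumes the 50 sampling steps account for the whole reported time, which may not hold if text encoding or VAE decoding is included. The `summarize` helper is just illustrative.

```python
STEPS = 50  # every timing in the table is for 50 sampling steps

# Seconds, copied from the benchmark table (both runs at 544x992).
RUNS = {
    "204 frames": {"flash": 743, "no_flash": 1232},
    "136 frames": {"flash": 408, "no_flash": 605},
}


def summarize(flash_s, no_flash_s, steps=STEPS):
    """Average seconds per step for both modes, plus the flash-attn speedup."""
    return {
        "s_per_step_flash": flash_s / steps,
        "s_per_step_no_flash": no_flash_s / steps,
        "speedup": no_flash_s / flash_s,
    }


if __name__ == "__main__":
    for name, t in RUNS.items():
        s = summarize(t["flash"], t["no_flash"])
        print(
            f"{name}: {s['s_per_step_flash']:.2f} s/step with flash-attn, "
            f"{s['s_per_step_no_flash']:.2f} s/step without, "
            f"{s['speedup']:.2f}x speedup"
        )
```

By this estimate flash-attn helps more on the longer 204-frame run than on the 136-frame run.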

Labels: stale (issues that haven't received updates)
