Add MAGI-1: Autoregressive Video Generation at Scale #11713

tolgacangoz · 2025-06-14T05:03:58Z

Thanks for the opportunity to fix #11519!

Original repo: https://github.com/SandAI-org/MAGI-1

✅ AutoencoderKLMagi1: Tiling option by @kuantuna -> Feat: Implement tiling in VAE tolgacangoz/diffusers#6
⏳ Magi1Transformer3DModel
⏳ Support kernels with kernels?
⏳ MAGI1Pipeline, MAGI1ImageToVideoPipeline, MAGI1VideoToVideoPipeline.
❔ Since this is an autoregressive model, support for KV Caching?
...
⏳ Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
⏳ Did you write any new necessary tests?

Try MAGI1Pipelines!

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

…tering and equal split ratio.

Add utility functions for resizing and cropping images while preserving aspect ratio.

Enhance 3D rotary positional embeddings

Adds `center_grid_hw_indices` and `equal_split_ratio` parameters to the 3D rotary positional embedding function for more flexible configuration. The `center_grid_hw_indices` option centers the spatial grid indices around zero. The `equal_split_ratio` parameter provides an alternative way to divide the embedding dimension equally among the temporal and spatial axes.

Updates the Magi1 VAE to utilize these new embedding features, introducing helper functions to prepare the embeddings dynamically based on input tensor dimensions.
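As a rough illustration of the two options described above, the sketch below shows one way centered grid indices and an equal split of the embedding dimension could be computed. The helper names, the boolean `equal_split` flag, and the uneven 1/4, 3/8, 3/8 fallback split are assumptions made for illustration; the actual signature and defaults of the embedding function are not shown in this commit message.

```python
from typing import List, Tuple


def grid_indices(size: int, center: bool = False) -> List[float]:
    """Positions along one spatial axis, optionally shifted to be symmetric around zero."""
    # Hypothetical centering rule: subtract the midpoint of the axis.
    offset = (size - 1) / 2 if center else 0.0
    return [i - offset for i in range(size)]


def split_rope_dims(embed_dim: int, equal_split: bool = False) -> Tuple[int, int, int]:
    """Return (dim_t, dim_h, dim_w), the rotary channels assigned to each axis."""
    if equal_split:
        # Equal split: one third of the channels per axis.
        part = embed_dim // 3
        return part, part, part
    # Illustrative uneven split (assumed, not taken from the PR).
    return embed_dim // 4, embed_dim // 8 * 3, embed_dim // 8 * 3


print(grid_indices(4, center=True))
print(split_rope_dims(96, equal_split=True))
```

With centering enabled, the indices for an axis of even length straddle zero at half-integer positions, so no single patch sits exactly at the origin.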

Replaces the initial causal 3D convolution in the encoder with a standard `Conv3d` patch embedding layer. This simplifies the model and makes its input processing more consistent with Diffusion Transformer (DiT) architectures.

Additionally, this change:
- Removes the unused `Magi1CausalConv3d` class.
- Updates the attention mechanism to use the standard `scaled_dot_product_attention`.
- Sets the default for `sample_posterior` to `True` in the forward pass.

Removes the feature caching logic (`feat_cache`, `feat_idx`) from the encoder, decoder, and their sub-modules. This change significantly simplifies the forward pass implementation by removing stateful cache management. Additionally, this commit replaces the custom `Magi1RMS_norm` with a standard `nn.LayerNorm` and updates several custom causal convolution layers to use standard `nn.Linear` or `nn.Conv3d` layers.

Moves the positional embedding and dropout layers from the main autoencoder class into the decoder module. This improves encapsulation as the embedding is only used within the decoder. The decoder's forward pass is updated to apply the positional embedding and to remove the class token before the final output convolution. Additionally, `quant_conv` is renamed to `quant_linear` to accurately reflect the layer type.

Updates the `Magi1Decoder3d` from a convolutional design to a Transformer-like structure that operates on patches. This change replaces the initial convolutional and middle blocks with a linear projection layer, positional embeddings, and a class token. The logic for these components is moved from the parent `AutoencoderKLMagi1` model into the decoder for better encapsulation.

Removes several custom modules, including `Magi1ResidualBlock`, `Magi1Resample`, and `Magi1UpBlock`. Replaces the previous `Magi1MidBlock` with a more standard transformer-style `Magi1Block`. This change simplifies the overall VAE architecture by consolidating complex, specialized blocks into a more conventional design.

Replaces the custom `Magi1AttentionBlock` with the more generic `diffusers.Attention` module, combined with a new (?) `Magi1AttnProcessor2_0`. This change aligns the implementation with standard library patterns and leverages PyTorch 2.0's `scaled_dot_product_attention` for improved efficiency. The `Magi1Block` is also refactored into a more conventional transformer block structure using `Attention` and `FeedForward` modules.

Refactors the Magi1 VAE decoder to use a more standard transformer-based architecture. This change replaces the previous U-Net-like upsampling blocks with a series of standard transformer blocks, each containing self-attention and a feed-forward network. The custom rotary positional embedding logic and its helper functions have been removed, and the attention processor is simplified to work with the standard `Attention` module. This simplifies the overall model implementation.

Replaces the previous convolutional U-Net style encoder with a Vision Transformer (ViT) based implementation. This new architecture processes the input by dividing it into patches, adding positional embeddings, and then passing the sequence through a series of transformer blocks. The attention processor is also updated to support attention masks, and the model's configuration is adjusted to accommodate the new transformer-specific parameters.

Removes complex and unused parameters from the Magi1 VAE, encoder, and decoder modules. This change refactors the model to use a more standard Transformer architecture, eliminating the previous U-Net-like structure with dimension multipliers and residual blocks. The configuration is now more direct, improving clarity and maintainability.

Simplifies the initialization of the Magi1 VAE, encoder, and decoder. Reorders constructor parameters for clarity and removes unused arguments. The spatial and temporal compression ratios are now derived directly from the `patch_size` configuration, making the relationship more explicit. The pipeline is updated to use these new VAE attributes.
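A minimal sketch of how compression ratios might be read off a patch size, assuming `patch_size` is ordered `(time, height, width)` and the spatial patches are square. The function names and the example values are hypothetical and are not taken from the PR.

```python
from typing import NamedTuple, Tuple


class CompressionRatios(NamedTuple):
    temporal: int
    spatial: int


def ratios_from_patch_size(patch_size: Tuple[int, int, int]) -> CompressionRatios:
    """Derive the compression ratios directly from a (t, h, w) patch size."""
    pt, ph, pw = patch_size
    if ph != pw:
        raise ValueError("this sketch assumes square spatial patches")
    return CompressionRatios(temporal=pt, spatial=ph)


def latent_grid(num_frames: int, height: int, width: int,
                patch_size: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Number of patches along each axis for a given input size."""
    pt, ph, pw = patch_size
    if num_frames % pt or height % ph or width % pw:
        raise ValueError("input dimensions must be divisible by the patch size")
    return num_frames // pt, height // ph, width // pw


ratios = ratios_from_patch_size((4, 8, 8))  # illustrative values only
print(ratios, latent_grid(16, 256, 256, (4, 8, 8)))
```

Deriving both ratios from one configuration value keeps the pipeline and the VAE from drifting apart when the patch size changes.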

Simplifies the model architecture by removing the quantization and post-quantization convolution layers. This streamlines the `encode` and `decode` methods. The decoder is also updated to process the entire latent tensor at once, removing the previous frame-by-frame processing loop. Additionally, this change updates an import path for the `timm` library and renames an internal variable for consistency.

Updates the conversion script for the MAGI-1 VAE to correctly handle its Vision Transformer (ViT) based architecture. The state dictionary mapping is rewritten to align with the ViT structure. This includes adding logic to split the original checkpoint's combined QKV weights into separate query, key, and value tensors for the `diffusers` model. The model class and its configuration are also updated to reflect the appropriate ViT parameters, ensuring a correct conversion.
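The QKV split mentioned above can be sketched as follows. This assumes the fused tensor stacks query, key, and value contiguously along its first dimension; some checkpoints interleave them per head instead, in which case a reshape is needed first. The source key name is hypothetical; `to_q`, `to_k`, and `to_v` follow the naming used by the `diffusers` `Attention` module. The helper works on any object that supports `len()` and slicing along the first axis, so a list of rows is used here in place of a tensor.

```python
from typing import Any, Dict, Tuple


def split_fused_qkv(fused: Any) -> Tuple[Any, Any, Any]:
    """Split a fused [3 * dim, ...] parameter into equal Q, K, V chunks."""
    rows = len(fused)
    if rows % 3 != 0:
        raise ValueError(f"first dimension {rows} is not divisible by 3")
    n = rows // 3
    return fused[:n], fused[n:2 * n], fused[2 * n:]


def convert_attention_entry(state_dict: Dict[str, Any], fused_key: str,
                            target_prefix: str) -> Dict[str, Any]:
    """Map one fused QKV entry to separate to_q / to_k / to_v entries."""
    suffix = fused_key.rsplit(".", 1)[-1]  # "weight" or "bias"
    q, k, v = split_fused_qkv(state_dict[fused_key])
    return {
        f"{target_prefix}.to_q.{suffix}": q,
        f"{target_prefix}.to_k.{suffix}": k,
        f"{target_prefix}.to_v.{suffix}": v,
    }


# Hypothetical checkpoint entry with a 6 x 1 fused weight.
original = {"blocks.0.attn.qkv.weight": [[0], [1], [2], [3], [4], [5]]}
print(convert_attention_entry(original, "blocks.0.attn.qkv.weight", "blocks.0.attn"))
```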

Improve compatibility by handling various PyTorch checkpoint formats. The loader now correctly extracts the state dictionary when it is nested under common keys like "model" or "state_dict". Ensure consistent loading of sharded safetensors by sorting the checkpoint files before merging them.
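A small sketch of the loading behaviour described above, using plain dictionaries in place of real checkpoint files. The function names are hypothetical, and the wrapper keys are the two mentioned in the commit message; the actual loader may check more.

```python
from collections.abc import Mapping
from typing import Any, Dict

WRAPPER_KEYS = ("model", "state_dict")


def extract_state_dict(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Return the nested state dict if the checkpoint wraps it under a known key."""
    for key in WRAPPER_KEYS:
        inner = checkpoint.get(key)
        # Only unwrap when the value is itself a mapping, not a tensor named "model".
        if isinstance(inner, Mapping):
            return dict(inner)
    return dict(checkpoint)


def merge_shards(shards: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge shards in sorted filename order so the result is deterministic."""
    merged: Dict[str, Any] = {}
    for filename in sorted(shards):
        merged.update(extract_state_dict(shards[filename]))
    return merged


shards = {
    "model-00002-of-00002.safetensors": {"b": 2},
    "model-00001-of-00002.safetensors": {"state_dict": {"a": 1}},
}
print(merge_shards(shards))
```

Lexicographic sorting only matches numeric order when shard indices are zero-padded, as in the filenames used here.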

…cation and partial loading support temply
- Added `allow_partial` parameter to `convert_magi1_transformer_checkpoint` for flexible state dict loading.
- Improved `convert_transformer_state_dict` to generate a detailed mapping report, including missing and unexpected keys, and shape mismatches.
- Updated command-line arguments to support new features.
- Enhanced error handling and reporting during conversion verification.
- Refactored related functions to accommodate changes in the transformer architecture.
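The kind of mapping report listed above could look roughly like this. The sketch compares shapes only, and the names `MappingReport`, `build_report`, and `verify` are hypothetical. How `allow_partial` behaves in the real conversion script is not shown here; the sketch assumes it tolerates missing or unexpected keys but never a shape mismatch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


@dataclass
class MappingReport:
    missing: List[str] = field(default_factory=list)
    unexpected: List[str] = field(default_factory=list)
    shape_mismatches: Dict[str, Tuple[Shape, Shape]] = field(default_factory=dict)


def build_report(converted: Dict[str, Shape], expected: Dict[str, Shape]) -> MappingReport:
    """Compare converted parameter shapes against what the target model expects."""
    report = MappingReport()
    report.missing = sorted(expected.keys() - converted.keys())
    report.unexpected = sorted(converted.keys() - expected.keys())
    for key in sorted(converted.keys() & expected.keys()):
        if converted[key] != expected[key]:
            report.shape_mismatches[key] = (converted[key], expected[key])
    return report


def verify(report: MappingReport, allow_partial: bool = False) -> None:
    """Raise if the conversion is unusable under the assumed allow_partial semantics."""
    if report.shape_mismatches:
        raise ValueError(f"shape mismatches: {report.shape_mismatches}")
    if (report.missing or report.unexpected) and not allow_partial:
        raise ValueError(f"missing={report.missing}, unexpected={report.unexpected}")


report = build_report({"a": (2, 2), "c": (1,)}, {"a": (2, 3), "b": (4,)})
print(report)
```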

…pelines with new scheduler integration
- Added `_magi_varlen_attention` function to support variable-length sequences in MAGI-1 using the magi-attention library.
- Updated `Magi1AttnProcessor` to utilize the MAGI backend when variable-length parameters are provided.
- Integrated `FlowMatchEulerDiscreteScheduler` into example usage for `Magi1ImageToVideoPipeline`, `Magi1VideoToVideoPipeline`, and `Magi1Pipeline`, emphasizing the required shift parameter for proper functionality.
- Refactored code to improve clarity and maintainability, including normalization of timestep calculations in the pipelines.
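To show why the shift parameter matters, here is a sketch of the static shift commonly applied to flow-matching sigmas, `shift * sigma / (1 + (shift - 1) * sigma)`. The linear base schedule and the value `3.0` are illustrative assumptions; the shift value MAGI-1 requires is not stated in this commit message.

```python
from typing import List


def shift_sigma(sigma: float, shift: float) -> float:
    """Apply a static shift to a noise level in [0, 1]."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, shift: float) -> List[float]:
    """Linear sigmas from 1 towards 0 (simplified), each remapped by the shift."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in base]


print(shifted_schedule(4, shift=1.0))  # unchanged linear schedule
print(shifted_schedule(4, shift=3.0))  # more steps spent at high noise levels
```

A shift above 1 leaves the endpoints fixed and pushes intermediate sigmas upward, so a mismatched value changes where the sampler spends its steps.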

kuantuna · 2025-10-24T16:14:46Z

Hi @tolgacangoz, I'll have some free time this month and want to help you merge this PR. Are there any specific things I can look into?

I can start with VAE tiling; I think you had some open comments back then. To be honest, changing their implementation just to avoid breaking the diffusers functional structure did not feel like the best way to implement the tiled decode. So maybe I can start by reverting that and use the original "sample-then-blend" approach (reference) instead, then move on to other things. Wdyt?

tolgacangoz · 2025-10-24T17:20:33Z

I think we could return to VAE again in the end. You can go on with one of the pipeline groups: T2V or (I2V and V2V).

kuantuna · 2025-10-24T17:35:21Z

That works for me. Sure, I can start with T2V then. I saw that there is already a pipeline_magi1.py. Is it not completed yet?
Edit: You already wrote some explanations at the top of the file, and they are pretty self-explanatory, thanks! So I guess the next step would be implementing the KV caching?

tolgacangoz added 30 commits June 13, 2025 22:09

first draft

dabb12f

style

89806ea

upp

f4b5748

style

9b45317

Merge branch 'main' into add-magi-1

8e5881b

2nd draft

03d50e2

2nd draft

08287a9

up

8784881

Refactor Magi1AttentionBlock to support rotary embeddings and integrate attention mechanism accordingly. Updated initialization parameters and reshaping logic.

ae03b7d

Merge branch 'main' into add-magi-1

61e7cb0

style

1898e19

Rename AutoencoderKLMagi to AutoencoderKLMagi1

d5f5594

Renames the Magi autoencoder class to align with the "MAGI-1" model name. This refactoring improves consistency and clarity throughout the codebase, including documentation and tests.

Refactor: Rename Magi to Magi1

0cb50c9

Aligns the model naming with the source paper, "MAGI-1". This change refactors the model class, associated files, tests, and documentation to use the `Magi1` prefix for better clarity and consistency.

style

af5b575

Merge branch 'main' into add-magi-1

7a4af97

Refactor: Update references from MagiPipeline to Magi1Pipeline across multiple files

b5e140b

tolgacangoz mentioned this pull request Sep 10, 2025

Add Wan2.2-S2V: Audio-Driven Cinematic Video Generation #12258

Open

tolgacangoz and others added 26 commits September 15, 2025 19:57

Merge branch 'main' into add-magi-1

73ba2c7

Begin to propose adding support for the MagiAttention backend to the attention dispatch

98cdfb5

style

d03d4bc

Merge branch 'main' into add-magi-1

b6d973e

Add support for parallel_config in the attention processor

98e92fc

Up for the MagiAttention backend.

92cbe8f

This includes:
- Adding `magi` to the list of available attention backends in the documentation.
- Adding utility functions to check for the availability and version of the `magi_attention` package.
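A minimal sketch of an availability and version check using only the standard library. The helper name is hypothetical, and it is an assumption that the installed distribution is published under the same name as the import (`magi_attention`); the utilities added in this commit may be implemented differently.

```python
import importlib.metadata
import importlib.util
from typing import Optional, Tuple


def package_status(import_name: str, dist_name: Optional[str] = None) -> Tuple[bool, Optional[str]]:
    """Return (is_importable, version_or_None) without importing the package."""
    if importlib.util.find_spec(import_name) is None:
        return False, None
    try:
        version = importlib.metadata.version(dist_name or import_name)
    except importlib.metadata.PackageNotFoundError:
        # Importable, but no matching distribution metadata was found.
        version = None
    return True, version


available, version = package_status("magi_attention")
print(f"magi_attention available: {available}, version: {version}")
```

Checking with `find_spec` avoids importing a heavy package just to learn whether a backend can be offered.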

Merge branch 'main' into add-magi-1

041a7e1

Merge branch 'main' into add-magi-1

8fb783e

[test] add unit tests for AutoencoderKLMagi1 model

a619b6b

[test] remove obsolete tests for AutoencoderKLMagi1 model

a730a66

up FeedForward import

de2277e

up tr

c9bbb2d

up template for pipet2v

cb4d92e

up tr

65b59e3

up pipet2v

c94d554

style

8d7ab21

up

454b26d

up i2v

91e0bdf

upp

0370762

Enhance MAGI-1 documentation and improve Video-to-Video pipeline with updated examples and tests

1f36d30

up pipes

8fff46a

style

1286419

Merge branch 'main' into add-magi-1

b99c0ea

up

8f94ae8


4 participants
