Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation


๐Ÿ“ข News

๐Ÿš€ About

World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocksโ€”applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences.

Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation.
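The block-by-block conditioning described above can be sketched as a short loop. Everything here is illustrative: `denoise_block`, the list-based KV cache, and the frame strings are stand-ins, not Inferix's actual API.

```python
# Illustrative sketch of semi-autoregressive (block-diffusion) decoding:
# diffusion happens *within* a block, while each block conditions on the
# cached context of all previous blocks, LLM-KV-cache style.

def denoise_block(block_idx, kv_cache, frames_per_block=3):
    """Hypothetical denoiser: returns one block of 'frames' conditioned
    on previously cached blocks (here, trivially, their indices)."""
    context = [entry["block"] for entry in kv_cache]
    return [f"frame({block_idx},{i}) ctx={context}" for i in range(frames_per_block)]

def generate(num_blocks, frames_per_block=3):
    kv_cache = []   # grows block by block, enabling variable-length generation
    video = []
    for b in range(num_blocks):
        block = denoise_block(b, kv_cache, frames_per_block)  # diffusion inside block
        kv_cache.append({"block": b})                         # later blocks condition on it
        video.extend(block)
    return video

frames = generate(num_blocks=4)
print(len(frames))  # 4 blocks x 3 frames = 12
```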

Inferix is therefore designed as a next-generation inference engine for immersive world synthesis through optimized semi-autoregressive decoding. This dedicated focus on world simulation distinguishes it from systems engineered for high-concurrency LLM serving (such as vLLM or SGLang) and from inference engines for classic video diffusion models (such as xDiT).

Architecture Overview

โœจ Key Features

  • ๐Ÿง  Advanced KV Cache Management: Intelligent memory management for persistent world simulation
  • ๐Ÿ”€ Distributed World Synthesis: Support for large-scale immersive environment generation
  • ๐Ÿ“น Video Streaming: Basic video streaming capabilities for generated content, with both RTMP and WebRTC supported as streaming protocols.
  • ๐ŸŽฎ Interactive Generation: Real-time video preview with Gradio UI, supports prompt changes and generation controls (pause/resume/stop). Works on 16GB consumer GPUs.
  • ๐Ÿ”ง Seamless Model Integration: Simple API for world model deployment
  • ๐Ÿ“Š Next-Gen Architecture: Built for immersive world synthesis at scale
  • ๐Ÿ“ˆ Built-in Profiling: Performance monitoring and analysis capabilities with enhanced diffusion model profiling
  • ๐Ÿ”„ Continuous Prompt Support: Enable dynamic narrative control with different prompts for different video segments (see CausVid example)
  • ๐Ÿš€ Quantized Inference: 8-bit(INT8 / FP8) quantization(Per-tensor / Per-token-per-channel) with DAX support

Framework Architecture

๐Ÿงฉ Semi-AR Decode Architecture

Inferix separates semi-autoregressive diffusion blocks from VAE decoding, and exposes a small set of configurable modes:

  • Diffusion blocks (block size): Model-level generation units (e.g., 3 frames/block in Self-Forcing) that control KV Cache updates and semi-autoregressive behavior.
  • VAE chunks (chunk size): Temporal slices used only inside the VAE to bound peak VRAM during decoding.
  • Decode timing modes:
    • AFTER_ALL: Decode after all diffusion blocks are generated (offline / batch usage).
    • PER_BLOCK: Decode each block as soon as it is ready (used by progressive streaming APIs).
    • NO_DECODE: Skip decoding and operate on latents only (advanced/integration scenarios).
  • Memory strategy: KV Cache can be freed before VAE decoding on memory-constrained GPUs, and VAE chunk size can be tuned to trade latency for peak memory.

For end-to-end progressive streaming usage and recommended settings on a single GPU, see the streaming example.
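The decode-timing modes above can be sketched as a small dispatch. The mode names come from this README; `vae_decode`, the latent-block lists, and the control flow are illustrative stand-ins rather than Inferix's real implementation.

```python
# Sketch of the three decode-timing modes: decode everything at the end,
# decode each block as it completes, or skip VAE decoding entirely.
from enum import Enum

class DecodeTiming(Enum):
    AFTER_ALL = "after_all"   # decode after all blocks (offline / batch)
    PER_BLOCK = "per_block"   # decode each block as it is ready (streaming)
    NO_DECODE = "no_decode"   # operate on latents only

def vae_decode(latents):
    """Stand-in VAE decoder: tags latents as decoded frames."""
    return [f"decoded:{x}" for x in latents]

def run(blocks, timing):
    out, pending = [], []
    for latent_block in blocks:
        if timing is DecodeTiming.PER_BLOCK:
            out.extend(vae_decode(latent_block))   # stream frames immediately
        else:
            pending.extend(latent_block)
    if timing is DecodeTiming.AFTER_ALL:
        out = vae_decode(pending)                  # one decode pass at the end
    elif timing is DecodeTiming.NO_DECODE:
        out = pending                              # raw latents for downstream use
    return out

blocks = [["l0", "l1"], ["l2"]]
print(run(blocks, DecodeTiming.PER_BLOCK))  # ['decoded:l0', 'decoded:l1', 'decoded:l2']
```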

๐Ÿ—“๏ธ Roadmap

Framework Enhancements

  • Complex KV Management
  • Support fine-tuning a pretrained video generation model (diffusion to semi-AR) and distilling models into few steps
  • Support high-concurrency deployment
  • Support more complex distributed inference
  • Improve video streaming usage and performance
  • Advanced real-time streaming capabilities

World Model Support

  • Interactive World Models (basic: prompt change, pause/resume/stop)
  • Enhanced Simulation Capabilities
  • Persistent World State Management

๐Ÿš€ Getting Started

Installation

See Installation Guide for detailed instructions.

Run Examples

Check out our example configurations for different models:

Supported Semi-autoregressive Models

๐Ÿ”ง Model Integration Guide

View our Model Integration Guide for detailed instructions. Below is a simple guide:

To integrate your own semi-autoregressive models with Inferix, follow these steps:

1. Create Model Directory Structure

inferix/
โ””โ”€โ”€ models/
    โ””โ”€โ”€ your_model_name/
        โ”œโ”€โ”€ __init__.py
        โ”œโ”€โ”€ model.py              # Model architecture implementation
        โ”œโ”€โ”€ config.py             # Model-specific configuration handling
        โ””โ”€โ”€ utils.py              # Utility functions for your model (optional)

1.1 Using Wan-Base model

Since Wan-1.3B is widely used as a base pretrained diffusion model in the world model community, we provide it as a base model. You can extend it in the models/wan_base directory, just as Self Forcing and CausVid do. If you need other base models, please let us know and we will provide them as soon as possible.

2. Implement Pipeline Class

Create a pipeline class that inherits from AbstractInferencePipeline.

Key methods to implement:

  • load_checkpoint(): Load model weights
  • run_text_to_video(): Text-to-video generation
  • run_image_to_video(): Image-to-video generation
  • _initialize_pipeline(): Custom initialization logic
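A skeleton for step 2 might look like the following. `AbstractInferencePipeline` and the method names come from this guide, but its import path and exact signatures are not shown here, so a minimal stand-in base class is defined locally for illustration.

```python
# Skeleton pipeline for a custom semi-autoregressive model. The base class
# below is a local stand-in for Inferix's real AbstractInferencePipeline.
from abc import ABC, abstractmethod

class AbstractInferencePipeline(ABC):  # stand-in, not the real import
    @abstractmethod
    def load_checkpoint(self, path: str): ...
    @abstractmethod
    def run_text_to_video(self, prompt: str): ...

class YourModelPipeline(AbstractInferencePipeline):
    def __init__(self):
        self.weights = None
        self._initialize_pipeline()

    def _initialize_pipeline(self):
        # custom initialization logic (device placement, caches, ...)
        self.ready = True

    def load_checkpoint(self, path: str):
        # load model weights; a real implementation would read `path`
        self.weights = {"checkpoint": path}

    def run_text_to_video(self, prompt: str):
        # text-to-video generation; placeholder output here
        return [f"frame for: {prompt}"]

    def run_image_to_video(self, image, prompt: str):
        # image-to-video generation; placeholder output here
        return [f"frame for: {prompt} from {image}"]

pipe = YourModelPipeline()
pipe.load_checkpoint("ckpt.safetensors")
print(pipe.run_text_to_video("a red fox"))  # ['frame for: a red fox']
```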

3. Create Example

Add an example in the example/your_model_name/ directory:

  • README.md with usage instructions
  • run_your_model.py for execution
  • shell script for execution

3.1 Add Configuration Files in the Example Directory

Create YAML or JSON configuration files for your model in the example/your_model_name/configs/ directory.
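For step 3.1, a JSON config might look like the sketch below (JSON is shown since it needs only the standard library). The keys mirror concepts from this README (block size, VAE chunk size, decode timing) but are hypothetical; check your model's example configs for the real schema.

```python
# Hypothetical example config and a minimal loader/validator.
import json

config_text = """
{
  "model": "your_model_name",
  "checkpoint": "ckpt.safetensors",
  "block_size": 3,
  "vae_chunk_size": 8,
  "decode_timing": "PER_BLOCK"
}
"""

config = json.loads(config_text)
assert config["block_size"] > 0, "block size must be positive"
print(config["decode_timing"])  # PER_BLOCK
```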

4. Update Documentation

Add your model to the main README and create detailed documentation in your model's example directory.

Benchmarks

LV-Bench

LV-Bench is a curated benchmark of 1,000 minute-long videos for evaluating long-horizon generation. Please refer to LV-Bench for more details.

๐Ÿ“ License

This project is licensed under the Apache License 2.0.

The main code of Inferix is based on the Apache 2.0 license. However, some included third-party components may be subject to their respective open-source licenses. Users should comply with the corresponding licenses of these third-party components when using them.

๐Ÿ“ž Contact Us

For questions and support, please reach out through:

๐Ÿ“š Citation

If you use Inferix in your research, please cite:

@article{team2025inferix,
  title={Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation},
  author={Team, Inferix and Feng, Tianyu and Han, Yizeng and He, Jiahao and He, Yuanyu and Lin, Xi and Liu, Teng and Lu, Hanfeng and Tang, Jiasheng and Wang, Wei and others},
  journal={arXiv preprint arXiv:2511.20714},
  year={2025}
}

๐Ÿ™ Acknowledgments

We thank the following projects for their contributions:

Team Members:

We are a joint team from ZJU & HKUST & Alibaba DAMO Academy & Alibaba TRE.

Current Members:

  • Tianyu Feng
  • Yizeng Han
  • Jiahao He
  • Yuanyu He
  • Xi Lin
  • Teng Liu
  • Hanfeng Lu
  • Jiasheng Tang
  • Wei Wang
  • Zhiyuan Wang
  • Jichao Wu
  • Mingyang Yang
  • Yinghao Yu
  • Zeyu Zhang
  • Bohan Zhuang
