
NextFlow🚀: Unified Sequential Modeling Activates Multimodal Understanding and Generation



🚀 Overview

NextFlow is a unified decoder-only autoregressive transformer trained on 6T interleaved text-image tokens. It bridges the gap between understanding and generation within a single architecture, redefining sequential modeling.

  • 🌟 Unified Architecture: Seamlessly integrates multimodal generation, editing, and understanding in one decoder-only transformer, removing the need for separate diffusion or LLM backbones.
  • 🌟 Next-Scale Prediction: A hierarchical prediction paradigm enables generating 1024×1024 images in just 5 seconds—significantly faster than comparable AR models.
  • 🌟 SOTA Performance: Achieves state-of-the-art scores on DPG (88.32) and ImgEdit (4.49), matching specialized diffusion models in quality while retaining LLM reasoning capabilities.
  • 🌟 Advanced Capabilities: Unlocks native Chain-of-Thought (CoT) reasoning, in-context editing, and interleaved generation without re-encoding overhead.
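The next-scale idea can be illustrated with a toy sketch: instead of emitting one token per forward pass, the model emits an entire token map per pass, conditioning each scale on all coarser scales generated so far. The scale schedule and the stand-in predictor below are illustrative assumptions, not NextFlow's actual code.

```python
import numpy as np

def fake_predict(context, size):
    """Stand-in for the transformer: returns a size x size token map
    conditioned (here, trivially) on the previously generated scales."""
    rng = np.random.default_rng(len(context))
    return rng.integers(0, 4096, size=(size, size))

def generate(scales=(1, 2, 4, 8)):
    context = []                           # token maps generated so far
    for s in scales:
        tokens = fake_predict(context, s)  # one forward pass per scale
        context.append(tokens)
    return context[-1]                     # finest map decodes to the image

img_tokens = generate()
print(img_tokens.shape)  # (8, 8)
```

Because the number of sequential passes equals the number of scales rather than the number of tokens, latency grows with the depth of the scale schedule, not with image area.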

🎨 Demo

High-Fidelity Generation

NextFlow produces high-fidelity visuals with exceptional prompt adherence, adeptly handling complex spatial relationships and cultural nuances.

Text to Image Demo

Complex Instruction Following

The model demonstrates precise alignment between text and images, ensuring accurate representation of detailed descriptions.

Text to Image Demo

Image Editing

NextFlow supports precise, instruction-based editing. It modifies specific regions, styles, or attributes based on natural language commands while preserving the original structure and background consistency.

Image editing Demo

CoT Reasoning

By handling interleaved sequences naturally, the model employs Chain-of-Thought reasoning to refine prompts and plan before generating visual content.

Interleaved Demo

Interleaved Generation

Leveraging robust in-context learning, NextFlow performs zero-shot image editing and subject-driven generation effortlessly.

Editing Demo

🏆 Benchmark Evaluation

EditCanvas Benchmark

To overcome dataset limitations, we introduce EditCanvas, a rigorous benchmark covering Traditional Editing and Subject-Driven Generation across 56 tasks with over 5,000 high-quality samples.

EditCanvas Results

Comparison with SOTA

We compare NextFlow against leading unified models (Bagel, Emu3.5) and specialized diffusion models. On the DPG benchmark, NextFlow RL scores 88.32, matching Qwen-Image and outperforming all other models. On ImgEdit, it sets a new state-of-the-art with a score of 4.49.

Radar Chart Comparison

📖 Introduction of NextFlow

NextFlow represents a paradigm shift in autoregressive visual generation. By treating images as hierarchical structures, we achieve the quality of specialized diffusion models while retaining LLM reasoning capabilities.

Model Architecture: Decoder-Only Transformer

Initialized from Qwen2.5-VL-7B, NextFlow extends the standard LLM architecture for visual token prediction. We utilize a Unified Tokenizer, Scale Reweighting, and Self-Correction with Residual Features to stabilize large-scale corpus training and achieve high performance.

Framework diagram
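Scale reweighting addresses an imbalance inherent to hierarchical prediction: token counts grow quadratically with scale, so an unweighted average loss is dominated by the finest scale. A minimal sketch of the general idea follows; the uniform weights are an assumption for illustration, not NextFlow's published scheme.

```python
import numpy as np

def reweighted_loss(per_token_losses_by_scale, weights=None):
    """Average the per-token loss within each scale, then combine scales
    with explicit weights so coarse scales are not drowned out."""
    n = len(per_token_losses_by_scale)
    weights = weights or [1.0 / n] * n  # uniform over scales (assumed)
    return sum(w * np.mean(l)
               for w, l in zip(weights, per_token_losses_by_scale))

# Four scales with side lengths 1, 2, 4, 8 -> 1, 4, 16, 64 tokens each.
losses = [np.full((s * s,), 1.0) for s in (1, 2, 4, 8)]
print(reweighted_loss(losses))  # 1.0 -- each scale contributes equally
```

Without the per-scale mean, the 64-token finest scale would contribute 64/85 of the total; the reweighting gives every level of the hierarchy a comparable gradient signal.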
Training Odyssey

Our pipeline is validated on 6 trillion tokens, ensuring robust multimodal capabilities.

  • Alignment & Pre-Training: Large-scale training on text, image-text pairs, and interleaved data.
  • Reinforcement Learning (RL): We introduce a prefix-tuning strategy for Group Relative Policy Optimization (GRPO), focusing on coarse-scale "prefixes" to stabilize global structure optimization.
Training Pipeline
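The group-relative part of GRPO can be sketched in a few lines: sample a group of generations per prompt and normalize their rewards within the group to obtain advantages, avoiding a learned value model. The prefix strategy described above would then restrict policy updates to the coarse scales; everything below is a hedged illustration, not the repository's implementation.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward against the group of
    rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for one prompt, scored by a reward model (toy values).
adv = group_advantages([0.2, 0.8, 0.5, 0.5])
print(adv.round(2))
```

Advantages sum to (approximately) zero within each group, so above-average rollouts are reinforced and below-average ones are suppressed without any absolute reward scale.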
Inference Efficiency

NextFlow is highly efficient, generating 1024 × 1024 images in just 5 seconds, orders of magnitude faster than comparable AR models, and it requires 6× fewer FLOPs than MMDiT-based diffusion models at 1024² resolution. Its next-scale approach enables dynamic-resolution generation without the typical computational costs of autoregression.
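The latency gap comes largely from the number of sequential decoding steps. A back-of-envelope comparison, using assumed patch size and scale schedule rather than NextFlow's published configuration:

```python
# Flat AR: one token per forward pass. With 16x16-pixel tokens, a
# 1024x1024 image is a 64x64 grid, i.e. thousands of sequential steps.
flat_ar_steps = 64 * 64

# Next-scale: one forward pass per scale. Side lengths are assumed.
scale_schedule = [1, 2, 4, 8, 16, 32, 64]
next_scale_steps = len(scale_schedule)

print(flat_ar_steps, next_scale_steps)  # 4096 7
```

Even though both approaches ultimately emit a comparable number of tokens, the sequential critical path shrinks from thousands of steps to a handful, which is what makes single-digit-second 1024² generation plausible.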


Citation

@article{zhang2026nextflow,
  title={NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation},
  author={Zhang, Huichao and Qu, Liao and Liu, Yiheng and Chen, Hang and Song, Yangyang and Dong, Yongsheng and Sun, Shikun and Li, Xian and Wang, Xu and Jiang, Yi and others},
  journal={arXiv preprint arXiv:2601.02204},
  year={2026}
}
@article{sun2026var,
  title={VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation},
  author={Sun, Shikun and Qu, Liao and Zhang, Huichao and Liu, Yiheng and Song, Yangyang and Li, Xian and Wang, Xu and Jiang, Yi and Du, Daniel K and Wu, Xinglong and others},
  journal={arXiv preprint arXiv:2601.02256},
  year={2026}
}
