NextFlow is a unified decoder-only autoregressive transformer trained on 6T interleaved text-image tokens. It bridges the gap between understanding and generation within a single architecture, redefining sequential modeling.
- 🌟 Unified Architecture: Seamlessly integrates multimodal generation, editing, and understanding in one decoder-only transformer, removing the need for separate diffusion or LLM backbones.
- 🌟 Next-Scale Prediction: A hierarchical prediction paradigm enables generating 1024×1024 images in just 5 seconds—significantly faster than comparable AR models.
- 🌟 SOTA Performance: Achieves state-of-the-art scores on DPG (88.32) and ImgEdit (4.49), matching specialized diffusion models in quality while retaining LLM reasoning capabilities.
- 🌟 Advanced Capabilities: Unlocks native Chain-of-Thought (CoT) reasoning, in-context editing, and interleaved generation without re-encoding overhead.
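The next-scale paradigm above can be sketched in a few lines. This is an illustrative stand-in, not NextFlow's actual code: `SCALES`, `StubModel`, and `predict_scale` are hypothetical names, and the scale schedule is assumed. The key idea is that each decoding step predicts an entire token map, coarse to fine, instead of one token at a time.

```python
# Hypothetical sketch of next-scale prediction (not the released code).
SCALES = [1, 2, 4, 8, 16, 32, 64]  # assumed side lengths of the token pyramid


class StubModel:
    """Stand-in for the transformer: returns a dummy token map."""

    def predict_scale(self, sequence, num_tokens):
        return [0] * num_tokens


def generate(model, prompt_tokens):
    sequence = list(prompt_tokens)
    for side in SCALES:
        # One decoding step predicts the whole side x side token map,
        # conditioned on the prompt and every coarser scale so far.
        token_map = model.predict_scale(sequence, side * side)
        sequence.extend(token_map)
    return sequence  # the finest map is decoded into the output image


image_tokens = generate(StubModel(), [101, 102])
```

With this schedule the model needs only seven sequential decoding steps per image, which is where the speedup over token-by-token AR generation comes from.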
NextFlow produces high-fidelity visuals with exceptional prompt adherence, adeptly handling complex spatial relationships and cultural nuances.
The model demonstrates precise alignment between text and images, ensuring accurate representation of detailed descriptions.
NextFlow supports precise, instruction-based editing. It modifies specific regions, styles, or attributes based on natural language commands while preserving the original structure and background consistency.
By handling interleaved sequences naturally, the model employs Chain-of-Thought reasoning to refine prompts and plan before generating visual content.
Leveraging robust in-context learning, NextFlow performs zero-shot image editing and subject-driven generation effortlessly.
To overcome dataset limitations, we introduce EditCanvas, a rigorous benchmark covering Traditional Editing and Subject-Driven Generation across 56 tasks with over 5,000 high-quality samples.
We compare NextFlow against leading unified models (Bagel, Emu3.5) and specialized diffusion models. On the DPG benchmark, NextFlow RL scores 88.32, matching Qwen-Image and outperforming all other models. On ImgEdit, it sets a new state-of-the-art with a score of 4.49.
NextFlow represents a paradigm shift in autoregressive visual generation. By treating images as hierarchical structures, it matches the quality of specialized diffusion models while retaining the reasoning capabilities of LLMs.
Initialized from Qwen2.5-VL-7B, NextFlow extends the standard LLM architecture for visual token prediction. We utilize a Unified Tokenizer, Scale Reweighting, and Self-Correction with Residual Features to stabilize large-scale corpus training and achieve high performance.
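Scale Reweighting is only named above; one plausible form, sketched under the assumption that its goal is to keep coarse scales (which have very few tokens) from being drowned out by the fine scales (which have thousands), is to average the loss within each scale before averaging across scales. The function name and exact weighting here are hypothetical.

```python
def scale_reweighted_loss(per_token_losses_by_scale):
    """Illustrative scale reweighting: average within each scale, then
    across scales, so a 1-token coarse map contributes as much to the
    objective as a 4096-token fine map. (Not the paper's exact scheme.)"""
    scale_means = [sum(losses) / len(losses) for losses in per_token_losses_by_scale]
    return sum(scale_means) / len(scale_means)
```

Without some reweighting, a plain token-level average would let the finest scale dominate the gradient almost entirely, since it holds the vast majority of tokens.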
Our pipeline is validated on 6 trillion tokens, ensuring robust multimodal capabilities.
- Alignment & Pre-Training: Large-scale training on text, image-text pairs, and interleaved data.
- Reinforcement Learning (RL): We introduce a prefix-tuning strategy for Group Reward Policy Optimization (GRPO), focusing on coarse-scale "prefixes" to stabilize global structure optimization.
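The group-relative part of GRPO can be sketched as follows: sample a group of completions per prompt, score each with a reward model, and normalize rewards within the group to get advantages. Under the prefix-tuning strategy described above, the resulting policy gradient would be applied only to the coarse-scale "prefix" tokens. This helper is illustrative, not the paper's implementation.

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sample's reward against its
    own group's mean and standard deviation (illustrative helper)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All completions scored identically: no preference signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Normalizing within the group removes the need for a learned value baseline, which is the usual motivation for GRPO over PPO.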
NextFlow is highly efficient, generating 1024 × 1024 images in just 5 seconds, orders of magnitude faster than comparable AR models, and requiring 6× fewer FLOPs than MMDiT-based diffusion models at 1024² resolution. Its next-scale approach enables dynamic-resolution generation without the typical computational cost of autoregression.
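A back-of-envelope calculation shows where the sequential speedup comes from. The numbers below are assumptions, not figures from the text: a 64 × 64 token grid for a 1024 × 1024 image (16× tokenizer downsampling) and a 7-level scale pyramid.

```python
# Assumed geometry: 1024x1024 pixels -> 64x64 token grid (16x VQ downsampling).
flat_ar_steps = 64 * 64            # raster AR: one sequential step per token
pyramid = [1, 2, 4, 8, 16, 32, 64]  # hypothetical next-scale schedule
next_scale_steps = len(pyramid)    # one sequential step per scale
```

Under these assumptions, flat raster-order AR needs 4096 sequential decoding steps, while next-scale prediction needs only 7, with each step predicting a full token map in parallel.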
@article{zhang2026nextflow,
title={NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation},
author={Zhang, Huichao and Qu, Liao and Liu, Yiheng and Chen, Hang and Song, Yangyang and Dong, Yongsheng and Sun, Shikun and Li, Xian and Wang, Xu and Jiang, Yi and others},
journal={arXiv preprint arXiv:2601.02204},
year={2026}
}

@article{sun2026var,
title={VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation},
author={Sun, Shikun and Qu, Liao and Zhang, Huichao and Liu, Yiheng and Song, Yangyang and Li, Xian and Wang, Xu and Jiang, Yi and Du, Daniel K and Wu, Xinglong and others},
journal={arXiv preprint arXiv:2601.02256},
year={2026}
}







