EvolvingLMMs-Lab/Evolving-Visual-Generation

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Figure: Five-level taxonomy of visual generation.


This repository hosts a living roadmap on modern visual generation. The project organizes recent progress in image generation and editing around a capability-oriented view of visual intelligence: moving from one-shot appearance synthesis toward controllable composition, persistent context, agentic interaction, and causal world modeling.

A companion Visual Generation Roadmap website is available with a richer visualization of the taxonomy, the modern research landscape, and the full gallery of stress-test cases. The roadmap is intended to grow with the community: if a paper is missing, or you notice a mis-classification, please open a pull request or an issue, and we will keep updating both the survey and the website. If you find this work useful, please consider citing it.

Core Thesis

Recent visual generation models have improved photorealism and instruction following, but stronger images do not automatically imply stronger visual intelligence. The next bottlenecks are structural, temporal, and causal: models must preserve identity, obey spatial constraints, render exact symbols, reason over external data, interact through closed loops, and verify that generated artifacts satisfy the intended constraints.

We frame this evolution as a five-level progression:

| Level | Capability | Short Description |
|-------|------------|-------------------|
| L1 | Atomic Generation | One-shot probabilistic rendering from prompts or latent codes. |
| L2 | Conditional Generation | Faithful generation under explicit controls, layouts, references, or constraints. |
| L3 | In-Context Generation | Multi-reference, multi-condition, and long-context generation with persistent state. |
| L4 | Agentic Generation | Multi-call planning, generation, verification, rollback, and tool use. |
| L5 | World-Modeling Generation | Causal, physical, and action-conditioned simulation of visual worlds. |

What Is in This Repo

Roadmap at a Glance

The roadmap argues that progress is no longer a single axis of image fidelity. It is a nested expansion of capability:

  1. Modeling moves from GANs to diffusion, flow matching, autoregressive modeling, and hybrid AR-diffusion systems.
  2. Architecture converges toward tokenizers/VAEs, transformer backbones, condition modules, and multimodal fusion mechanisms.
  3. Training shifts from scale alone to data density, VLM relabeling, continued training, SFT, preference optimization, and deployment acceleration.
  4. Applications increasingly demand verifiable constraints: exact text, layout, identity, domain rules, external data, and physical interaction.
  5. Evaluation must move from perceptual similarity toward parsers, OCR, graph validators, simulators, theorem checkers, and red-team agents.
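Point 5 can be made concrete: a verifier-based harness scores hard constraint satisfaction instead of perceptual similarity. The sketch below is illustrative, not from the survey; the check names and toy inputs are assumptions, and a real harness would plug in an OCR engine, a graph parser, or a physics simulator where these stubs sit.

```python
# Minimal sketch of a verifier-based evaluation harness.
# Hypothetical check names and toy inputs; real validators would wrap
# OCR engines, layout parsers, graph checkers, or simulators.
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def check_exact_text(ocr_text: str, target: str) -> CheckResult:
    """Pass only if the OCR'd string matches the prompt's target exactly."""
    return CheckResult("exact_text", ocr_text == target,
                       f"got {ocr_text!r}, want {target!r}")


def check_instance_count(detected: int, required: int) -> CheckResult:
    """Pass only if a detector found exactly the required object count."""
    return CheckResult("instance_count", detected == required,
                       f"detected {detected}, required {required}")


def run_checks(checks: list[CheckResult]) -> float:
    """Aggregate into a hard pass rate rather than a soft similarity score."""
    return sum(c.passed for c in checks) / len(checks)


# Example: the sign text is exact, but the scene has 4 instances instead of 5.
results = [
    check_exact_text("OPEN 24 HOURS", "OPEN 24 HOURS"),
    check_instance_count(detected=4, required=5),
]
print(run_checks(results))  # 0.5
```

The key design choice is that each check is binary: a near-miss on exact text or count scores zero, which is exactly the behavior perceptual metrics smooth over.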

Selected Figures

  • Research landscape
  • Modeling paradigms
  • Closed-source agentic systems
  • Training pipeline
  • Data pipeline

Stress-Test Examples

Standard metrics can miss failures that matter. This repo includes selected qualitative cases where outputs are visually polished but violate geometric, topological, physical, or procedural constraints.

| Test | Target Capability | Typical Failure |
|------|-------------------|-----------------|
| Jigsaw reconstruction | Spatial structuring | Hallucinates plausible content instead of rigidly reassembling pieces. |
| Metro map | Graph/topology following | Produces a convincing map but violates transfer and crossing constraints. |
| Isometric tile map | Coordinate grounding | Places objects in nearby but incorrect grid cells. |
| Fluid dynamics | Causal state transition | Generates plausible-looking flow rather than a physically faithful intervention. |
| Multi-turn editing | Persistent identity and constraint memory across turns | Drifts in identity, layout, or previously satisfied constraints as edits accumulate; later turns silently undo earlier ones. |
| Long-form text rendering | Exact symbolic rendering and typography | Generates near-correct glyphs with character-level errors, swapped digits, or inconsistent fonts in long strings. |
| Counting and quantity | Numerical grounding | Produces a visually plausible scene with the wrong number of instances when the prompt specifies an exact count. |
| Occlusion and depth ordering | 3D-consistent compositional reasoning | Renders objects with mutually inconsistent occlusion or depth cues that violate a single 3D layout. |
| Compositional binding | Attribute-to-entity binding | Swaps or merges colors, materials, and parts across multiple bound entities in the same scene. |
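Several of these tests reduce to set or graph comparisons once the generated image has been parsed. For the metro-map case, a topology check might compare edges extracted from the output against the reference network; the sketch below is a hypothetical illustration (station names are invented, and the line-tracing parser that would produce the edge list is not shown).

```python
# Sketch of a metro-map topology check. The generated map is assumed to
# have been reduced to an edge list by an upstream parser (not shown);
# station names A..E are purely illustrative.
reference_edges = {
    frozenset(e) for e in [("A", "B"), ("B", "C"), ("C", "D"), ("B", "E")]
}


def topology_violations(generated_edges):
    """Return edges dropped from, or hallucinated into, the reference graph."""
    gen = {frozenset(e) for e in generated_edges}
    return {
        "missing": reference_edges - gen,    # transfers the model failed to draw
        "spurious": gen - reference_edges,   # connections it invented
    }


# A visually convincing map that drops the B-E transfer and invents A-C:
violations = topology_violations([("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")])
# violations["missing"] and violations["spurious"] flag the broken topology
# even though every station is rendered and the map looks plausible.
```

Using `frozenset` pairs makes edges orientation-free, so A-B and B-A count as the same connection, which matches how transfer constraints are stated.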

The full gallery, including more multi-turn editing cases, is hosted on the project page; see docs/stress_tests.md for additional details.

Reference Organization

The full bibliography is maintained in references/citation.bib. The list below follows the roadmap sections and uses an awesome-list style: each entry gives the concrete paper name, a paper link when available (preferably arXiv), venue/year, and a short role in the roadmap.

Sec. 1: Motivation and New-Era Visual Generation

Sec. 2: Five-Level Taxonomy of Visual Intelligence

Sec. 3: Modeling Paradigms and Architectures

Sec. 4: Training, Alignment, and Acceleration

Sec. 5: Data, Benchmarks, and Infrastructure

Sec. 6: Applications and Evolving Frontiers

Sec. 7: In-the-Wild Stress Tests

  • Low-complexity single-image super-resolution based on nonnegative neighbor embedding (BMVC, 2012) — Classical super-resolution benchmark used for low-level restoration.
  • Deep retinex decomposition for low-light enhancement (arXiv, 2018) — Low-light enhancement benchmark.
  • A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics (ICCV, 2001) — Berkeley segmentation dataset used in denoising/restoration evaluation.
  • Deep joint rain detection and removal from a single image (CVPR, 2017) — Rain removal benchmark for deraining stress tests.
  • Deep multi-scale convolutional neural network for dynamic scene deblurring (CVPR, 2017) — Dynamic-scene deblurring benchmark.

Sec. 8: Future Directions

  • Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing (arXiv, 2025) — Planning-based complex instruction image editing.
  • MIRA: Multimodal Iterative Reasoning Agent for Image Editing (arXiv, 2025) — Multimodal iterative reasoning agent for editing.
  • Image Editing As Programs with Diffusion Models (arXiv, 2025) — Programmatic view of image editing with diffusion models.
  • AI-Generated Images as Data Source: The Dawn of Synthetic Era (arXiv, 2023) — Position paper on AI-generated images as synthetic training data.
  • Recurrent world models facilitate policy evolution (NeurIPS, 2018) — Early recurrent world model for policy evolution.
  • Mastering diverse control tasks through world models (Nature, 2025) — Generalist world-model RL across diverse control tasks.
  • A path towards autonomous machine intelligence (Open Review, 2022) — Predictive world-modeling agenda for autonomous intelligence.
  • Genie: Generative Interactive Environments (ICML, 2024) — Generative interactive environments from unlabeled videos.
  • Diffusion for world modeling: Visual details matter in Atari (NeurIPS, 2024) — Diffusion-based Atari world modeling.
  • Oasis: A universe in a transformer (Technical Report, 2024) — Transformer-based interactive Minecraft-like world model.
  • GameGen-X: Interactive open-world game video generation (ICLR, 2025) — Interactive open-world game video generation.
  • World simulation with video foundation models for physical AI (arXiv, 2025) — Video foundation model for physical-AI world simulation.
  • ST-Raptor: LLM-Powered Semi-Structured Table Question Answering (SIGMOD, 2026) — Semi-structured table question answering with hierarchical trees.
  • MoDora: Tree-Based Semi-Structured Document Analysis System (SIGMOD, 2026) — Tree-based semi-structured document analysis.
  • FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data (arXiv, 2025) — Data-agent benchmark over heterogeneous analytical queries.
  • Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks (SIGMOD, 2021) — Natural-language-to-visualization benchmark synthesis.
  • DataVisT5: A Pre-Trained Language Model for Jointly Understanding Text and Data Visualization (ICDE, 2025) — Unified model for text and data visualization understanding.

Community suggestions are welcome — please open a pull request or an issue with the paper you would like to see added, and we will keep folding new entries into the roadmap.

Citation

If you find this roadmap useful, please cite the project:

@article{wu2026visual,
  title={Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling},
  author={Wu, Keming and Yang, Zuhao and Zhang, Kaichen and Wang, Shizun and Zhu, Haowei and Leng, Sicong and Yang, Zhongyu and Wang, Qijie and Wang, Sudong and Wang, Ziting and others},
  journal={arXiv preprint arXiv:2604.28185},
  year={2026}
}
