Awesome_MLLMs_Reasoning

In this repository, we continuously update the latest papers, projects, and other valuable resources that advance MLLM reasoning, making it easier for everyone to learn and keep up with the field!

📢 Updates

  • 2025.03: We released this repo. Feel free to open pull requests.

📚 Table of Contents

  • 📖 Papers
    • 📝 1. Technical Reports
    • 📌 2. Generated Data Guided Post-Training
    • 🚀 3. Test-time Scaling
    • 🚀 4. Collaborative Reasoning
    • 💰 5. MLLM Reward Model
    • 📊 6. Benchmarks
    • 📦 7. Applications
  • 🛠️ Open-Source Projects
  • 🤝 Contributing

📖 Papers

📝 1. Technical Reports

We also feature some well-known technical reports on Large Language Model (LLM) reasoning.

  • [2507] [GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning] (GLM-V Team) Technical Report

  • [2506] [MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention] (MiniMax Team) Technical Report

  • [2506] [MiMo-VL Technical Report] (LLM-Core Xiaomi) Technical Report

  • [2504] [Kimi-VL Technical Report] (Kimi Team) Technical Report

  • [2503] [Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought] (SkyWork AI) Technical Report Model

  • [2503] [Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs] (Microsoft) Technical Report

  • [2503] [QwQ-32B: Embracing the Power of Reinforcement Learning] (Qwen Team) Technical Report Code Model

  • [2501] [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning] (DeepSeek Team) Technical Report

  • [2501] [Kimi k1.5: Scaling Reinforcement Learning with LLMs] (Kimi Team) Technical Report

📌 2. Generated Data Guided Post-Training

| Title | Date | Code/Huggingface | Taxonomy |
| --- | --- | --- | --- |
| Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning | 2025.07 | Github, Huggingface | SFT |
| VTS-V: Multi-step Visual Reasoning with Visual Tokens Scaling and Verification | 2025.06 | Github, Paper | SFT |
| Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | 2025.06 | Paper | RL |
| SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis | 2025.06 | Paper | RL |
| Truth in the Few: High-Value Data Selection for Efficient Multi-Modal Reasoning | 2025.06 | Github | RL |
| GRIT: Teaching MLLMs to Think with Images | 2025.05 | Paper | SFT |
| DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning | 2025.05 | Paper | RL |
| Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | 2025.05 | Paper | RL |
| SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement | 2025.04 | Paper | RL |
| Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks | 2025.03 | Paper, Code | RL |
| Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning | 2025.03 | Paper | RL |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025.03 | Paper | RL |
| R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization | 2025.03 | Github, Huggingface | RL |
| MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-Based Reinforcement Learning | 2025.03 | Github, Huggingface | RL |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025.03 | Paper | RL |
| R1-Zero’s “Aha Moment” in Visual Reasoning on a 2B Non-SFT Model | 2025.03 | Paper, Blog, Code | RL |
| Visual-RFT: Visual Reinforcement Fine-Tuning | 2025.03 | Paper, Code | RL |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025.02 | Paper, Code | RLHF |
| MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | 2025.02 | Paper, Code | RL |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025.02 | Paper, Code | RLHF |
| Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step | 2025.01 | Paper, Code | RL |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | 2025.01 | Paper | RL |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | 2025.01 | Paper | RL |
| URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | 2025.01 | Paper | RL |
| Technical Report on Slow Thinking with LLMs: Visual Reasoning | 2025.01 | Paper | RL |
| Progressive Multimodal Reasoning via Active Retrieval | 2024.12 | Paper | RL |
| Diving into Self-Evolving Training for Multimodal Reasoning | 2024.12 | Paper | RL |
| TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action | 2024.12 | Paper | RL |
| Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension | 2024.12 | Paper | RL |
| MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | 2024.12 | Paper | SFT |
| Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | 2024.11 | Paper | RLHF |
| Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024.11 | Paper | RL |
| Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | 2024.11 | Paper | RL |
| AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning | 2024.11 | Paper | RL |
| LLaVA-o1: Let Vision Language Models Reason Step-by-Step | 2024.11 | Paper | SFT |
| Vision-Language Models Can Self-Improve Reasoning via Reflection | 2024.11 | Paper | RL |
| Improve Vision Language Model Chain-of-thought Reasoning | 2024.10 | Github, Huggingface | SFT & RL |
| Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models | 2024.03 | Paper | SFT |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | 2023.06 | Paper | SFT |

⬆️ Back to Top

🚀 3. Test-time Scaling

  • [2502] [Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking] (THU) Paper

  • [2502] [MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs] (USC) Paper

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2411] [Vision-Language Models Can Self-Improve Reasoning via Reflection] (NJU) Paper Code

  • [2402] [Scaffolding Coordinates to Promote Vision-Language Coordination in Large Multi-Modal Models] (THU) Paper Code

  • [2402] [V-STaR: Training Verifiers for Self-Taught Reasoners] (Mila, Universite de Montreal) Paper

⬆️ Back to Top

🚀 4. Collaborative Reasoning

These methods use small models (tools or visual experts) or multiple MLLMs to perform collaborative reasoning; a minimal control-flow sketch follows the paper list below.

  • [2412] [Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension] (University of Maryland) Paper

  • [2410] [VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use] (Dartmouth College) Paper

  • [2409] [Visual Agents as Fast and Slow Thinkers] (UCLA) Paper

  • [2406] [Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models] (University of Washington) Paper

  • [2312] [Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models] (Google Research) Paper

  • [2211] [Visual Programming: Compositional visual reasoning without training] (Allen Institute for AI) Paper
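
To make the shared control flow of these methods concrete, here is a minimal Python sketch of an MLLM planner that delegates sub-queries to a small visual expert and aggregates the returned observations into a final answer. It is only an illustration under generic assumptions: the names (`mllm_planner`, `visual_expert`, `ReasoningState`) and the toy stopping rule are hypothetical and do not reproduce the interface of any paper listed above.

```python
# Minimal, illustrative sketch of collaborative reasoning with a visual expert tool.
# Everything here is a hypothetical placeholder, not any specific paper's method.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReasoningState:
    question: str
    observations: list = field(default_factory=list)
    answer: Optional[str] = None


def visual_expert(image, query: str) -> str:
    """Stand-in for a small specialist model (detector, OCR, depth, ...)."""
    return f"[expert observation for: {query}]"


def mllm_planner(state: ReasoningState) -> dict:
    """Stand-in for the MLLM: decide whether to query an expert or answer."""
    if len(state.observations) < 2:  # toy policy: gather two observations first
        return {"action": "call_expert", "query": f"detail #{len(state.observations) + 1}"}
    return {"action": "answer", "text": "answer grounded in: " + "; ".join(state.observations)}


def collaborative_reasoning(image, question: str, max_steps: int = 5) -> str:
    """Alternate between the MLLM planner and the visual expert until done."""
    state = ReasoningState(question=question)
    for _ in range(max_steps):
        step = mllm_planner(state)
        if step["action"] == "call_expert":
            state.observations.append(visual_expert(image, step["query"]))
        else:
            state.answer = step["text"]
            break
    return state.answer or "no answer within the step budget"


if __name__ == "__main__":
    print(collaborative_reasoning(image=None, question="How many chairs are in the scene?"))
```

Real systems replace `visual_expert` with actual tools (detectors, OCR, sketchpads, program executors) and let the MLLM itself decide when enough evidence has been gathered.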

⬆️ Back to Top

💰 5. MLLM Reward Model

  • [2505] [R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning] Paper

  • [2503] [VisualPRM: An Effective Process Reward Model for Multimodal Reasoning] (Shanghai AI Lab) Paper Blog

  • [2503] [Unified Reward Model for Multimodal Understanding and Generation] (Shanghai AI Lab) Paper

  • [2502] [Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning] (University of California, Riverside) Paper

  • [2501] [InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model] (Shanghai AI Lab) Paper

  • [2410] [TLDR: Token-Level Detective Reward Model for Large Vision Language Models] (Meta) Paper

  • [2410] [Fine-Grained Verifiers: Preference Modeling as Next-Token Prediction in Vision-Language Alignment] (NUS) Paper

  • [2410] [LLaVA-Critic: Learning to Evaluate Multimodal Models] (ByteDance) Paper Code

⬆️ Back to Top

📊 6. Benchmarks

Benchmark Papers:

  • [2503] [CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation] (CMU) Paper

  • [2503] [reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs] (Meta) Paper

  • [2503] [How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game] (THU) Paper

  • [2502] [Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models] (FAIR) Paper Code

  • [2502] [ZeroBench: An Impossible* Visual Benchmark for Contemporary Large Multimodal Models] (University of Cambridge) Paper Code

  • [2502] [MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models] (Tencent Hunyuan Team) Paper Code

  • [2502] [MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency] (CUHK MMLab) Paper Code

  • [2410] [HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks] (CityU HK) Paper Homepage

  • [2406] [(CV-Bench)Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs] (NYU) Paper Code

  • [2404] [BLINK: Multimodal Large Language Models Can See but Not Perceive] (University of Pennsylvania) Paper

  • [2401] [(MMVP) Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs] (NYU) Paper

  • [2312] [(V∗Bench) V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs] (UCSD) Paper

⬆️ Back to Top

📦 7. Applications

  • [2503] [Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models] (Emory University) Paper

⬆️ Back to Top

🛠️ Open-Source Projects

  • [MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse] Code

  • [R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning] Code

  • [R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3] Code Report

  • [EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework] Code

  • [R1-Onevision: An Open-Source Multimodal Large Language Model Capable of Deep Reasoning] Paper Code

  • [LMM-R1] Code Paper

  • [VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model] Code

  • [Multi-modal Open R1] Code

  • [Video-R1: Towards Super Reasoning Ability in Video Understanding] Code

  • [Open-R1-Video] Code

  • [R1-Vision: Let's first take a look at the image] Code

⬆️ Back to Top

🤝 Contributing

You’re welcome to submit new resources or paper links; please open a Pull Request directly.
