CVPR 2025 Multimodal Large Language Models Paper List
- LLaVA-Critic: Learning to Evaluate Multimodal Models Paper Page
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper Code
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression Paper Code
- BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Paper Code
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper Code
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Paper
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training Paper Code
- DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Paper Code
- ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models Paper
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Paper Code
- AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Paper Code
- ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models Paper
- Can Large Vision-Language Models Correct Grounding Errors By Themselves? Paper
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models Paper Page
- Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection Paper Code
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models Paper
- Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens Paper
- HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding Paper
- VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper Code
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models Paper Page
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper Page
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models Paper Page
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction Paper
- VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper Code
- Apollo: An Exploration of Video Understanding in Large Multi-Modal Models Paper Page
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper Code
- Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding Paper
- On the Consistency of Video Large Language Models in Temporal Comprehension Paper Code
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models Paper Code
- PAVE: Patching and Adapting Video Large Language Models Paper Code
- DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding Paper Code
- M-LLM Based Video Frame Selection for Efficient Video Understanding Paper
- Adaptive Keyframe Sampling for Long Video Understanding Paper Code
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper Code
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding Paper Code
- Unlocking Video-LLM via Agent-of-Thoughts Distillation Paper Page
- STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training Paper
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models Paper Code
- NVILA: Efficient Frontier Visual Language Models Paper Code
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Paper Code
- HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding Paper Page
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper Code
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper Code
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling Paper Code
- CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation Paper Code
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat Paper Code
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation Paper Code
- Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces Paper Code
- EventGPT: Event Stream Understanding with Multimodal Large Language Models Paper Code
- Task Preference Optimization: Improving Multimodal Large Language Models Performance with Vision Task Alignment Paper Code
- SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization Paper Code
- Continual SFT Matches Multimodal RLHF with Negative Supervision Paper
- Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key Paper Code
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment Paper Code
- Distraction is All You Need for Multimodal Large Language Model Jailbreaking Paper
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy Paper
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models Paper
- Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Paper Code
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs Paper Code
- Retrieval-Augmented Personalization for Multimodal Large Language Models Paper Code
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper Code
- MLVU: Benchmarking Multi-task Long Video Understanding Paper Code
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper Code
- MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts Paper Page
- VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper Code
- OVBench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper Code
- ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark Paper Code
- Localizing Events in Videos with Multimodal Queries Paper Code
- ICQ: Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning Paper Code
- VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding Paper Page
- VidComposition: Can MLLMs Analyze Compositions in Compiled Video? Paper Code
- Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly Paper
- VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Paper Page
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning Paper Code
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper Code
- Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation Paper Page
- VideoChat-Online: Towards Online Spatial-Temporal Video Understanding via Large Video Language Models Paper Page
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper Code