CVPR 2025 Multimodal Large Language Models Paper List
- LLaVA-Critic: Learning to Evaluate Multimodal Models Paper Page
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper Code
- FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression Paper Code
- BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Paper Code
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper Code
- Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Paper
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training Paper Code
- DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Paper Code
- ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models Paper
- Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Paper Code
- AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Paper Code
- ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models Paper
- Can Large Vision-Language Models Correct Grounding Errors By Themselves? Paper
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models Paper Page
- Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection Paper Code
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models Paper
- Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens Paper
- HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding Paper
- VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper Code
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models Paper Page
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper Page
- ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models Paper Page
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction Paper
- VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper Code
- Apollo: An Exploration of Video Understanding in Large Multi-Modal Models Paper Page
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper Code
- Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding Paper
- On the Consistency of Video Large Language Models in Temporal Comprehension Paper Code
- DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models Paper Code
- PAVE: Patching and Adapting Video Large Language Models Paper Code
- DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding Paper Code
- M-LLM Based Video Frame Selection for Efficient Video Understanding Paper
- Adaptive Keyframe Sampling for Long Video Understanding Paper Code
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper Code
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding Paper Code
- Unlocking Video-LLM via Agent-of-Thoughts Distillation Paper Page
- STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training Paper
- PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models Paper Code
- NVILA: Efficient Frontier Visual Language Models Paper Code
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Paper Code
- HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding Paper Page
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper Code
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper Code
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling Paper Code
- CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation Paper Code
- WeGen: A Unified Model for Interactive Multimodal Generation as We Chat Paper Code
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation Paper Code
- Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces Paper Code
- EventGPT: Event Stream Understanding with Multimodal Large Language Models Paper Code
- Task Preference Optimization: Improving Multimodal Large Language Models Performance with Vision Task Alignment Paper Code
- SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization Paper Code
- Continual SFT Matches Multimodal RLHF with Negative Supervision Paper
- Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key Paper Code
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment Paper Code
- Distraction is All You Need for Multimodal Large Language Model Jailbreaking Paper
- Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy Paper
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models Paper
- Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Paper Code
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs Paper Code
- Retrieval-Augmented Personalization for Multimodal Large Language Models Paper Code
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper Code
- MLVU: Benchmarking Multi-task Long Video Understanding Paper Code
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper Code
- MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts Paper Page
- VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper Code
- OVBench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper Code
- ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark Paper Code
- Localizing Events in Videos with Multimodal Queries Paper Code
- ICQ: Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning Paper Code
- VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding Paper Page
- VidComposition: Can MLLMs Analyze Compositions in Compiled Video? Paper Code
- Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly Paper
- VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Paper Page
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning Paper Code
- OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper Code
- Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation Paper Page
- VideoChat-Online: Towards Online Spatial-Temporal Video Understanding via Large Video Language Models Paper Page
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper Code