
Wang-Xiaodong1899/Awesome-Multimodal-Large-Language-Models


CVPR25-MLLM-Paper-List

CVPR 2025 Multimodal Large Language Models Paper List

📖 Table of Contents

Image LLMs

  • LLaVA-Critic: Learning to Evaluate Multimodal Models Paper Page
  • Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper Code
  • FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression Paper Code
  • BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Paper Code
  • Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper Code
  • Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Paper
  • Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training Paper Code
  • DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models Paper Code
  • ODE: Open-Set Evaluation of Hallucinations in Multimodal Large Language Models Paper
  • Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Paper Code
  • AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Paper Code
  • ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models Paper
  • Can Large Vision-Language Models Correct Grounding Errors By Themselves? Paper
  • Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models Paper Page
  • Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection Paper Code
  • HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models Paper
  • Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens Paper
  • HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding Paper
  • VoCo-LLaMA: Towards Vision Compression with Large Language Models Paper Code
  • Perception Tokens Enhance Visual Reasoning in Multimodal Language Models Paper Page
  • Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Paper Page
  • ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models Paper Page
  • Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction Paper
  • VisionZip: Longer is Better but Not Necessary in Vision Language Models Paper Code

Video LLMs

  • Apollo: An Exploration of Video Understanding in Large Multi-Modal Models Paper Page
  • VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper Code
  • Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding Paper
  • On the Consistency of Video Large Language Models in Temporal Comprehension Paper Code
  • DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models Paper Code
  • PAVE: Patching and Adapting Video Large Language Models Paper Code
  • DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding Paper Code
  • M-LLM Based Video Frame Selection for Efficient Video Understanding Paper
  • Adaptive Keyframe Sampling for Long Video Understanding Paper Code
  • VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper Code
  • LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding Paper Code
  • Unlocking Video-LLM via Agent-of-Thoughts Distillation Paper Page
  • STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training Paper
  • PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models Paper Code
  • NVILA: Efficient Frontier Visual Language Models Paper Code
  • Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Paper Code
  • HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding Paper Page

Unified LLMs

  • Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Paper Code
  • JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Paper Code
  • MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling Paper Code
  • CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation Paper Code
  • WeGen: A Unified Model for Interactive Multimodal Generation as We Chat Paper Code
  • SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Paper
  • TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation Paper Code

Other Modalities

  • Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces Paper Code
  • EventGPT: Event Stream Understanding with Multimodal Large Language Models Paper Code

Preference Optimization

  • Task Preference Optimization: Improving Multimodal Large Language Models Performance with Vision Task Alignment Paper Code
  • SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization Paper Code
  • Continual SFT Matches Multimodal RLHF with Negative Supervision Paper
  • Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key Paper Code

Jailbreak

  • Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment Paper Code
  • Distraction is All You Need for Multimodal Large Language Model Jailbreaking Paper
  • Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy Paper
  • Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models Paper
  • Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks Paper Code

Retrieval

  • GME: Improving Universal Multimodal Retrieval by Multimodal LLMs Paper Code
  • Retrieval-Augmented Personalization for Multimodal Large Language Models Paper Code

Benchmarks

  • Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper Code
  • MLVU: Benchmarking Multi-task Long Video Understanding Paper Code
  • MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper Code
  • MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts Paper Page
  • VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation Paper Code
  • OVBench: How Far is Your Video-LLMs from Real-World Online Video Understanding? Paper Code
  • ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark Paper Code
  • Localizing Events in Videos with Multimodal Queries Paper Code
  • ICQ: Is 'Right' Right? Enhancing Object Orientation Understanding in Multimodal Language Models through Egocentric Instruction Tuning Paper Code
  • VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding Paper Page
  • VidComposition: Can MLLMs Analyze Compositions in Compiled Video? Paper Code
  • Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly Paper
  • VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Paper Page
  • Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning Paper Code
  • OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation Paper Code
  • Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation Paper Page
  • VideoChat-Online: Towards Online Spatial-Temporal Video Understanding via Large Video Language Models Paper Page
  • VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper Code
