We release MissRAG, a novel multimodal Retrieval-Augmented Generation (RAG) framework developed to address the missing modality problem in Multimodal Large Language Models (MLLMs). MissRAG is capable of simultaneously handling three modalities and supports retrieval across all possible combinations of single and multiple input modalities. Additionally, the framework integrates modality-aware textual prompts that explicitly indicate missing inputs, thereby conditioning and guiding the generation process more effectively.
This repository includes all materials necessary to reproduce our framework across five diverse datasets—Music AVQA for audio-visual question answering, Valor and CharadesEGO for audio-visual captioning, MOSI and MOSEI for multimodal sentiment analysis—on three publicly available models, namely OneLLM, VideoLLaMA 2, and ChatBridge.
In real-world scenarios, multimodal systems often face the challenge of handling cases where certain data modalities are missing or incomplete. Such issues may arise due to a range of factors, including sensor failures, hardware constraints, privacy restrictions, environmental noise, and data transmission errors. Collectively, these challenges are referred to in the literature as the missing modality problem.
MissRAG is the first multimodal Retrieval-Augmented Generation (RAG) framework that addresses the missing modality problem in MLLMs. When one or more inputs are absent, it retrieves relevant modality data from a pool of training-set-derived prototypes by computing similarity scores between the available and missing modalities, enabling the model to perform as if all modalities were present.
Additionally, our multimodal RAG framework is empowered with a modality-aware prompt engineering strategy that explicitly informs the model of missing inputs and guides the generation process accordingly.
- First RAG framework to address the missing modality problem: We propose a novel retrieval-augmented approach specifically designed for handling missing modalities in multimodal large language models (MLLMs).
- Concurrently operates across three distinct modalities: Our framework is capable of processing and retrieving audio, visual, and textual inputs in all possible single and multi-modal combinations.
- Enhanced with the proposed prompt engineering strategy: The multimodal RAG system utilizes modality-aware textual prompts to explicitly inform the model of missing inputs and guide generation accordingly (a sketch of such a prompt follows this list).
- Effectively mitigates the missing modality problem in MLLMs: MissRAG alleviates the missing modality problem for MLLMs across a wide range of tasks involving audio-video and audio-video-text data.
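As an illustration of the modality-aware prompts mentioned above, the snippet below sketches how such a prompt could be composed. The template wording and the `build_modality_aware_prompt` helper are illustrative assumptions, not the exact prompts used in the paper.

```python
# Hypothetical modality-aware prompt template (illustrative wording, not the
# exact prompts used in the paper): the text explicitly tells the MLLM which
# modalities are missing, conditioning the generation accordingly.
def build_modality_aware_prompt(instruction, missing_modalities):
    if missing_modalities:
        note = (
            "Note: the following input modalities are missing and were replaced "
            "with retrieved content: " + ", ".join(missing_modalities) + ". "
        )
    else:
        note = ""
    return note + instruction

# Example: the audio track is missing for an audio-visual QA sample.
print(build_modality_aware_prompt("What instrument is being played?", ["audio"]))
```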
We evaluate MissRAG on three publicly available MLLMs capable of handling audio, video, and text modalities:
| Model | Size | Download |
| --- | --- | --- |
| OneLLM | 7B | link |
| ChatBridge | 13B | link |
| VideoLLaMA 2 | 7B | link |
Clone this repository into a local folder.
```bash
git clone https://github.com/aimagelab/MissRAG.git
cd MissRAG
```
Create a Python environment for the specific MLLM you want to evaluate and activate it.
OneLLM:
```bash
conda create -n onellm python=3.9
conda activate onellm
cd OneLLM
pip install -r requirements.txt
```

ChatBridge:
```bash
conda create -n chatbridge python=3.9
conda activate chatbridge
cd ChatBridge
pip install -r requirements.txt
```

VideoLLaMA 2:
```bash
conda create -n videollama2 python=3.9
conda activate videollama2
cd VideoLLaMA2
pip install -r requirements.txt
```
| Task | Dataset | Download |
| --- | --- | --- |
| Audio-visual question answering | Music AVQA | link |
| Audio-visual captioning | Valor | link |
| Audio-visual captioning | CharadesEGO | link |
| Audio-video-text sentiment analysis | MOSI | link |
| Audio-video-text sentiment analysis | MOSEI | link |
Create the pool of training-set-derived prototypes using ImageBind as the contrastive embedder. Please refer to the Evaluation Guide for more details on how to create the prototypes.
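As a reference, the snippet below sketches how such a pool could be built with ImageBind's public API (as documented in the official ImageBind repository). The paths, the use of key frames for the visual modality, and the one-embedding-per-training-sample pooling are illustrative assumptions; the exact recipe is described in the Evaluation Guide.

```python
# Hedged sketch of building the prototype pool with ImageBind (API as in the
# official facebookresearch/ImageBind repository). Paths and the one-embedding-
# per-training-sample pooling are illustrative assumptions.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

image_paths = ["train/frames/sample_0001.jpg"]  # e.g. key frames of training videos
audio_paths = ["train/audio/sample_0001.wav"]

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# L2-normalized embeddings form the retrieval pool, one row per training sample.
pool = {k: torch.nn.functional.normalize(v, dim=-1).cpu() for k, v in embeddings.items()}
torch.save(pool, "prototype_pool.pt")
```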
MissRAG retrieves the top-k most similar prototypes from the previously constructed pool, using the available modalities as queries. In OneLLM and ChatBridge, modality tokens for audio and video inputs have a fixed length; therefore, to avoid redundant computation of these tokens for retrieved prototypes, we precompute them for the entire training set and store them in .h5 files. In contrast, VideoLLaMA 2 produces audio and video representations of variable length, which necessitates computing them at run time.
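The snippet below gives a minimal sketch of this retrieval step, assuming the pool was saved as in the previous snippet and that cosine similarity is computed in ImageBind's shared embedding space; the file name, `k`, and the specific query/missing-modality pair are illustrative.

```python
# Minimal sketch of the retrieval step, assuming the pool saved above and cosine
# similarity in ImageBind's shared space; k and the vision-query / missing-audio
# pairing are illustrative.
import torch

pool = torch.load("prototype_pool.pt")    # {"vision": [N, D], "audio": [N, D]}, L2-normalized

# Query with an available modality; here we reuse the first vision embedding of
# the pool as a stand-in for the embedding of the incoming video.
query = pool["vision"][0:1]               # [1, D]

# Cosine similarity against the prototypes of the missing (audio) modality.
scores = query @ pool["audio"].T          # [1, N]
top_k = scores.topk(k=min(3, scores.shape[-1]), dim=-1).indices.squeeze(0)

# The retrieved audio prototypes (or their precomputed modality tokens, stored in
# .h5 files for OneLLM and ChatBridge) stand in for the missing audio input.
retrieved_audio = pool["audio"][top_k]
```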
To apply our MissRAG framework to OneLLM, ChatBridge and VideoLLaMA 2, please consult the respective README files for detailed instructions.
Refer to the `answer_mapping/` folder to evaluate the answers generated by the MLLMs:
- `answer_mapping/eval_music_avqa.py` evaluates Music AVQA predictions;
- `answer_mapping/caption_eval.py` evaluates Valor and CharadesEGO captions;
- `answer_mapping/eval_MOSI_multiple_XOR.py` and `answer_mapping/eval_MOSEI_multiple_XOR.py` evaluate MOSI and MOSEI predictions.

Before running, set the path in each script to your result file.
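For example, once the result-file paths are set, the scripts can be launched from the repository root:

```bash
# Run from the repository root after setting the result-file path in each script.
python answer_mapping/eval_music_avqa.py
python answer_mapping/caption_eval.py
python answer_mapping/eval_MOSI_multiple_XOR.py
python answer_mapping/eval_MOSEI_multiple_XOR.py
```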
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
- OneLLM: One Framework to Align All Modalities with Language
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
- ImageBind: One Embedding Space To Bind Them All
- Microsoft COCO Caption Evaluation
If you find this code and paper useful for your research, please kindly cite our paper.
```bibtex
@inproceedings{2025ICCV_missrag,
  author={Pipoli, Vittorio and Saporita, Alessia and Bolelli, Federico and Cornia, Marcella and Baraldi, Lorenzo and Grana, Costantino and Cucchiara, Rita and Ficarra, Elisa},
  title={{MISSRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models}},
  booktitle={International Conference on Computer Vision},
  year={2025},
  month={Oct},
  pages={1--10},
  address={Honolulu, Hawaii},
  publisher={IEEE},
}
```