We release MissRAG, a novel multimodal Retrieval-Augmented Generation (RAG) framework developed to address the missing modality problem in Multimodal Large Language Models (MLLMs). MissRAG is capable of simultaneously handling three modalities and supports retrieval across all possible combinations of single and multiple input modalities. Additionally, the framework integrates modality-aware textual prompts that explicitly indicate missing inputs, thereby conditioning and guiding the generation process more effectively.
This repository includes all materials necessary to reproduce our framework across five diverse datasets—Music AVQA for audio-visual question answering, Valor and CharadesEGO for audio-visual captioning, MOSI and MOSEI for multimodal sentiment analysis—on three publicly available models, namely OneLLM, VideoLLaMA 2, and ChatBridge.
In real-world scenarios, multimodal systems often face the challenge of handling cases where certain data modalities are missing or incomplete. Such issues may arise due to a range of factors, including sensor failures, hardware constraints, privacy restrictions, environmental noise, and data transmission errors. Collectively, these challenges are referred to in the literature as the missing modality problem.
MissRAG is the first multimodal Retrieval-Augmented Generation (RAG) framework that addresses the missing modality problem in MLLMs. When one or more inputs are absent, it retrieves relevant modality data from a pool of training-set-derived prototypes by computing similarity scores between the available and missing modalities, enabling the model to perform as if all modalities were present.
Additionally, our multimodal RAG framework is empowered with a modality-aware prompt engineering strategy that explicitly informs the model of missing inputs and guides the generation process accordingly.
- First RAG framework to address the missing modality problem: We propose a novel retrieval-augmented approach specifically designed for handling missing modalities in multimodal large language models (MLLMs).
- Concurrently operates across three distinct modalities: Our framework is capable of processing and retrieving audio, visual, and textual inputs in all possible single and multi-modal combinations.
- Enhanced with the proposed prompt engineering strategy: The multimodal RAG system utilizes modality-aware textual prompts to explicitly inform the model of missing inputs and guide generation accordingly (a sketch of such a prompt follows this list).
- Effectively mitigates the missing modality problem in MLLMs: MissRAG alleviates the missing modality problem for MLLMs across a wide range of tasks involving audio-video and audio-video-text data.
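As an illustration of the modality-aware prompts mentioned above, the snippet below sketches how such a prompt could be composed. The template wording and the `build_modality_aware_prompt` helper are illustrative assumptions, not the exact prompts used in the paper.

```python
# Hypothetical modality-aware prompt template (illustrative wording, not the
# exact prompts used in the paper): the text explicitly tells the MLLM which
# modalities are missing, conditioning the generation accordingly.
def build_modality_aware_prompt(instruction, missing_modalities):
    if missing_modalities:
        note = (
            "Note: the following input modalities are missing and were replaced "
            "with retrieved content: " + ", ".join(missing_modalities) + ". "
        )
    else:
        note = ""
    return note + instruction

# Example: the audio track is missing for an audio-visual QA sample.
print(build_modality_aware_prompt("What instrument is being played?", ["audio"]))
```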
We evaluate MissRAG on three publicly available MLLMs capable of handling audio, video, and text modalities:
| Model | Size | Download |
| --- | --- | --- |
| OneLLM | 7B | link |
| ChatBridge | 13B | link |
| VideoLLaMA 2 | 7B | link |
Clone this repository into a local folder.
```bash
git clone https://github.com/aimagelab/MissRAG.git
cd MissRAG
```
Create a Python environment for the specific MLLM you want to evaluate and activate it.
OneLLM:
```bash
conda create -n onellm python=3.9
conda activate onellm
cd OneLLM
pip install -r requirements.txt
```

ChatBridge:
```bash
conda create -n chatbridge python=3.9
conda activate chatbridge
cd ChatBridge
pip install -r requirements.txt
```

VideoLLaMA 2:
```bash
conda create -n videollama2 python=3.9
conda activate videollama2
cd VideoLLaMA2
pip install -r requirements.txt
```
| Task | Dataset | Download |
| --- | --- | --- |
| Audio-visual question answering | Music AVQA | link |
| Audio-visual captioning | Valor | link |
| Audio-visual captioning | CharadesEGO | link |
| Audio-video-text sentiment analysis | MOSI | link |
| Audio-video-text sentiment analysis | MOSEI | link |
Create the pool of training-set-derived prototypes using ImageBind as the contrastive embedder. Please refer to the Evaluation Guide for more details on how to create the prototypes.
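As a reference, the snippet below sketches how such a pool could be built with ImageBind's public API (as documented in the official ImageBind repository). The paths, the use of key frames for the visual modality, and the one-embedding-per-training-sample pooling are illustrative assumptions; the exact recipe is described in the Evaluation Guide.

```python
# Hedged sketch of building the prototype pool with ImageBind (API as in the
# official facebookresearch/ImageBind repository). Paths and the one-embedding-
# per-training-sample pooling are illustrative assumptions.
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

image_paths = ["train/frames/sample_0001.jpg"]  # e.g. key frames of training videos
audio_paths = ["train/audio/sample_0001.wav"]

inputs = {
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)

# L2-normalized embeddings form the retrieval pool, one row per training sample.
pool = {k: torch.nn.functional.normalize(v, dim=-1).cpu() for k, v in embeddings.items()}
torch.save(pool, "prototype_pool.pt")
```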
MissRAG retrieves the top-k most similar prototypes from the previously constructed pool, using the available modalities as queries. In OneLLM and ChatBridge, modality tokens for audio and video inputs have a fixed length; therefore, to avoid redundant computation of these tokens for retrieved prototypes, we precompute them for the entire training set and store them in .h5 files. In contrast, VideoLLaMA 2 produces audio and video representations of variable length, which necessitates computing them at run time.
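The snippet below gives a minimal sketch of this retrieval step, assuming the pool was saved as in the previous snippet and that cosine similarity is computed in ImageBind's shared embedding space; the file name, `k`, and the specific query/missing-modality pair are illustrative.

```python
# Minimal sketch of the retrieval step, assuming the pool saved above and cosine
# similarity in ImageBind's shared space; k and the vision-query / missing-audio
# pairing are illustrative.
import torch

pool = torch.load("prototype_pool.pt")    # {"vision": [N, D], "audio": [N, D]}, L2-normalized

# Query with an available modality; here we reuse the first vision embedding of
# the pool as a stand-in for the embedding of the incoming video.
query = pool["vision"][0:1]               # [1, D]

# Cosine similarity against the prototypes of the missing (audio) modality.
scores = query @ pool["audio"].T          # [1, N]
top_k = scores.topk(k=min(3, scores.shape[-1]), dim=-1).indices.squeeze(0)

# The retrieved audio prototypes (or their precomputed modality tokens, stored in
# .h5 files for OneLLM and ChatBridge) stand in for the missing audio input.
retrieved_audio = pool["audio"][top_k]
```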
To apply our MissRAG framework to OneLLM, ChatBridge and VideoLLaMA 2, please consult the respective README files for detailed instructions.
Refer to the `answer_mapping/` folder to evaluate the answers generated by the MLLMs:
- `answer_mapping/eval_music_avqa.py` evaluates Music AVQA predictions;
- `answer_mapping/caption_eval.py` evaluates Valor and CharadesEGO captions;
- `answer_mapping/eval_MOSI_multiple_XOR.py` and `answer_mapping/eval_MOSEI_multiple_XOR.py` evaluate MOSI and MOSEI predictions.

Before running, set the path in each script to your result file.
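For example, once the result-file paths are set, the scripts can be launched from the repository root:

```bash
# Run from the repository root after setting the result-file path in each script.
python answer_mapping/eval_music_avqa.py
python answer_mapping/caption_eval.py
python answer_mapping/eval_MOSI_multiple_XOR.py
python answer_mapping/eval_MOSEI_multiple_XOR.py
```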
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
- OneLLM: One Framework to Align All Modalities with Language
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
- ImageBind: One Embedding Space To Bind Them All
- Microsoft COCO Caption Evaluation
If you find this code and paper useful for your research, please kindly cite our paper.
```bibtex
@inproceedings{2025ICCV_missrag,
  author={Pipoli, Vittorio and Saporita, Alessia and Bolelli, Federico and Cornia, Marcella and Baraldi, Lorenzo and Grana, Costantino and Cucchiara, Rita and Ficarra, Elisa},
  title={{MISSRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models}},
  booktitle={International Conference on Computer Vision},
  year={2025},
  month={Oct},
  pages={1--10},
  address={Honolulu, Hawaii},
  publisher={IEEE},
}
```