ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt
This repo is the official implementation of the paper:
ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt
Fanhu Zeng, Fei Zhu, Haiyang Guo, Xu-Yao Zhang, Cheng-Lin Liu
Keywords: Large multimodal models, Continual instruction tuning, Prompt learning, Parameter-efficient learning
- [2025.08.21] ModalPrompt has been accepted by EMNLP 2025 main conference! 🎉
- [2025.06.18] Check out our new survey: "Continual Learning for Generative AI: From LLMs to MLLMs and Beyond". We provide a systematic review of continual learning across mainstream generative models—including LLMs, MLLMs, Vision-Language Action Models, and Diffusion Models. Feel free to cite or open pull requests in Awesome-Continual-Learning-in-Generative-Models!
- [2025.04.12] We release the training and evaluation scripts for ModalPrompt. Try them now! 🔥
- [2024.10.08] ModalPrompt is available on arXiv. 🍬
Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by learning mixed instruction datasets. However, novel tasks are encountered sequentially in the dynamic world, which calls for equipping LMMs with multimodal continual instruction tuning (MCIT) ability, especially for diverse and challenging generative tasks. Existing MCIT methods do not fully exploit the unique attributes of LMMs and often gain performance at the expense of efficiency. In this paper, we propose a novel prompt learning framework for MCIT that effectively alleviates forgetting of previous knowledge while managing computational complexity with natural image-text supervision. Concretely, we learn prompts for each task and exploit efficient prompt fusion for knowledge transfer and prompt selection for complexity management with dual-modality guidance. Extensive experiments demonstrate that our approach achieves a substantial +14.26% performance gain on MCIT benchmarks with remarkable efficiency.
Following LLaVA, install the packages via the steps below:
- Clone this repository
git clone https://github.com/AuroraZengfh/ModalPrompt.git
cd ModalPrompt
- Install Package
conda create -n modalprompt python=3.10 -y
conda activate modalprompt
pip install --upgrade pip
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
Create a `models` folder and download the pre-trained checkpoints for LLaVA and CLIP-Large-Patch14-336.
Create a `datasets` folder and download the image datasets following the construction of CoIN.
Create an `instructions` folder and download the instructions from Huggingface.
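As a convenience, the sketch below shows one way to prepare these folders and fetch the model checkpoints with huggingface-cli; the local folder layout is an assumption, and the image datasets and instructions should still be obtained from the CoIN and Huggingface links above.
# Assumed local layout based on the steps above
mkdir -p models datasets instructions
# Download the base LMM and the vision encoder from the Hugging Face Hub
pip install -U "huggingface_hub[cli]"
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir models/llava-v1.5-7b
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir models/clip-vit-large-patch14-336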
Sequentially train a foundation model on the various task datasets in the form of prompt tuning, e.g., with llava-v1.5-7b:
sh scripts/ModalPrompt/Train/train_all.sh
Training checkpoints will be placed in `checkpoints/ModalPrompt`. Note that you should replace the paths in the scripts with your own paths.
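To find every path that needs to be replaced, one quick option (assuming the scripts reference the folder names used above) is to grep the training scripts:
# List hard-coded model/data/output paths in the training scripts
grep -rnE "models/|datasets/|instructions/|checkpoints/" scripts/ModalPrompt/Train/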
Evaluate the model at different stages of continual instruction tuning to obtain all the results for backward transfer. During evaluation, the stage, model path, current task, and total task have to be set.
sh scripts/ModalPrompt/Eval/eval_all.sh
Note that the current task is not a task identifier but the number of tasks trained so far. The total task is used for convenient initialization of all prompts at once. If the number of tasks in continual instruction learning is unknown, you can modify the code to remove the total-task parameter and initialize the corresponding prompts as each new task arrives.
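As a concrete illustration, a hypothetical invocation with these four parameters might look as follows; the variable names, checkpoint layout, and argument order are all assumptions, so check scripts/ModalPrompt/Eval/eval_all.sh for its actual interface (you may need to edit variables inside the script rather than pass arguments):
# Hypothetical example: evaluation after three training stages on an 8-task benchmark
STAGE=3                                            # which continual-learning stage to evaluate
CUR_TASK=3                                         # number of tasks trained so far (not a task identifier)
TOTAL_TASK=8                                       # total number of tasks, for initializing all prompts at once
MODEL_PATH=checkpoints/ModalPrompt/stage-${STAGE}  # assumed checkpoint layout; adjust to your runs
sh scripts/ModalPrompt/Eval/eval_all.sh "${STAGE}" "${MODEL_PATH}" "${CUR_TASK}" "${TOTAL_TASK}"  # argument order is an assumption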
- When implementing the codebase, we found that an initialization problem may occur during training. We have not found a proper solution; empirically, disabling lines 35-37 in llava/model/llava_arch.py resolves it. We welcome suggestions for a simpler fix that avoids switching the code between training and evaluation.
- Some of the hyper-parameters are fixed in our code; you can expose them in the parser if you want to search for the best performance automatically.
- In the ImageNet experiment, we slightly alter the instruction and add the choice of all category names to the prompt to avoid inaccurate descriptions.
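For illustration only (not the exact wording shipped in the instruction files), the modified ImageNet instruction takes a form such as: "What is the object in the image? Choose the answer from the following categories: tench, goldfish, great white shark, ..., toilet tissue. Answer with the category name directly."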
If you find this work useful, consider giving this repository a star ⭐ and citing 📑 our paper as follows:
@article{zeng2024modalprompt,
title={ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt},
author={Zeng, Fanhu and Zhu, Fei and Guo, Haiyang and Zhang, Xu-Yao and Liu, Cheng-Lin},
journal={arXiv preprint arXiv:2410.05849},
year={2024}
}
The code is based on LLaVA. Thanks for this great work and for open-sourcing it!
If you find it helpful, please consider citing it as well.