Learning to Inference Adaptively for Multimodal Large Language Models
Zhuoyan Xu*, Khoi Duc Nguyen*, Preeti Mukherjee, Saurabh Bagchi , Somali Chaterji, Yingyu Liang, Yin Li
*Equal Contribution
[Paper] [Project Page] [Model Zoo]
- [3/17/2025] 🔥 We released the AdaLLaVA evaluation code. We integrated the popular tools lmms-eval and LLM-Viewer to evaluate on various benchmarks while computing FLOPs, time, memory, etc. during evaluation.
- [3/17/2025] 🔥 We released AdaLLaVA. We propose a dynamic inference approach for multimodal Large Language Models that operates efficiently under resource constraints. Check out our paper.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
- Clone this repository and navigate to the folder
git clone https://github.com/zhuoyan-xu/AdaLLaVA.git
cd AdaLLaVA

- Create Environment
conda create -n adallava python=3.10 -y
conda activate adallava

- Install lmms_eval
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
git checkout 80391ce3bfb5a19b32e7a19a2d9399e1378ed2dd
pip install -e .
cd ..

- Install LLaVA
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install protobuf
pip install -e ".[train]"
pip install peft==0.13.2
pip install flash-attn==2.5.2 --no-build-isolation
cd ..

- Install adallava
pip install -e .

The example below runs inference with a pretrained AdaLLaVA checkpoint:

from src.adallava.eval.run_ada_llava import eval_model
model_path = "zhuoyanxu/ada-llava-L-v1.5-7b"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_name": 'ada_llava_llama',
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "latency": 1.0,
    "hardware": "nvidia_V100",
})()
eval_model(args)

Please check out our Model Zoo for all public AdaLLaVA checkpoints and instructions on how to use the weights.
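The latency field in the quick-start example above sets the fraction of the full compute budget (1.0 means no reduction). As a minimal sketch that reuses the variables from that example, you could sweep a few budgets on the same image and prompt; the specific budget values below are illustrative only:

```python
# Minimal sketch: rerun the quick-start example under several latency budgets.
# The budget values are illustrative; eval_model, model_path, prompt, and
# image_file are the same objects defined in the example above.
for budget in [1.0, 0.85, 0.7]:
    args = type('Args', (), {
        "model_path": model_path,
        "model_name": 'ada_llava_llama',
        "query": prompt,
        "conv_mode": None,
        "image_file": image_file,
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512,
        "latency": budget,          # fraction of the full compute budget
        "hardware": "nvidia_V100",
    })()
    print(f"=== latency budget: {budget} ===")
    eval_model(args)
```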
We follow the original LLaVA repository and use its stage-2 Visual Instruction Tuning data. See the Train section for details on preparing the dataset.
In AdaLLaVA, we evaluate models on existing benchmarks using the official lmms-eval toolkit to ensure reproducibility. We integrate lmms-eval with LLM-Viewer to compute FLOPs, time, and memory during evaluation.
To evaluate AdaLLaVA under different latency constraints, change the latency value in model_args. For example, to evaluate AdaLLaVA with an 85% latency constraint, run
python3 -m accelerate.commands.launch \
-m adallava.eval.run_lmms_eval \
--model adallava \
--model_args pretrained=zhuoyanxu/ada-llava-L-v1.5-7b,latency=0.85 \
--tasks mme,pope,mmbench_en_dev,scienceqa_img,textvqa_val \
--batch_size 1 \
--log_samples \
--log_samples_suffix adallava_0.85 \
--output_path ./logs_0.85/

For textvqa_val, please set OCR incorporation to True.
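To sweep several latency budgets without retyping the command, a small driver script can loop over latency values. This is a sketch under the assumption that all arguments other than latency and the log paths stay exactly as in the command above; the list of budgets and the directory naming are arbitrary choices:

```python
# Sketch: run the evaluation command above for several latency budgets.
# Everything except the latency value and the log paths mirrors the command
# shown above; the budget list and directory names are arbitrary.
import subprocess

for latency in ["1.0", "0.85", "0.75"]:
    cmd = [
        "python3", "-m", "accelerate.commands.launch",
        "-m", "adallava.eval.run_lmms_eval",
        "--model", "adallava",
        "--model_args", f"pretrained=zhuoyanxu/ada-llava-L-v1.5-7b,latency={latency}",
        "--tasks", "mme,pope,mmbench_en_dev,scienceqa_img,textvqa_val",
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", f"adallava_{latency}",
        "--output_path", f"./logs_{latency}/",
    ]
    subprocess.run(cmd, check=True)
```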
The result file contains the evaluation metric scores for each benchmark, along with the evaluation time, FLOPs, and memory. Example output for MME:
"mme": {
"alias": "mme",
"flops,flops": 7239070670529.51,
"flops_stderr,flops": "N/A",
"avg_flops,avg_flops": 3615471530841.294,
"avg_flops_stderr,avg_flops": "N/A",
"prefill_flops,prefill_flops": 7227678532923.026,
"prefill_flops_stderr,prefill_flops": "N/A",
"prefill_time,prefill_time": 0.06871909813745529,
"prefill_time_stderr,prefill_time": "N/A",
"memory_consumption,memory_consumption": 22598248812.697556,
"memory_consumption_stderr,memory_consumption": "N/A",
"prefill_memory_consumption,prefill_memory_consumption": 11342271151.824768,
"prefill_memory_consumption_stderr,prefill_memory_consumption": "N/A",
"mme_cognition_score,none": 324.6428571428571,
"mme_cognition_score_stderr,none": "N/A",
"mme_perception_score,none": 1487.19037615046,
"mme_perception_score_stderr,none": "N/A"
}
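To collect scores and cost statistics programmatically, you can read them back from the results file written under --output_path. The snippet below is a sketch that assumes the aggregated JSON has a top-level "results" field whose per-task entries follow the structure shown above; the file name is a placeholder, since lmms-eval timestamps its output files.

```python
# Sketch: pull benchmark scores and compute statistics from an lmms-eval
# results file. RESULTS_JSON is a placeholder; point it at the actual file
# written under --output_path (lmms-eval names it with a timestamp).
import json

RESULTS_JSON = "./logs_0.85/results.json"  # placeholder path

with open(RESULTS_JSON) as f:
    # Assumption: the aggregated file has a top-level "results" field keyed by
    # task, with per-task dictionaries like the MME example above.
    results = json.load(f)["results"]

for task, metrics in results.items():
    scores = {k: v for k, v in metrics.items()
              if not k.startswith(("flops", "avg_flops", "prefill", "memory"))
              and not k.endswith("_stderr,none") and k != "alias"}
    print(task, scores)
    print("  total FLOPs:", metrics.get("flops,flops"))
    print("  prefill time (s):", metrics.get("prefill_time,prefill_time"))
```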
Certain benchmarks require submitting results to an official server, such as VQAv2. Here we provide steps for evaluating on the VQAv2 test-dev split, following the same setting as LLaVA.

python3 -m accelerate.commands.launch \
-m adallava.eval.run_lmms_eval \
--model adallava \
--model_args pretrained=zhuoyanxu/ada-llava-L-v1.5-7b,latency=0.85 \
--tasks vqav2_test \
--batch_size 1 \
--log_samples \
--log_samples_suffix adallava_0.85 \
--output_path ./logs_0.85_vqav2/

For vqav2_test, please set the test_split to testdev. After running lmms-eval, convert the submission file with
python scripts/convert/convert_lmms_vqav2.py --src $SRC --dst $DST
where SRC is the JSON file under ./logs_0.85_vqav2/submission/ and DST is the path of the converted submission file.
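For example (the file names below are hypothetical; substitute the actual submission file produced by lmms-eval and an output path of your choice):

```python
# Hypothetical example of invoking the conversion script; adjust SRC to the
# actual submission file found under ./logs_0.85_vqav2/submission/.
import subprocess

SRC = "./logs_0.85_vqav2/submission/vqav2_test_submission.json"  # hypothetical name
DST = "./logs_0.85_vqav2/vqav2_testdev_upload.json"              # output path of your choice

subprocess.run(["python", "scripts/convert/convert_lmms_vqav2.py",
                "--src", SRC, "--dst", DST], check=True)
```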
Submit the converted submission file to VQA Challenge 2021.
Follow the instructions from here to download images from these 5 datasets for LLaVA v1.5 fine-tuning. Put the zip files in the corresponding folders and unzip them. The expected image layout:
LLaVA-Finetune
├── images
│   ├── coco
│   │   └── train2017
│   ├── gqa
│   │   └── images
│   ├── ocr_vqa
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   └── vg
│       ├── VG_100K
│       └── VG_100K_2
└── llava_v1_5_mix665k.json

Download the instruction tuning data from here into ./LLaVA-Finetune/llava_v1_5_mix665k.json.
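Before launching training, a quick sanity check that the data matches the tree above can save a failed run. This is a minimal sketch using only the paths listed above:

```python
# Sketch: verify that the fine-tuning data is laid out as shown above.
import os

ROOT = "./LLaVA-Finetune"
expected = [
    "images/coco/train2017",
    "images/gqa/images",
    "images/ocr_vqa/images",
    "images/textvqa/train_images",
    "images/vg/VG_100K",
    "images/vg/VG_100K_2",
    "llava_v1_5_mix665k.json",
]

for rel in expected:
    path = os.path.join(ROOT, rel)
    status = "OK     " if os.path.exists(path) else "MISSING"
    print(f"{status} {path}")
```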
Our training directly follows the original LLaVA repository's stage-2 Visual Instruction Tuning. We load a pretrained LLaVA-1.5 checkpoint and randomly initialize the weights of the scheduler.
Training script with DeepSpeed ZeRO-3: train_script.sh. Set the model_name_or_path parameter to the path of your pretrained LLaVA checkpoint, such as liuhaotian/llava-v1.5-7b. The trained AdaLLaVA model will be saved at the specified output_dir.
If you find AdaLLaVA useful for your research and applications, please cite using this BibTeX:
@InProceedings{xu2025adallava,
author = {Xu, Zhuoyan and Nguyen, Khoi Duc and Mukherjee, Preeti and Bagchi, Saurabh and Chaterji, Somali and Liang, Yingyu and Li, Yin},
title = {Learning to Inference Adaptively for Multimodal Large Language Models},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {3552-3563}
}
- LLaVA: The codebase we built upon.
- LLM-Viewer: The code we used for calculating FLOPs, prefill time, etc.
- lmms-eval: The code for evaluating multimodal LLMs.
