Learning to Inference Adaptively for Multimodal Large Language Models
Zhuoyan Xu*, Khoi Duc Nguyen*, Preeti Mukherjee, Saurabh Bagchi , Somali Chaterji, Yingyu Liang, Yin Li
*Equal Contribution
[Paper] [Project Page] [Model Zoo]
- [3/17/2025] 🔥 We released the AdaLLaVA evaluation code. We integrated the popular tools lmms-eval and LLM-Viewer to evaluate on various benchmarks while computing FLOPs, time, memory, etc. during evaluation.
- [3/17/2025] 🔥 We released AdaLLaVA. We propose a dynamic inference approach for multimodal Large Language Models that operates efficiently under resource constraints. Check out our paper.
Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.
- Clone this repository and navigate to the folder
git clone https://github.com/zhuoyan-xu/AdaLLaVA.git
cd AdaLLaVA

- Create Environment
conda create -n adallava python=3.10 -y
conda activate adallava

- Install lmms_eval
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
git checkout 80391ce3bfb5a19b32e7a19a2d9399e1378ed2dd
pip install -e .
cd ..

- Install LLaVA
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install protobuf
pip install -e ".[train]"
pip install peft==0.13.2
pip install flash-attn==2.5.2 --no-build-isolation
cd ..

- Install adallava
pip install -e .

The example below runs inference with a pretrained AdaLLaVA checkpoint:

from src.adallava.eval.run_ada_llava import eval_model
model_path = "zhuoyanxu/ada-llava-L-v1.5-7b"
prompt = "What are the things I should be cautious about when I visit here?"
image_file = "https://llava-vl.github.io/static/images/view.jpg"
args = type('Args', (), {
    "model_path": model_path,
    "model_name": 'ada_llava_llama',
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
    "latency": 1.0,
    "hardware": "nvidia_V100",
})()
eval_model(args)

Please check out our Model Zoo for all public AdaLLaVA checkpoints and instructions on how to use the weights.
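The latency field in the quick-start example above sets the fraction of the full compute budget (1.0 means no reduction). As a minimal sketch that reuses the variables from that example, you could sweep a few budgets on the same image and prompt; the specific budget values below are illustrative only:

```python
# Minimal sketch: rerun the quick-start example under several latency budgets.
# The budget values are illustrative; eval_model, model_path, prompt, and
# image_file are the same objects defined in the example above.
for budget in [1.0, 0.85, 0.7]:
    args = type('Args', (), {
        "model_path": model_path,
        "model_name": 'ada_llava_llama',
        "query": prompt,
        "conv_mode": None,
        "image_file": image_file,
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512,
        "latency": budget,          # fraction of the full compute budget
        "hardware": "nvidia_V100",
    })()
    print(f"=== latency budget: {budget} ===")
    eval_model(args)
```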
We follow the original LLaVA repository and use its stage-2 Visual Instruction Tuning data. See the Train section for details on preparing the dataset.
In AdaLLaVA, we evaluate models on existing benchmarks using the official lmms-eval toolkit to ensure reproducibility. We integrate lmms-eval with LLM-Viewer to compute FLOPs, time, and memory during evaluation.
To evaluate AdaLLaVA under different latency constraints, change the latency value in model_args. For example, to evaluate AdaLLaVA with an 85% latency constraint, run
python3 -m accelerate.commands.launch \
-m adallava.eval.run_lmms_eval \
--model adallava \
--model_args pretrained=zhuoyanxu/ada-llava-L-v1.5-7b,latency=0.85 \
--tasks mme,pope,mmbench_en_dev,scienceqa_img,textvqa_val \
--batch_size 1 \
--log_samples \
--log_samples_suffix adallava_0.85 \
--output_path ./logs_0.85/

For textvqa_val, please set OCR incorporation to True.
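To sweep several latency budgets without retyping the command, a small driver script can loop over latency values. This is a sketch under the assumption that all arguments other than latency and the log paths stay exactly as in the command above; the list of budgets and the directory naming are arbitrary choices:

```python
# Sketch: run the evaluation command above for several latency budgets.
# Everything except the latency value and the log paths mirrors the command
# shown above; the budget list and directory names are arbitrary.
import subprocess

for latency in ["1.0", "0.85", "0.75"]:
    cmd = [
        "python3", "-m", "accelerate.commands.launch",
        "-m", "adallava.eval.run_lmms_eval",
        "--model", "adallava",
        "--model_args", f"pretrained=zhuoyanxu/ada-llava-L-v1.5-7b,latency={latency}",
        "--tasks", "mme,pope,mmbench_en_dev,scienceqa_img,textvqa_val",
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", f"adallava_{latency}",
        "--output_path", f"./logs_{latency}/",
    ]
    subprocess.run(cmd, check=True)
```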
The result file contains the evaluation metric scores for each benchmark, along with the evaluation time, FLOPs, and memory. Example output for MME:
"mme": {
"alias": "mme",
"flops,flops": 7239070670529.51,
"flops_stderr,flops": "N/A",
"avg_flops,avg_flops": 3615471530841.294,
"avg_flops_stderr,avg_flops": "N/A",
"prefill_flops,prefill_flops": 7227678532923.026,
"prefill_flops_stderr,prefill_flops": "N/A",
"prefill_time,prefill_time": 0.06871909813745529,
"prefill_time_stderr,prefill_time": "N/A",
"memory_consumption,memory_consumption": 22598248812.697556,
"memory_consumption_stderr,memory_consumption": "N/A",
"prefill_memory_consumption,prefill_memory_consumption": 11342271151.824768,
"prefill_memory_consumption_stderr,prefill_memory_consumption": "N/A",
"mme_cognition_score,none": 324.6428571428571,
"mme_cognition_score_stderr,none": "N/A",
"mme_perception_score,none": 1487.19037615046,
"mme_perception_score_stderr,none": "N/A"
}
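To collect scores and cost statistics programmatically, you can read them back from the results file written under --output_path. The snippet below is a sketch that assumes the aggregated JSON has a top-level "results" field whose per-task entries follow the structure shown above; the file name is a placeholder, since lmms-eval timestamps its output files.

```python
# Sketch: pull benchmark scores and compute statistics from an lmms-eval
# results file. RESULTS_JSON is a placeholder; point it at the actual file
# written under --output_path (lmms-eval names it with a timestamp).
import json

RESULTS_JSON = "./logs_0.85/results.json"  # placeholder path

with open(RESULTS_JSON) as f:
    # Assumption: the aggregated file has a top-level "results" field keyed by
    # task, with per-task dictionaries like the MME example above.
    results = json.load(f)["results"]

for task, metrics in results.items():
    scores = {k: v for k, v in metrics.items()
              if not k.startswith(("flops", "avg_flops", "prefill", "memory"))
              and not k.endswith("_stderr,none") and k != "alias"}
    print(task, scores)
    print("  total FLOPs:", metrics.get("flops,flops"))
    print("  prefill time (s):", metrics.get("prefill_time,prefill_time"))
```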
Certain benchmarks require submitting results to an official server, such as VQAv2. Here we provide steps for evaluating on the VQAv2 test-dev split, following the same setting as LLaVA.

python3 -m accelerate.commands.launch \
-m adallava.eval.run_lmms_eval \
--model adallava \
--model_args pretrained=zhuoyanxu/ada-llava-L-v1.5-7b,latency=0.85 \
--tasks vqav2_test \
--batch_size 1 \
--log_samples \
--log_samples_suffix adallava_0.85 \
--output_path ./logs_0.85_vqav2/

For vqav2_test, please set the test_split to testdev. After running lmms-eval, convert the submission file with
python scripts/convert/convert_lmms_vqav2.py --src $SRC --dst $DST
where SRC is the JSON file under ./logs_0.85_vqav2/submission/ and DST is the path of the converted submission file.
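For example (the file names below are hypothetical; substitute the actual submission file produced by lmms-eval and an output path of your choice):

```python
# Hypothetical example of invoking the conversion script; adjust SRC to the
# actual submission file found under ./logs_0.85_vqav2/submission/.
import subprocess

SRC = "./logs_0.85_vqav2/submission/vqav2_test_submission.json"  # hypothetical name
DST = "./logs_0.85_vqav2/vqav2_testdev_upload.json"              # output path of your choice

subprocess.run(["python", "scripts/convert/convert_lmms_vqav2.py",
                "--src", SRC, "--dst", DST], check=True)
```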
Submit the converted submission file to VQA Challenge 2021.
Follow the instructions from here to download images from these 5 datasets for LLaVA v1.5 fine-tuning. Put the zip files in the corresponding folders and unzip them. The expected image layout:
LLaVA-Finetune
├── images
│   ├── coco
│   │   └── train2017
│   ├── gqa
│   │   └── images
│   ├── ocr_vqa
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   └── vg
│       ├── VG_100K
│       └── VG_100K_2
└── llava_v1_5_mix665k.json

Download the instruction tuning data from here into ./LLaVA-Finetune/llava_v1_5_mix665k.json.
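Before launching training, a quick sanity check that the data matches the tree above can save a failed run. This is a minimal sketch using only the paths listed above:

```python
# Sketch: verify that the fine-tuning data is laid out as shown above.
import os

ROOT = "./LLaVA-Finetune"
expected = [
    "images/coco/train2017",
    "images/gqa/images",
    "images/ocr_vqa/images",
    "images/textvqa/train_images",
    "images/vg/VG_100K",
    "images/vg/VG_100K_2",
    "llava_v1_5_mix665k.json",
]

for rel in expected:
    path = os.path.join(ROOT, rel)
    status = "OK     " if os.path.exists(path) else "MISSING"
    print(f"{status} {path}")
```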
Our training directly follows the original LLaVA repository's stage-2 Visual Instruction Tuning. We load a pretrained LLaVA-1.5 checkpoint and randomly initialize the weights of the scheduler.
Training script with DeepSpeed ZeRO-3: train_script.sh. Set the model_name_or_path parameter to the path of your pretrained LLaVA checkpoint, such as liuhaotian/llava-v1.5-7b. The trained AdaLLaVA model will be saved at the specified output_dir.
If you find AdaLLaVA useful for your research and applications, please cite using this BibTeX:
@InProceedings{xu2025adallava,
author = {Xu, Zhuoyan and Nguyen, Khoi Duc and Mukherjee, Preeti and Bagchi, Saurabh and Chaterji, Somali and Liang, Yingyu and Li, Yin},
title = {Learning to Inference Adaptively for Multimodal Large Language Models},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {3552-3563}
}
- LLaVA: The codebase we built upon.
- LLM-Viewer: The code we used for calculating FLOPs, prefill time, etc.
- lmms-eval: The code for evaluating multimodal LLMs.
