
[Feature]: Multimodal Benchmarking Support (MMLM) #21887

@knlnguyen1802


🚀 The feature, motivation and pitch

1. Motivation

vLLM’s built-in benchmark currently reports:

  • TTFT – Time-to-First-Token
  • TPT – Tokens-Per-Second
  • ITL – Inter-Token-Latency

Those latency-centric numbers are perfect for pure-text LLMs, but they do not capture the quality or unique execution characteristics of multimodal large models (MMLMs) that take both text and images. Adding a fit-for-purpose multimodal benchmark would make vLLM even more valuable for researchers and practitioners.


2. What is missing right now?

  1. Dataset – No out-of-the-box multimodal test set that exercises image → text or text → image abilities.
  2. Metrics – Current numbers show speed only; they don’t answer “How well is the model performing on the task?”
  3. Evaluation harness – vLLM lacks a driver that loads multimodal samples, feeds them to the model, and aggregates both quality and latency into one report (see the sketch after this list).
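
As a rough illustration, a harness along these lines could tie the pieces together. This is only a sketch, not a proposed implementation: the model id, the LLaVA-style prompt template, and the `score_answer` helper are illustrative assumptions.

```python
import time

from vllm import LLM, SamplingParams


def score_answer(prediction: str, reference: str) -> float:
    """Toy exact-match quality score; a real harness would use task
    metrics (VQA accuracy, CIDEr, etc.)."""
    return float(reference.strip().lower() in prediction.strip().lower())


def run_benchmark(samples, model="llava-hf/llava-1.5-7b-hf"):
    # Each sample: {"image": PIL.Image, "question": str, "answer": str}
    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    latencies, scores = [], []
    for sample in samples:
        # LLaVA-style prompt template (assumed; varies per model)
        prompt = f"USER: <image>\n{sample['question']} ASSISTANT:"
        start = time.perf_counter()
        out = llm.generate(
            {"prompt": prompt, "multi_modal_data": {"image": sample["image"]}},
            params,
        )
        latencies.append(time.perf_counter() - start)
        scores.append(score_answer(out[0].outputs[0].text, sample["answer"]))
    # Aggregate quality and latency into one report, as proposed above
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "mean_quality": sum(scores) / len(scores),
    }
```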

3. Proposed Solution

3.1 Candidate Benchmark Datasets
| Category  | Suggested dataset              | License      | Rationale                                 |
|-----------|--------------------------------|--------------|-------------------------------------------|
| VQA       | VQAv2, OK-VQA                  | CC-BY-4.0    | Classic image-to-text Q&A                 |
| Caption   | MS-COCO Captions               | CC-BY-4.0    | Widely used; automatic metrics available  |
| Reasoning | MMMU, MMBench, ScienceQA       | CC-BY-NC-SA  | Tests multi-hop visual + text reasoning   |
| Random    | ImageNet-1k, 5k random subset  | CC BY        | Stress-tests generic vision encoder paths |

(One small, redistributable subset per task is usually enough—e.g., ~2 k images total.)
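
If the subset route is taken, the slice could be built with the Hugging Face `datasets` library, roughly as below; the hub id `HuggingFaceM4/VQAv2` and the 2k sample count are illustrative assumptions.

```python
from datasets import load_dataset

# Pull the full validation split, then carve out a small fixed-seed subset
# (hub id and subset size are assumptions for illustration).
ds = load_dataset("HuggingFaceM4/VQAv2", split="validation")
subset = ds.shuffle(seed=0).select(range(2000))
subset.save_to_disk("vqa_v2_bench_subset")  # redistributable benchmark slice
```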

3.2 Suggested Metrics

  • ETI – Encode-to-Token Interval: wall-time from first image byte received to first generated token.
  • FPS-Enc – images processed per second (encoder throughput).
  • Continue to report TTFT, TPT, and ITL for apples-to-apples comparison with text-only runs (see the sketch below).
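
A minimal sketch of how the two new metrics could be derived from timestamps, assuming the benchmark records when the first image byte arrives, when the first token is emitted, and how long the vision encoder runs (all field names here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class MultimodalMetrics:
    image_received_ts: float   # when the first image byte arrived (s)
    first_token_ts: float      # when the first output token was emitted (s)
    num_images: int            # images pushed through the vision encoder
    encode_wall_time_s: float  # total encoder wall time (s)

    @property
    def eti_s(self) -> float:
        # ETI: first image byte received -> first generated token
        return self.first_token_ts - self.image_received_ts

    @property
    def fps_enc(self) -> float:
        # FPS-Enc: encoder throughput in images per second
        return self.num_images / self.encode_wall_time_s


m = MultimodalMetrics(image_received_ts=0.00, first_token_ts=0.42,
                      num_images=8, encode_wall_time_s=0.25)
print(f"ETI = {m.eti_s:.2f} s, FPS-Enc = {m.fps_enc:.1f} img/s")
```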

Note: a benchmark for speculative decoding already exists and supports custom datasets, but it is not user-friendly:
https://docs.google.com/document/d/1SbAnLNfCp04lHLJ_cF22IYc_StJ_U3jRRUSc3dQTxO0/edit?pli=1&tab=t.0

cc @DarkLight1337

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
