
[Feature]: Multimodal Benchmarking Support (MMLM) #21887

@knlnguyen1802


🚀 The feature, motivation and pitch

1. Motivation

vLLM’s built-in benchmark currently reports:

  • TTFT – Time-to-First-Token
  • TPT – Tokens-Per-Second
  • ITL – Inter-Token-Latency

Those latency-centric numbers are perfect for pure-text LLMs, but they do not capture the quality or unique execution characteristics of multimodal large models (MMLMs) that take both text and images. Adding a fit-for-purpose multimodal benchmark would make vLLM even more valuable for researchers and practitioners.


2. What is missing right now?

  1. Dataset – No out-of-the-box multimodal test set that exercises image → text or text → image abilities.
  2. Metrics – Current numbers show speed only; they don’t answer “How well is the model performing on the task?”
  3. Evaluation harness – vLLM lacks a driver that loads multimodal samples, feeds them to the model, and aggregates both quality and latency into one report (see the sketch after this list).
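
As a rough illustration, a harness along these lines could tie the pieces together. This is only a sketch, not a proposed implementation: the model id, the LLaVA-style prompt template, and the `score_answer` helper are illustrative assumptions.

```python
import time

from vllm import LLM, SamplingParams


def score_answer(prediction: str, reference: str) -> float:
    """Toy exact-match quality score; a real harness would use task
    metrics (VQA accuracy, CIDEr, etc.)."""
    return float(reference.strip().lower() in prediction.strip().lower())


def run_benchmark(samples, model="llava-hf/llava-1.5-7b-hf"):
    # Each sample: {"image": PIL.Image, "question": str, "answer": str}
    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    latencies, scores = [], []
    for sample in samples:
        # LLaVA-style prompt template (assumed; varies per model)
        prompt = f"USER: <image>\n{sample['question']} ASSISTANT:"
        start = time.perf_counter()
        out = llm.generate(
            {"prompt": prompt, "multi_modal_data": {"image": sample["image"]}},
            params,
        )
        latencies.append(time.perf_counter() - start)
        scores.append(score_answer(out[0].outputs[0].text, sample["answer"]))
    # Aggregate quality and latency into one report, as proposed above
    return {
        "mean_latency_s": sum(latencies) / len(latencies),
        "mean_quality": sum(scores) / len(scores),
    }
```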

3. Proposed Solution

3.1 Candidate Benchmark Datasets
| Category  | Suggested dataset              | License      | Rationale                                 |
|-----------|--------------------------------|--------------|-------------------------------------------|
| VQA       | VQAv2, OK-VQA                  | CC-BY-4.0    | Classic image-to-text Q&A                 |
| Caption   | MS-COCO Captions               | CC-BY-4.0    | Widely used; automatic metrics available  |
| Reasoning | MMMU, MMBench, ScienceQA       | CC-BY-NC-SA  | Tests multi-hop visual + text reasoning   |
| Random    | ImageNet-1k, 5k random subset  | CC BY        | Stress-tests generic vision encoder paths |

(One small, redistributable subset per task is usually enough—e.g., ~2 k images total.)
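
If the subset route is taken, the slice could be built with the Hugging Face `datasets` library, roughly as below; the hub id `HuggingFaceM4/VQAv2` and the 2k sample count are illustrative assumptions.

```python
from datasets import load_dataset

# Pull the full validation split, then carve out a small fixed-seed subset
# (hub id and subset size are assumptions for illustration).
ds = load_dataset("HuggingFaceM4/VQAv2", split="validation")
subset = ds.shuffle(seed=0).select(range(2000))
subset.save_to_disk("vqa_v2_bench_subset")  # redistributable benchmark slice
```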

3.2 Suggested Metrics

  • ETI – Encode-to-Token Interval: wall-time from first image byte received to first generated token.
  • FPS-Enc – images processed per second (encoder throughput).
  • Continue to report TTFT, TPT, and ITL for apples-to-apples comparison with text-only runs (see the sketch below).
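
A minimal sketch of how the two new metrics could be derived from timestamps, assuming the benchmark records when the first image byte arrives, when the first token is emitted, and how long the vision encoder runs (all field names here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class MultimodalMetrics:
    image_received_ts: float   # when the first image byte arrived (s)
    first_token_ts: float      # when the first output token was emitted (s)
    num_images: int            # images pushed through the vision encoder
    encode_wall_time_s: float  # total encoder wall time (s)

    @property
    def eti_s(self) -> float:
        # ETI: first image byte received -> first generated token
        return self.first_token_ts - self.image_received_ts

    @property
    def fps_enc(self) -> float:
        # FPS-Enc: encoder throughput in images per second
        return self.num_images / self.encode_wall_time_s


m = MultimodalMetrics(image_received_ts=0.00, first_token_ts=0.42,
                      num_images=8, encode_wall_time_s=0.25)
print(f"ETI = {m.eti_s:.2f} s, FPS-Enc = {m.fps_enc:.1f} img/s")
```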

Note: a benchmark for speculative decoding already exists and supports custom datasets, but it is not user-friendly:
https://docs.google.com/document/d/1SbAnLNfCp04lHLJ_cF22IYc_StJ_U3jRRUSc3dQTxO0/edit?pli=1&tab=t.0

cc @DarkLight1337

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
