Commit 92f8c75

Clarify that prefix caching is not allowed per MLPerf rules.
1 parent 7651400

File tree

2 files changed (+14, -2 lines)

multimodal/qwen3-vl/README.md

Lines changed: 13 additions & 2 deletions
@@ -78,7 +78,8 @@ docker run --gpus all \ # Use all the GPUs on the host.
 vllm/vllm-openai:nightly \ # You can also use the `:latest` container or a specific release.
 --model Qwen/Qwen3-VL-235B-A22B-Instruct \ # Specifies the model for vLLM to deploy.
 --tensor-parallel-size 8 \ # 8-way tensor-parallel inference across 8 GPUs.
---limit-mm-per-prompt.video 0 # The input requests will contain images only (i.e., no videos).
+--limit-mm-per-prompt.video 0 \ # The input requests will contain images only (i.e., no videos).
+--no-enable-prefix-caching # Disable cross-query prefix caching to satisfy MLPerf Inference rules.
 ```
 
 ### Run the benchmark for the Offline scenario
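The patched command launches vLLM's OpenAI-compatible server. As a quick smoke test that the server accepts image requests, something like the following should work (a hedged sketch: the default port 8000 and the image URL are assumptions, not part of the commit):

```bash
# Smoke test for the server started above (sketch; port and image URL
# are assumptions). Sends one image-plus-text chat completion request.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
        "messages": [{"role": "user", "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
          {"type": "text", "text": "Describe this image in one sentence."}
        ]}]
      }'
```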
@@ -201,7 +202,8 @@ mlperf-inf-mm-q3vl benchmark vllm \
 ]
 }' \
 --vllm.cli=--limit-mm-per-prompt.video=0 \
---vllm.cli=--tensor-parallel-size=8
+--vllm.cli=--tensor-parallel-size=8 \
+--vllm.cli=--no-enable-prefix-caching
 ```
 
 ## Slurm
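Each `--vllm.cli=` option appears to be forwarded verbatim to the underlying vLLM server, so the MLPerf-mandated flag rides along with the rest of the engine configuration. A minimal sketch of the pattern (the dataset/request arguments shown in the README are elided here):

```bash
# Sketch of the pass-through pattern: engine flags are supplied to the
# benchmark harness one --vllm.cli at a time (other arguments elided).
mlperf-inf-mm-q3vl benchmark vllm \
  --vllm.cli=--tensor-parallel-size=8 \
  --vllm.cli=--no-enable-prefix-caching
```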
@@ -232,6 +234,14 @@ bash submit.sh --help
 > example scripts to the specific settings for the Slurm cluster that you are going
 > to use, before you try to launch any jobs.
 
+## Prefix caching
+
+According to the [rules of MLPerf Inference](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#94-llm-benchmarks),
+cross-query prefix caching is disallowed, while PagedAttention and continuous batching
+are allowed. This means that:
+- in vLLM, you must explicitly set `--no-enable-prefix-caching`;
+- in SGLang, you must explicitly set `--disable-radix-cache`.
+
 ## Reference Implementation Specification
 
 - v6.0 Round
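The new section also names the SGLang equivalent. A hedged sketch of serving the same model with the radix cache off (`--disable-radix-cache` comes from the README; the tensor-parallel flag mirrors the 8-GPU vLLM setup and is an assumption here):

```bash
# Sketch: SGLang server with cross-query caching disabled per MLPerf rules.
# Only --disable-radix-cache is prescribed by the README; --tp 8 mirrors
# the vLLM example and is illustrative.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tp 8 \
  --disable-radix-cache
```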
@@ -271,6 +281,7 @@ bash submit.sh --help
 the host memory, which takes ~6.39 GB).
 - Testing duration $\ge$ 10 mins.
 - Sample concatenation permutation is enabled.
+- You must explicitly set `--no-enable-prefix-caching` for vLLM.
 
 ## Plugin System for `mlperf-inf-mm-q3vl benchmark`
 
multimodal/qwen3-vl/scripts/slurm/benchmark.sh

Lines changed: 1 addition & 0 deletions
@@ -26,4 +26,5 @@ srun \
 --vllm.cli=--max-model-len=32768 \
 --vllm.cli=--limit-mm-per-prompt.video=0 \
 --vllm.cli=--tensor-parallel-size="${TENSOR_PARALLEL_SIZE}" \
+--vllm.cli=--no-enable-prefix-caching \
 --settings.logging.log_output.outdir="${OUTPUT_CONTAINER_DIR}"/"${SLURM_JOB_ID}"
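The script reads `TENSOR_PARALLEL_SIZE` and `OUTPUT_CONTAINER_DIR` from its environment. A hedged sketch of a direct submission (the values and the use of `sbatch` are assumptions; the repo's `submit.sh` wrapper is the documented entry point, and settings are site-specific):

```bash
# Sketch only: exports the variables benchmark.sh expects, then submits it.
# Values are illustrative; adapt to your cluster as the README warns.
export TENSOR_PARALLEL_SIZE=8
export OUTPUT_CONTAINER_DIR=/results
sbatch multimodal/qwen3-vl/scripts/slurm/benchmark.sh
```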
