@@ -78,7 +78,8 @@ docker run --gpus all \ # Use all the GPUs on th
7878 vllm/vllm-openai:nightly \ # You can also use the `:latest` container or a specific release.
7979 --model Qwen/Qwen3-VL-235B-A22B-Instruct \ # Specifies the model for vLLM to deploy.
8080 --tensor-parallel-size 8 \ # 8-way tensor-parallel inference across 8 GPUs.
81- --limit-mm-per-prompt.video 0 # The input requests will contain images only (i.e., no videos).
81+ --limit-mm-per-prompt.video 0 \ # The input requests will contain images only (i.e., no videos).
82+ --no-enable-prefix-caching # Disable cross-query prefix caching to satisfy MLPerf Inference rules.
8283```
8384
8485### Run the benchmark for the Offline scenario
@@ -201,7 +202,8 @@ mlperf-inf-mm-q3vl benchmark vllm \
201202 ]
202203 }' \
203204 --vllm.cli=--limit-mm-per-prompt.video=0 \
204- --vllm.cli=--tensor-parallel-size=8
205+ --vllm.cli=--tensor-parallel-size=8 \
206+ --vllm.cli=--no-enable-prefix-caching
205207```
206208
207209## Slurm
@@ -232,6 +234,14 @@ bash submit.sh --help
232234> example scripts to the specific settings for the Slurm cluster that you are going
233235> to use, before you try to launch any jobs.
234236
237+ ## Prefix caching
238+
239+ According to the [rules of MLPerf Inference](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#94-llm-benchmarks),
240+ cross-query prefix caching is disallowed, while PagedAttention and continuous batching
241+ are allowed. This means that:
242+ - in vLLM, you must explicitly set `--no-enable-prefix-caching`;
243+ - in SGLang, you must explicitly set `--disable-radix-cache`.
244+
235245## Reference Implementation Specification
236246
237247- v6.0 Round
@@ -271,6 +281,7 @@ bash submit.sh --help
271281 the host memory, which takes ~6.39 GB).
272282 - Testing duration $\ge$ 10 mins.
273283 - Sample concatenation permutation is enabled.
284+ - You must explicitly set `--no-enable-prefix-caching` for vLLM.
274285
275286## Plugin System for `mlperf-inf-mm-q3vl benchmark`
276287
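
The "Prefix caching" section added in this change requires an explicit flag per serving engine. As a rough illustration only (not part of the change itself), the rule could be checked against a launch command with a small helper like the one below. The function name `disables_prefix_caching` and the `REQUIRED_FLAGS` mapping are hypothetical; the two flags and the `--vllm.cli=` pass-through prefix are taken from the diff above. The sketch assumes the command is given as a single line without shell line continuations.

```python
import shlex

# Flags named in the README change above. The mapping itself is a
# hypothetical helper structure, not something the repository provides.
REQUIRED_FLAGS = {
    "vllm": "--no-enable-prefix-caching",
    "sglang": "--disable-radix-cache",
}

# Prefix used in the diff to forward options to vLLM through the
# mlperf-inf-mm-q3vl CLI, e.g. --vllm.cli=--tensor-parallel-size=8.
VLLM_PASSTHROUGH_PREFIX = "--vllm.cli="


def disables_prefix_caching(engine: str, command: str) -> bool:
    """Return True if `command` explicitly passes the flag required for `engine`.

    Assumes `command` is a single-line shell command; tokens are split with
    shlex and the vLLM pass-through prefix is stripped before comparison.
    """
    required = REQUIRED_FLAGS[engine]
    tokens = [
        token.removeprefix(VLLM_PASSTHROUGH_PREFIX)
        for token in shlex.split(command)
    ]
    return required in tokens
```

For example, `disables_prefix_caching("vllm", "vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct --tensor-parallel-size 8")` returns `False`, because the flag is not set explicitly. A check of this kind only inspects the command line; it does not confirm what the running server actually does.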