Commit 92f8c75

Clarify that prefix caching is not allowed per MLPerf rules.
1 parent 7651400

File tree

2 files changed (+14, -2 lines)

multimodal/qwen3-vl/README.md

Lines changed: 13 additions & 2 deletions
@@ -78,7 +78,8 @@ docker run --gpus all \ # Use all the GPUs on the host.
 vllm/vllm-openai:nightly \ # You can also use the `:latest` container or a specific release.
 --model Qwen/Qwen3-VL-235B-A22B-Instruct \ # Specifies the model for vLLM to deploy.
 --tensor-parallel-size 8 \ # 8-way tensor-parallel inference across 8 GPUs.
---limit-mm-per-prompt.video 0 # The input requests will contain images only (i.e., no videos).
+--limit-mm-per-prompt.video 0 \ # The input requests will contain images only (i.e., no videos).
+--no-enable-prefix-caching # Disable cross-query prefix caching to satisfy MLPerf Inference rules.
 ```
 
 ### Run the benchmark for the Offline scenario
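The patched command launches vLLM's OpenAI-compatible server. As a quick smoke test that the server accepts image requests, something like the following should work (a hedged sketch: the default port 8000 and the image URL are assumptions, not part of the commit):

```bash
# Smoke test for the server started above (sketch; port and image URL
# are assumptions). Sends one image-plus-text chat completion request.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-VL-235B-A22B-Instruct",
        "messages": [{"role": "user", "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
          {"type": "text", "text": "Describe this image in one sentence."}
        ]}]
      }'
```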
@@ -201,7 +202,8 @@ mlperf-inf-mm-q3vl benchmark vllm \
 ]
 }' \
 --vllm.cli=--limit-mm-per-prompt.video=0 \
---vllm.cli=--tensor-parallel-size=8
+--vllm.cli=--tensor-parallel-size=8 \
+--vllm.cli=--no-enable-prefix-caching
 ```
 
 ## Slurm
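Each `--vllm.cli=` option appears to be forwarded verbatim to the underlying vLLM server, so the MLPerf-mandated flag rides along with the rest of the engine configuration. A minimal sketch of the pattern (the dataset/request arguments shown in the README are elided here):

```bash
# Sketch of the pass-through pattern: engine flags are supplied to the
# benchmark harness one --vllm.cli at a time (other arguments elided).
mlperf-inf-mm-q3vl benchmark vllm \
  --vllm.cli=--tensor-parallel-size=8 \
  --vllm.cli=--no-enable-prefix-caching
```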
@@ -232,6 +234,14 @@ bash submit.sh --help
 > example scripts to the specific settings for the Slurm cluster that you are going
 > to use, before you try to launch any jobs.
 
+## Prefix caching
+
+According to the [rules of MLPerf Inference](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#94-llm-benchmarks),
+cross-query prefix caching is disallowed, while PagedAttention and continuous batching
+are allowed. This means that:
+- in vLLM, you must explicitly set `--no-enable-prefix-caching`;
+- in SGLang, you must explicitly set `--disable-radix-cache`.
+
 ## Reference Implementation Specification
 
 - v6.0 Round
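The new section also names the SGLang equivalent. A hedged sketch of serving the same model with the radix cache off (`--disable-radix-cache` comes from the README; the tensor-parallel flag mirrors the 8-GPU vLLM setup and is an assumption here):

```bash
# Sketch: SGLang server with cross-query caching disabled per MLPerf rules.
# Only --disable-radix-cache is prescribed by the README; --tp 8 mirrors
# the vLLM example and is illustrative.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tp 8 \
  --disable-radix-cache
```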
@@ -271,6 +281,7 @@ bash submit.sh --help
 the host memory, which takes ~6.39 GB).
 - Testing duration $\ge$ 10 mins.
 - Sample concatenation permutation is enabled.
+- You must explicitly set `--no-enable-prefix-caching` for vLLM.
 
 ## Plugin System for `mlperf-inf-mm-q3vl benchmark`
 
multimodal/qwen3-vl/scripts/slurm/benchmark.sh

Lines changed: 1 addition & 0 deletions
@@ -26,4 +26,5 @@ srun \
 --vllm.cli=--max-model-len=32768 \
 --vllm.cli=--limit-mm-per-prompt.video=0 \
 --vllm.cli=--tensor-parallel-size="${TENSOR_PARALLEL_SIZE}" \
+--vllm.cli=--no-enable-prefix-caching \
 --settings.logging.log_output.outdir="${OUTPUT_CONTAINER_DIR}"/"${SLURM_JOB_ID}"
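The script reads `TENSOR_PARALLEL_SIZE` and `OUTPUT_CONTAINER_DIR` from its environment. A hedged sketch of a direct submission (the values and the use of `sbatch` are assumptions; the repo's `submit.sh` wrapper is the documented entry point, and settings are site-specific):

```bash
# Sketch only: exports the variables benchmark.sh expects, then submits it.
# Values are illustrative; adapt to your cluster as the README warns.
export TENSOR_PARALLEL_SIZE=8
export OUTPUT_CONTAINER_DIR=/results
sbatch multimodal/qwen3-vl/scripts/slurm/benchmark.sh
```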
