
Commit 5b22378

Update docs on prefix cache plugin related metrics (#1828)

* Update docs on prefix cache plugin related metrics
* Address comment

Parent: 3e930cb

2 files changed (+43, -32 lines)

docs/proposals/003-model-server-protocol/README.md (4 additions, 4 deletions)
@@ -24,13 +24,13 @@ Note the requirements here are aligned with the
 [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
 effort.
 
-The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
-into the reference endpoint picker implementation.
 
 | Metric | Type | Description | vLLM metric | Triton TensorRT-LLM | SGLang |
-| ----- | ---- | ---- | ---- | ---- | ---- |
+| ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
+| [Optional] BlockSize | Labeled | The block size in tokens used to allocate memory, used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
+| [Optional] NumGPUBlocks | Labeled | The total number of blocks in the HBM KV cache, used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
 
 
 ### LoRA Adapter Serving
@@ -60,4 +60,4 @@ The model server MUST expose the following LoRA adapter metrics via the same Prometheus
 Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
 the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
 To benefit from the optimal prefix aware request scheduling, model servers SHOULD support prefix
-cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
+cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
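For illustration, a scrape of a vLLM `/metrics` endpoint surfaces the two optional labeled metrics above as labels on the `vllm:cache_config_info` info-style gauge, roughly like the following (label values here are illustrative, and other labels on the metric are omitted):

```
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",num_gpu_blocks="27483"} 1.0
```

When this metric is present, EPP can read `block_size` and `num_gpu_blocks` from these labels; otherwise, as the table notes, the values are derived from the prefix plugin config.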

site-src/guides/epp-configuration/prefix-aware.md (39 additions, 28 deletions)
@@ -15,43 +15,54 @@ Like any other plugins, the prefix cache aware plugin can be enabled/disabled via
 The prefix cache plugin exposes the following advanced configuration parameters:
 
 * `blockSize`: The plugin matches prefixes in the unit of blocks. This is the size
-  of each block in bytes. vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
-  is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
-  extremely long inputs.
+  of each block in bytes. At runtime, EPP can dynamically fetch this information from the
+  inference engine metrics, so this config is only used when the metric is not available. In
+  vLLM, the metric name is `vllm:cache_config_info` and the metric label is `block_size`. See the
+  [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+  for more details.
+
+  vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
+  is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
+  extremely long inputs.
 
 * `maxPrefixBlocksToMatch`: The maximum number of blocks to find a prefix match. The default is
   256 (or 256*64=16384 characters, or roughly 4096 tokens). This is useful to trade off prefix match accuracy
   for performance.
 
-* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache in number of block hashes per server (pod). Below
-  shows a detailed analysis on how to estimate this.
+* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache in number of block hashes per server (pod).
+  Similar to `blockSize`, EPP can dynamically fetch this from the inference engine metrics endpoints.
+  In vLLM, the metric name is `vllm:cache_config_info` and the metric label is `num_gpu_blocks`. See the
+  [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+  for more details.
+
+  If the metric is not available, you can follow the guide below to estimate this.
 
-The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
-scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
-the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
-false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
-Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
+  The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
+  scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
+  the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
+  false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
+  Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
 
-NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
-in tokens, so a conversion between characters and tokens is needed.
+  NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
+  in tokens, so a conversion between characters and tokens is needed.
 
-Below are the formulas to estimate the EPP prefix indexer size:
+  Below are the formulas to estimate the EPP prefix indexer size:
 
-```
-max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
-lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
-```
+  ```
+  max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
+  lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
+  ```
 
-Let's take an example:
+  Let's take an example:
 
-* Model: llama3 8B
-* Accelerator: Nvidia H100 80GB
-* Num replicas: 3
-* Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
+  * Model: llama3 8B
+  * Accelerator: Nvidia H100 80GB
+  * Num replicas: 3
+  * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
 
-```
-max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
-# assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
-# each entry is about 358 bytes, so the memory footprint is about 11 MB per server
-lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31250
-```
+  ```
+  max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
+  # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
+  # each entry is about 358 bytes, so the memory footprint is about 11 MB per server
+  lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31250
+  ```
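Taken together, the three parameters above are set on the prefix cache plugin in the EPP configuration. A minimal sketch, assuming the `EndpointPickerConfig` plugin format from the prefix-aware guide linked above; the plugin `type` and parameter names here follow this doc's naming and should be confirmed against that guide:

```
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
  parameters:
    blockSize: 64               # characters per hash block; superseded by the engine's block_size metric label when available
    maxPrefixBlocksToMatch: 256 # cap on blocks hashed per request
    lruCapacityPerServer: 31250 # from the sizing example above; superseded by num_gpu_blocks when available
```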

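As a sanity check on the sizing example, here is a minimal Go sketch that re-runs the arithmetic from the guide; the hardware figures are the doc's illustrative values (H100 80GB, llama3 8B), not measurements:

```go
package main

import "fmt"

func main() {
	// Illustrative figures from the sizing example above.
	const (
		hbmBytes         = 80e9  // accelerator HBM (Nvidia H100 80GB)
		modelBytes       = 16e9  // model weights (llama3 8B, fp16)
		kvBytesPerToken  = 128e3 // KV cache bytes per token (~128KB)
		avgCharsPerToken = 4     // rough characters-per-token estimate
		blockSize        = 64    // EPP default hash block size, in characters
		bytesPerEntry    = 358   // approximate size of one LRU indexer entry
	)

	// max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
	maxKVTokensPerServer := (hbmBytes - modelBytes) / kvBytesPerToken
	// lru_indexer_capacity_per_server = (max_kv_tokens * avg_chars_per_token) / block_size
	lruCapacityPerServer := maxKVTokensPerServer * avgCharsPerToken / blockSize
	footprintMB := lruCapacityPerServer * bytesPerEntry / 1e6

	fmt.Printf("max_kv_tokens_per_server        = %.0f\n", maxKVTokensPerServer) // 500000
	fmt.Printf("lru_indexer_capacity_per_server = %.0f\n", lruCapacityPerServer) // 31250
	fmt.Printf("lru indexer footprint per server = ~%.0f MB\n", footprintMB)     // ~11 MB
}
```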