
Commit 5b22378

Update docs on prefix cache plugin related metrics (#1828)

* Update docs on prefix cache plugin related metrics
* Address comment

Parent: 3e930cb

2 files changed (+43, -32 lines)

docs/proposals/003-model-server-protocol/README.md (4 additions, 4 deletions)
@@ -24,13 +24,13 @@ Note the requirements here are aligned with the
 [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
 effort.
 
-The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
-into the reference endpoint picker implementation.
 
 | Metric | Type | Description | vLLM metric | Triton TensorRT-LLM | SGLang |
-| ----- | ---- | ---- | ---- | ---- | ---- |
+| ----- | ---- | ------------ | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`| `sglang:num_queue_reqs`
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`| `sglang:token_usage`
+| [Optional] BlockSize | Labeled | The block size in tokens used to allocate memory, used by the prefix cache scorer. If this metric is not available, the BlockSize will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `block_size`| |
+| [Optional] NumGPUBlocks | Labeled | The total number of blocks in the HBM KV cache, used by the prefix cache scorer. If this metric is not available, the NumGPUBlocks will be derived from the [prefix plugin config](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/#customize-the-prefix-cache-plugin).| name: `vllm:cache_config_info`, label name: `num_gpu_blocks`| |
 
 
 ### LoRA Adapter Serving
@@ -60,4 +60,4 @@ The model server MUST expose the following LoRA adapter metrics via the same Prometheus
 Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
 the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
 To benefit from the optimal prefix aware request scheduling, model servers SHOULD support prefix
-cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
+cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
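For illustration, a scrape of a vLLM `/metrics` endpoint surfaces the two optional labeled metrics above as labels on the `vllm:cache_config_info` info-style gauge, roughly like the following (label values here are illustrative, and other labels on the metric are omitted):

```
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",num_gpu_blocks="27483"} 1.0
```

When this metric is present, EPP can read `block_size` and `num_gpu_blocks` from these labels; otherwise, as the table notes, the values are derived from the prefix plugin config.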

site-src/guides/epp-configuration/prefix-aware.md (39 additions, 28 deletions)
@@ -15,43 +15,54 @@ Like any other plugins, the prefix cache aware plugin can be enabled/disabled via
 The prefix cache plugin exposes the following advanced configuration parameters:
 
 * `blockSize`: The plugin matches prefixes in the unit of blocks. This is the size
-  of each block in bytes. vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
-  is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
-  extremely long inputs.
+  of each block in bytes. At runtime, EPP can dynamically fetch this information from the
+  inference engine metrics, so this config is only used when the metric is not available. In
+  vLLM, the metric name is `vllm:cache_config_info` and the metric label is `block_size`. See the
+  [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+  for more details.
+
+  vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
+  is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
+  extremely long inputs.
 
 * `maxPrefixBlocksToMatch`: The maximum number of blocks to find a prefix match. The default is
   256 (or 256*64=16384 characters, or roughly 4096 tokens). This is useful to trade off prefix match accuracy
   for performance.
 
-* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache in number of block hashes per server (pod). Below
-  shows a detailed analysis on how to estimate this.
+* `lruCapacityPerServer`: Maximum capacity of the prefix LRU cache in number of block hashes per server (pod).
+  Similar to `blockSize`, EPP can dynamically fetch this from the inference engine metrics endpoints.
+  In vLLM, the metric name is `vllm:cache_config_info` and the metric label is `num_gpu_blocks`. See the
+  [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol)
+  for more details.
+
+  If the metric is not available, you can follow the guide below to estimate this.
 
-The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
-scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
-the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
-false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
-Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
+  The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
+  scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
+  the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
+  false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
+  Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**
 
-NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
-in tokens, so a conversion between characters and tokens is needed.
+  NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
+  in tokens, so a conversion between characters and tokens is needed.
 
-Below are the formulas to estimate the EPP prefix indexer size:
+  Below are the formulas to estimate the EPP prefix indexer size:
 
-```
-max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
-lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
-```
+  ```
+  max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
+  lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
+  ```
 
-Let's take an example:
+  Let's take an example:
 
-* Model: llama3 8B
-* Accelerator: Nvidia H100 80GB
-* Num replicas: 3
-* Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
+  * Model: llama3 8B
+  * Accelerator: Nvidia H100 80GB
+  * Num replicas: 3
+  * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))
 
-```
-max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
-# assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
-# each entry is about 358 bytes, so the memory footprint is about 11 MB per server
-lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31250
-```
+  ```
+  max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
+  # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
+  # each entry is about 358 bytes, so the memory footprint is about 11 MB per server
+  lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31250
+  ```
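Taken together, the three parameters above are set on the prefix cache plugin in the EPP configuration. A minimal sketch, assuming the `EndpointPickerConfig` plugin format from the prefix-aware guide linked above; the plugin `type` and parameter names here follow this doc's naming and should be confirmed against that guide:

```
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
- type: prefix-cache-scorer
  parameters:
    blockSize: 64               # characters per hash block; superseded by the engine's block_size metric label when available
    maxPrefixBlocksToMatch: 256 # cap on blocks hashed per request
    lruCapacityPerServer: 31250 # from the sizing example above; superseded by num_gpu_blocks when available
```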

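As a sanity check on the sizing example, here is a minimal Go sketch that re-runs the arithmetic from the guide; the hardware figures are the doc's illustrative values (H100 80GB, llama3 8B), not measurements:

```go
package main

import "fmt"

func main() {
	// Illustrative figures from the sizing example above.
	const (
		hbmBytes         = 80e9  // accelerator HBM (Nvidia H100 80GB)
		modelBytes       = 16e9  // model weights (llama3 8B, fp16)
		kvBytesPerToken  = 128e3 // KV cache bytes per token (~128KB)
		avgCharsPerToken = 4     // rough characters-per-token estimate
		blockSize        = 64    // EPP default hash block size, in characters
		bytesPerEntry    = 358   // approximate size of one LRU indexer entry
	)

	// max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
	maxKVTokensPerServer := (hbmBytes - modelBytes) / kvBytesPerToken
	// lru_indexer_capacity_per_server = (max_kv_tokens * avg_chars_per_token) / block_size
	lruCapacityPerServer := maxKVTokensPerServer * avgCharsPerToken / blockSize
	footprintMB := lruCapacityPerServer * bytesPerEntry / 1e6

	fmt.Printf("max_kv_tokens_per_server        = %.0f\n", maxKVTokensPerServer) // 500000
	fmt.Printf("lru_indexer_capacity_per_server = %.0f\n", lruCapacityPerServer) // 31250
	fmt.Printf("lru indexer footprint per server = ~%.0f MB\n", footprintMB)     // ~11 MB
}
```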