
Commit b62931e

Add prefix cache plugin configuration guide (#923)
1 parent c7fa41f commit b62931e

File tree: 2 files changed (+91, -0 lines)

mkdocs.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -68,6 +68,8 @@ nav:
     - Adapter Rollout: guides/adapter-rollout.md
     - InferencePool Rollout: guides/inferencepool-rollout.md
     - Metrics: guides/metrics.md
+    - Configuration Guide:
+      - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
     - Implementer's Guide: guides/implementers.md
     - Performance:
       - Benchmark: performance/benchmark/index.md
```
guides/epp-configuration/prefix-aware.md

Lines changed: 89 additions & 0 deletions

# Prefix Cache Aware Plugin Configuration

The [prefix cache plugin](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/7617439188b410670ed0f1ff805a3b7f9918a75b/pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go#L63)
takes advantage of the prefix caching feature of model servers (e.g.,
[vLLM APC](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)), and optimizes
request scheduling by placing requests that share the longest prefixes on the same server as much
as possible, while balancing server load by considering KV-cache utilization and queue depth.

## Enable the prefix cache plugin

Currently, the prefix cache aware plugin is implemented in the V2 scheduler as an experimental
feature. To enable it, set the following environment variables when starting the EndpointPicker (EPP):
13+
14+
```
15+
EXPERIMENTAL_USE_SCHEDULER_V2: true
16+
ENABLE_PREFIX_CACHE_SCHEDULING: true
17+
```

See the [Use Helm section](#helm) to install an InferencePool with the environment variables.
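
Alternatively, if the EPP is already deployed, the same variables can be set directly on its
Deployment with `kubectl set env`. The snippet below is an illustrative sketch; the Deployment name
is a placeholder for whatever your install created, not something defined by this guide.

```bash
# Placeholder name; substitute the EPP Deployment created by your install.
kubectl set env deployment/<epp-deployment-name> \
  EXPERIMENTAL_USE_SCHEDULER_V2=true \
  ENABLE_PREFIX_CACHE_SCHEDULING=true
```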

## Customize the prefix cache plugin

The prefix cache plugin exposes the following advanced configuration options via environment variables:

* `PREFIX_CACHE_HASH_BLOCK_SIZE`: The plugin matches prefixes in units of blocks. This is the size
  of each block in number of bytes. The vLLM default block size is 16 tokens. Assuming 4 characters
  per token, the EPP default is set to 64. The default is recommended unless performance is critical
  for use cases with extremely long inputs.

* `PREFIX_CACHE_MAX_PREFIX_BLOCKS`: The maximum number of blocks to match when looking for a shared
  prefix. The default is 128 (or 128*64=8192 characters, roughly 2048 tokens). This is useful to
  trade off prefix match accuracy for performance.

* `PREFIX_CACHE_LRU_CAPACITY`: The maximum capacity of the prefix LRU indexer, in number of block
  hashes. A detailed analysis of how to estimate this value follows below.
37+
38+
The prefix cache plugin estimates the prefix cache indexes in model server HBMs. In the perfect
39+
scenario, EPP has the exact same prefix cache entries per model server as their HBM cache entries. If
40+
the EPP cache is smaller than HBM cache, a positive EPP cache match is more accurate, but there are more
41+
false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
42+
Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**

NOTE: The EPP builds its prefix cache based on characters, while the model server maintains prefix
cache entries in tokens, so a character <-> token conversion is needed.
46+
47+
Below are the formulas to estimate the EPP prefix indexer size:

```
max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
lru_indexer_capacity_total = max_num_servers * lru_indexer_capacity_per_server
```

Let's take an example:

* Model: llama3 8B
* Accelerator: Nvidia H100 80GB
* Num replicas: 3
* Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))

```
max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
# assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
# each entry is about 358 bytes, so the memory footprint is about 11 MB per server
lru_indexer_capacity_per_server = 500,000 * 4 / 64 = 31250
lru_indexer_capacity_total = 3 * 31250 = 93750
```
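
For convenience, the same arithmetic can be scripted. The snippet below is an illustrative sketch,
not part of the project; all names are local shell variables, and the sizes mirror the example
above (decimal units, as in the guide's arithmetic):

```bash
# Illustrative sketch: estimate PREFIX_CACHE_LRU_CAPACITY for the example above.
HBM_SIZE=$((80 * 1000**3))         # H100: 80GB
MODEL_SIZE=$((16 * 1000**3))       # llama3 8B weights, ~16GB at fp16
KV_SIZE_PER_TOKEN=$((128 * 1000))  # 128KB of KV cache per token
AVG_CHARS_PER_TOKEN=4
HASH_BLOCK_SIZE=64                 # PREFIX_CACHE_HASH_BLOCK_SIZE default
NUM_SERVERS=3

MAX_KV_TOKENS=$(( (HBM_SIZE - MODEL_SIZE) / KV_SIZE_PER_TOKEN ))             # 500000
CAP_PER_SERVER=$(( MAX_KV_TOKENS * AVG_CHARS_PER_TOKEN / HASH_BLOCK_SIZE ))  # 31250
echo "PREFIX_CACHE_LRU_CAPACITY=$(( NUM_SERVERS * CAP_PER_SERVER ))"         # 93750
```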

See the [Use Helm section](#helm) to install an InferencePool with the environment variables.

<a id="helm"></a>
## Use Helm

Use the following reference command to install an InferencePool with the prefix cache plugin
environment variable configuration:

```txt
$ helm install triton-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  --set provider.name=[none|gke] \
  --set inferenceExtension.env.EXPERIMENTAL_USE_SCHEDULER_V2=true \
  --set inferenceExtension.env.ENABLE_PREFIX_CACHE_SCHEDULING=true \
  --set inferenceExtension.env.PREFIX_CACHE_LRU_CAPACITY=93750 \
  --set inferenceExtension.env.PREFIX_CACHE_MAX_PREFIX_BLOCKS=1024 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```
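
After installing, one way to double-check that the variables reached the EPP container is to read
them back from the Deployment spec; the Deployment name below is again a placeholder for whatever
your release created:

```bash
# Placeholder name; use the EPP Deployment created by your release.
kubectl get deployment <epp-deployment-name> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```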
