# Prefix Cache Aware Plugin Configuration

The [prefix cache plugin](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/7617439188b410670ed0f1ff805a3b7f9918a75b/pkg/epp/scheduling/framework/plugins/multi/prefix/plugin.go#L63)
takes advantage of the prefix caching feature of model servers (e.g., [vLLM APC](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)),
and optimizes request scheduling by placing requests that share the longest
prefixes on the same server as much as possible, while balancing server load by considering kv-cache
utilization and queue depth.

## Enable the prefix cache plugin

Currently, the prefix cache aware plugin is implemented in the V2 scheduler as an experimental feature.
To enable it, set the following environment variables when starting the EndpointPicker (EPP):

```
EXPERIMENTAL_USE_SCHEDULER_V2: true
ENABLE_PREFIX_CACHE_SCHEDULING: true
```
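
How you set these depends on how the EPP is deployed. As a minimal sketch, if the EPP runs as its own Kubernetes
Deployment, the variables can be patched onto it directly (the deployment name and namespace below are placeholders):

```
kubectl set env deployment/<epp-deployment> -n <epp-namespace> \
  EXPERIMENTAL_USE_SCHEDULER_V2=true \
  ENABLE_PREFIX_CACHE_SCHEDULING=true
```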

See the [Use Helm section](#helm) to install an inferencepool with the environment variables.

## Customize the prefix cache plugin

The prefix cache plugin exposes the following advanced configuration options via environment variables:

* `PREFIX_CACHE_HASH_BLOCK_SIZE`: The plugin matches prefixes in units of blocks. This is the size
of each block in bytes. The vLLM default block size is 16 tokens. Assuming 4 characters per token, the default
is set to 64 in EPP. The default is recommended unless performance is critical for use cases with
extremely long inputs.

* `PREFIX_CACHE_MAX_PREFIX_BLOCKS`: The maximum number of blocks used for prefix matching. The default is
128 (or 128*64=8192 characters, roughly 2048 tokens). This is useful to trade off prefix match accuracy
for performance.

* `PREFIX_CACHE_LRU_CAPACITY`: The maximum capacity of the prefix LRU indexer, in number of block hashes. A
detailed analysis of how to estimate this follows below.

    The prefix cache plugin estimates the prefix cache indexes held in the model servers' HBM. In the perfect
    scenario, EPP has exactly the same prefix cache entries per model server as that server's HBM cache entries. If
    the EPP cache is smaller than the HBM cache, a positive EPP cache match is more accurate, but there are more
    false cache misses. If the EPP cache is larger than the HBM cache, then there are more false cache hits.
    Therefore **the EPP prefix cache indexer size should be as close as possible to the HBM cache size.**

    NOTE: EPP builds its prefix cache based on characters, while the model server maintains prefix cache entries
    in tokens, so a character <-> token conversion is needed.

    Below are the formulas to estimate the EPP prefix indexer size:

    ```
    max_kv_tokens_per_server = (HBM_size - model_size) / kv_size_per_token
    lru_indexer_capacity_per_server = (max_kv_tokens_per_server * avg_chars_per_token) / prefix_indexer_hash_block_size
    lru_indexer_capacity_total = max_num_servers * lru_indexer_capacity_per_server
    ```

    Let's take an example:

    * Model: llama3 8B
    * Accelerator: Nvidia H100 80GB
    * Num replicas: 3
    * Estimated # characters per token: 4 ([source](https://genai.stackexchange.com/questions/34/how-long-is-a-token))

    ```
    max_kv_tokens_per_server = (80GB - 16GB) / 128KB = 500,000
    # assume avg_chars_per_token = 4, prefix_indexer_hash_block_size = 64 (default)
    # each entry is about 358 bytes, so the memory footprint is about 11 MB per server
    lru_indexer_capacity_per_server = 500,000*4/64 = 31250
    lru_indexer_capacity_total = 3 * 31250 = 93750
    ```
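
    The same arithmetic can be scripted when trying different models, accelerators, or replica counts. Below is a
    minimal sketch, assuming decimal units (1 GB = 10^9 bytes) to match the numbers above; the inputs are example
    estimates, not measured values:

    ```
    # Hypothetical sizing helper mirroring the formulas above (bash, integer arithmetic).
    hbm_gb=80               # HBM per accelerator (H100 80GB)
    model_gb=16             # llama3 8B at ~2 bytes per parameter
    kv_kb_per_token=128     # estimated kv-cache size per token
    avg_chars_per_token=4
    block_size=64           # PREFIX_CACHE_HASH_BLOCK_SIZE
    replicas=3

    max_kv_tokens_per_server=$(( (hbm_gb - model_gb) * 1000 * 1000 / kv_kb_per_token ))
    lru_indexer_capacity_per_server=$(( max_kv_tokens_per_server * avg_chars_per_token / block_size ))
    lru_indexer_capacity_total=$(( replicas * lru_indexer_capacity_per_server ))

    echo "PREFIX_CACHE_LRU_CAPACITY=${lru_indexer_capacity_total}"   # prints 93750 for these inputs
    ```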

See the [Use Helm section](#helm) to install an inferencepool with the environment variables.

<a id="helm"></a>
## Use Helm

Use the following reference command to install an inferencepool with the prefix
cache plugin environment variable configurations:

```txt
$ helm install triton-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  --set provider.name=[none|gke] \
  --set inferenceExtension.env.EXPERIMENTAL_USE_SCHEDULER_V2=true \
  --set inferenceExtension.env.ENABLE_PREFIX_CACHE_SCHEDULING=true \
  --set inferenceExtension.env.PREFIX_CACHE_LRU_CAPACITY=93750 \
  --set inferenceExtension.env.PREFIX_CACHE_MAX_PREFIX_BLOCKS=1024 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
```
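
After installation, you can verify that the environment variables were applied to the EPP deployment created by the
release (the deployment name below is a placeholder; substitute the one from your install):

```
kubectl get deployment <epp-deployment> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'
```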