refactor: Integrate prefix-cache configuration into a single knob (#237)
Simplifies the configuration structure of the prefix-cache-scorer plugin by unifying all mode-specific parameters into a single configuration type. The prefix-cache-scorer now supports a mode option:
- When set to estimate (the default), it uses the GIE prefix cache scorer based on estimation from previous requests.
- When set to cache_tracking, it creates a prefix cache scorer based on KV-events from vLLM.
Signed-off-by: Kfir Toledo <[email protected]>
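As an illustration of the single-knob configuration described in the commit message, the sketch below resolves a raw parameter map into one configuration object and defaults `mode` to `estimate`. It is written in Python for brevity and is not the plugin's Go code: `PrefixCacheScorerConfig` and `parse_config` are hypothetical names, rejecting unknown modes is an assumption, and the parameter keys and default values are taken from the `docs/architecture.md` changes in this commit.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Modes named in the commit message.
VALID_MODES = ("estimate", "cache_tracking")


@dataclass
class PrefixCacheScorerConfig:
    """Hypothetical unified config; field names mirror the documented keys."""
    mode: str = "estimate"                      # documented default
    hash_block_size: int = 64                   # hashBlockSize
    max_prefix_blocks_to_match: int = 256       # maxPrefixBlocksToMatch
    lru_capacity_per_server: int = 31250        # lruCapacityPerServer
    kv_cache_redis_addr: Optional[str] = None   # kvCacheRedisAddr (cache_tracking)


def parse_config(raw: Dict[str, Any]) -> PrefixCacheScorerConfig:
    """Build one config object from raw plugin parameters.

    Omitting `mode` selects `estimate`. Rejecting unknown modes is an
    assumption of this sketch, not documented plugin behavior.
    """
    mode = raw.get("mode", "estimate")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return PrefixCacheScorerConfig(
        mode=mode,
        hash_block_size=int(raw.get("hashBlockSize", 64)),
        max_prefix_blocks_to_match=int(raw.get("maxPrefixBlocksToMatch", 256)),
        lru_capacity_per_server=int(raw.get("lruCapacityPerServer", 31250)),
        kv_cache_redis_addr=raw.get("kvCacheRedisAddr"),
    )
```

With this shape, `parse_config({})` yields an `estimate` configuration with the documented defaults, while passing `{"mode": "cache_tracking", "kvCacheRedisAddr": ...}` selects the KV-event-based scorer through the same type.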
docs/architecture.md (24 additions & 8 deletions)
@@ -211,14 +211,30 @@ with a value of `prefill`.<br>
 *Type:* prefill-filter<br>
 *Parameters:* None<br>
 
-**KvCacheAwareScorer**<br>
-Scores based on real KV-cache state on vLLM. It is more accurate than either the SessionAffinity
-or PrefixCachePlugin, but requires extra computation and cycles to track the current cache state<br>
-*Type:* kvcache-aware-scorer<br>
-*Parameters:* Due to the sensitivity of the parameters of this plugin, the following
-environment variables are used to configure the scorer:<br>
-`KVCACHE_INDEXER_REDIS_ADDR`- the address of the Redis server used<br>
-`HF_TOKEN`- the Hugginface token to be used.<br>
+**PrefixCacheScorer**<br>
+The `prefix-cache-scorer` scores a request based on the KV cache localities.
+It supports two modes: `estimate`and `cache_tracking`.<br>
+
+**`estimate` mode** (default):<br>
+This mode uses the default GIE prefix scorer and scores pods based on how much of the prompt is estimated to be present in the pod’s KV cache.<br>
+*Type*: `prefix-cache-scorer`<br>
+*Parameters:*<br>
+
+\- `hashBlockSize`: Specifies the size of the blocks used to split the input **prompt** when calculating block hashes. Defaults to `64` if not specified.<br>
+\- `maxPrefixBlocksToMatch`: Specifies the maximum number of prefix blocks to match. Defaults to `256` if not specified.<br>
+\- `lruCapacityPerServer`: Specifies the capacity of the LRU indexer, in number of entries per server (pod). Defaults to `31,250` if not specified.<br>
+
+**Note:** \- `mode: estimate` is not required, as it is the default.
+
+**`cache_tracking` mode**:<br>
+This mode scores requests based on the actual KV cache state in vLLM. It is more accurate than both `SessionAffinity` and `PrefixCachePlugin` in `estimate` mode,
+but incurs additional computation overhead to track the current cache state.<br>
+*Type*: `prefix-cache-scorer`<br>
+*Parameters:*<br>
+\- `mode: cache_tracking`<br>
+\- `kvCacheRedisAddr`: The address of the Redis instance used for cache tracking.
+Due to the sensitivity of this plugin’s parameters, the following environment variable is required when using `cache_tracking` mode:
+`HF_TOKEN`: The Hugging Face token to be used.
 
 **LoadAwareScorer**<br>
 Scores pods based on their load, based on the number of requests concurrently being processed.
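
To make the `estimate`-mode parameters in the diff above more concrete, here is a minimal Python sketch of prefix-block matching. It is not the GIE implementation: chaining SHA-256 over fixed-size character blocks, ignoring a trailing partial block, and scoring by the fraction of leading blocks found in a per-pod set are all assumptions made for illustration, and `prefix_block_hashes` and `estimate_score` are hypothetical names. The per-pod LRU bounded by `lruCapacityPerServer` is not modeled; a plain set stands in for it.

```python
import hashlib
from typing import List, Set


def prefix_block_hashes(prompt: str,
                        hash_block_size: int = 64,
                        max_prefix_blocks_to_match: int = 256) -> List[str]:
    """Split the prompt into fixed-size blocks and hash each one.

    Each hash also covers the previous block's hash, so equal hashes imply
    an equal prefix up to that block (assumed scheme, not the real one).
    """
    n_blocks = min(len(prompt) // hash_block_size, max_prefix_blocks_to_match)
    hashes: List[str] = []
    prev = ""
    for i in range(n_blocks):
        block = prompt[i * hash_block_size:(i + 1) * hash_block_size]
        prev = hashlib.sha256((prev + block).encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


def estimate_score(prompt_hashes: List[str], pod_cached_hashes: Set[str]) -> float:
    """Return the fraction of leading prompt blocks the pod is believed to hold."""
    if not prompt_hashes:
        return 0.0
    matched = 0
    for h in prompt_hashes:
        if h not in pod_cached_hashes:
            break  # a prefix match stops at the first missing block
        matched += 1
    return matched / len(prompt_hashes)
```

In this sketch a smaller `hashBlockSize` gives finer-grained matching at the cost of more hashes per request, while `maxPrefixBlocksToMatch` caps the work done for very long prompts.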