- Pluggable **filters**, **scorers**, and **scrapers** for extensible routing
@@ -245,29 +245,14 @@ Filters out pods that are not marked as prefill. The filter looks for the label
 
 ---
 
-#### PrefixCacheScorer
+#### PrecisePrefixCacheScorer
 
-The `prefix-cache-scorer` scores a request based on KV-cache localities.
-It supports two modes: `estimate`and `cache_tracking`.
-
-##### `estimate` mode (default):
-
-This mode uses the default GIE prefix scorer and scores pods based on the estimated cache locality of the prompt.
-The estimation is based on scheduling history.
-
-- **Type**: `prefix-cache-scorer`
-- **Parameters**:
-  - `hashBlockSize`: Specifies the size of the blocks used to split the input **prompt** when calculating block hashes. Defaults to `64` if not specified.
-  - `maxPrefixBlocksToMatch`: Specifies the maximum number of prefix blocks to match. Defaults to `256` if not specified.
-  - `lruCapacityPerServer`: Specifies the capacity of the LRU indexer, in number of entries per server (pod). Defaults to `31,250` if not specified.
-
-**Note:** `mode: estimate` is not required, as it is the default.
-
-##### `cache_tracking` mode:
-
-This mode scores requests based on the actual KV-cache states across the vLLM instances.
-It is more accurate than both `SessionAffinity` and `PrefixCachePlugin` in `estimate` mode,
-but incurs additional computation overhead and KV-Events streaming to track the KV-cache states.
+The `precise-prefix-cache-scorer` scores a request based on KV-cache localities.
+Similarly to the IGW `prefix-cache-scorer`, it provides a score based on the number of
+matching KV-cache blocks between the request's prompt and the KV-cache contents of each pod.
+However, unlike the IGW `prefix-cache-scorer`, which relies on estimations based on scheduling history,
+the `precise-prefix-cache-scorer` tracks the real-time KV-cache states across the vLLM instances to
+provide more accurate scoring.
 
 When enabled, the scorer will use the `llm-d-kv-cache-manager` to track the KV-cache states
 across the vLLM instances. It will use the `kvcache.Indexer` to score the pods based on the
@@ -276,9 +261,8 @@ When enabled, the scorer will use the `llm-d-kv-cache-manager` to track the KV-c
 
 Configuration:
 
-- **Type**: `prefix-cache-scorer`
+- **Type**: `precise-prefix-cache-scorer`
 - **Parameters**:
-  - `mode: cache_tracking`
   - `indexerConfig`: Configuration for the `kvcache.Indexer`.
   - `kvEventsConfig`: Configuration for the `kvevents.Pool`.
 
@@ -294,7 +278,7 @@ Example configuration with the above parameters set:
 
 ```yaml
 plugins:
-  - type: prefix-cache-scorer
+  - type: precise-prefix-cache-scorer
     parameters:
       indexerConfig:
         tokenProcessorConfig:
@@ -310,7 +294,7 @@ Example configuration with all parameters set:
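
The scoring idea described in this change (counting matching KV-cache blocks between the request's prompt and each pod's cache contents) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the `llm-d-kv-cache-manager` or `kvcache.Indexer` API: the names `block_hashes` and `score_pods`, the chained SHA-256 hashing, the consecutive-prefix matching rule, and the normalization to the range 0 to 1 are all hypothetical, and the block size of 4 tokens is chosen only to keep the example small.

```python
import hashlib
from typing import Dict, List, Set


def block_hashes(tokens: List[int], block_size: int) -> List[str]:
    """Split tokens into full blocks and chain-hash them (hypothetical scheme).

    Each block's hash depends on the previous block's hash, so a hash
    identifies the whole prefix up to and including that block.
    A trailing partial block is ignored.
    """
    hashes: List[str] = []
    parent = ""
    full_length = len(tokens) - len(tokens) % block_size
    for start in range(0, full_length, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def score_pods(prompt_hashes: List[str],
               pod_blocks: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each pod by the fraction of leading prompt blocks it has cached."""
    scores: Dict[str, float] = {}
    for pod, cached in pod_blocks.items():
        matched = 0
        for block_hash in prompt_hashes:
            if block_hash not in cached:
                break  # only a consecutive prefix counts as a match
            matched += 1
        scores[pod] = matched / len(prompt_hashes) if prompt_hashes else 0.0
    return scores


# Usage: a 12-token prompt gives 3 blocks of 4 tokens each.
prompt = list(range(12))
hashes = block_hashes(prompt, 4)
pods = {
    "pod-a": set(hashes[:2]),        # first two blocks cached
    "pod-b": set(),                  # nothing cached
    "pod-c": {hashes[0], hashes[2]}, # gap at the second block
}
scores = score_pods(hashes, pods)
```

In this sketch `pod-a` scores highest because it holds the longest consecutive prefix. `pod-c` gets credit only for its first block, since the gap at the second block ends the match. In the real scorer, per-pod cache state would come from tracked KV-events rather than hard-coded sets.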