What would you like to be added:
Currently, the prefix-cache-scorer plugin relies on manually set parameters: `hashBlockSize` and `lruCapacityPerServer`. These values are used to estimate the KV cache behavior of the downstream LLM serving engine.
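For context, a rough sketch of what the manual configuration looks like today. Only the two parameter names come from this issue; the `apiVersion`/`kind`, surrounding structure, and the example values are assumptions for illustration:

```yaml
# Hypothetical sketch of the current manual configuration for the
# prefix-cache-scorer plugin. Only hashBlockSize and lruCapacityPerServer
# are taken from this issue; the surrounding structure and values are assumed.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
    parameters:
      hashBlockSize: 64            # currently kept in sync with the engine's block size by hand
      lruCapacityPerServer: 31250  # currently estimated per model/GPU by hand
```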
vLLM provides its cache config info in Prometheus metrics:
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="None",gpu_memory_utilization="0.9",is_attention_free="False",num_cpu_blocks="9362",num_gpu_blocks="55477",num_gpu_blocks_override="None",prefix_caching_hash_algo="builtin",sliding_window="None",swap_space="4",swap_space_bytes="4294967296"} 1.0
So I wonder if it would be possible to scrape these parameters from the engine's metrics instead.
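A minimal Go sketch of what such scraping could look like, assuming the endpoint picker can reach each vLLM server's `/metrics` endpoint. The URL, the `scrapeCacheConfig` helper, and the direct mapping of `block_size` to `hashBlockSize` and `num_gpu_blocks` to `lruCapacityPerServer` are assumptions for illustration, not the plugin's actual logic:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"

	"github.com/prometheus/common/expfmt"
)

// cacheConfig holds the two values the prefix-cache-scorer currently takes
// as manual parameters, derived here from vLLM's metrics instead.
type cacheConfig struct {
	BlockSize    int // from the block_size label
	NumGPUBlocks int // from the num_gpu_blocks label
}

// scrapeCacheConfig reads vllm:cache_config_info from a vLLM /metrics
// endpoint and extracts block_size and num_gpu_blocks from its labels.
func scrapeCacheConfig(metricsURL string) (*cacheConfig, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return nil, err
	}

	mf, ok := families["vllm:cache_config_info"]
	if !ok || len(mf.GetMetric()) == 0 {
		return nil, fmt.Errorf("vllm:cache_config_info not found at %s", metricsURL)
	}

	cfg := &cacheConfig{}
	for _, lp := range mf.GetMetric()[0].GetLabel() {
		switch lp.GetName() {
		case "block_size":
			cfg.BlockSize, err = strconv.Atoi(lp.GetValue())
		case "num_gpu_blocks":
			cfg.NumGPUBlocks, err = strconv.Atoi(lp.GetValue())
		}
		if err != nil {
			return nil, fmt.Errorf("parsing label %s: %w", lp.GetName(), err)
		}
	}
	return cfg, nil
}

func main() {
	// Assumed endpoint; vLLM exposes Prometheus metrics on its API server port.
	cfg, err := scrapeCacheConfig("http://vllm-pod:8000/metrics")
	if err != nil {
		panic(err)
	}
	// Illustrative mapping only: use block_size as hashBlockSize and
	// num_gpu_blocks as the per-server LRU capacity.
	fmt.Printf("hashBlockSize=%d lruCapacityPerServer=%d\n", cfg.BlockSize, cfg.NumGPUBlocks)
}
```

Since the EPP presumably already scrapes per-pod engine metrics for other scorers, this could perhaps be folded into the existing scraping path rather than a separate fetch.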
Why is this needed:
It may be difficult to manually calculate `lruCapacityPerServer` for different models and GPUs: the number of KV cache blocks a server can hold (num_gpu_blocks in the sample above) depends on the model, the GPU, and the engine configuration. Scraping these values from the engine's metrics would improve accuracy and reduce the maintenance effort.