Auto-configure prefix-cache-scorer parameters from engine metrics #1512

@qiumuyang

Description

What would you like to be added:

Currently, the prefix-cache-scorer plugin relies on two manually set parameters, hashBlockSize and lruCapacityPerServer, which are used to estimate the KV-cache behavior of the downstream LLM serving engine.

vLLM already exposes its cache configuration via Prometheus metrics:

# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="None",gpu_memory_utilization="0.9",is_attention_free="False",num_cpu_blocks="9362",num_gpu_blocks="55477",num_gpu_blocks_override="None",prefix_caching_hash_algo="builtin",sliding_window="None",swap_space="4",swap_space_bytes="4294967296"} 1.0

So I wonder if it would be possible to scrape these parameters from the engine metrics instead of setting them by hand.
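As a rough sketch of what such auto-configuration could look like: parse the vllm:cache_config_info line from a /metrics scrape and read the relevant labels. The metric name and labels come from the sample above; the idea of seeding lruCapacityPerServer directly from num_gpu_blocks is an assumption for illustration, not the plugin's actual logic.

```python
import re

# Sample scrape output (abbreviated from the metric shown above); in practice
# this text would be fetched from the serving engine's /metrics endpoint.
METRICS_TEXT = 'vllm:cache_config_info{block_size="16",cache_dtype="auto",num_gpu_blocks="55477",enable_prefix_caching="None"} 1.0'

def parse_cache_config(metrics_text: str) -> dict:
    """Extract the label key/value pairs from the vllm:cache_config_info gauge."""
    match = re.search(r'vllm:cache_config_info\{([^}]*)\}', metrics_text)
    if match is None:
        raise ValueError("vllm:cache_config_info metric not found in scrape")
    return dict(re.findall(r'(\w+)="([^"]*)"', match.group(1)))

labels = parse_cache_config(METRICS_TEXT)

# Hypothetical mapping onto the plugin's parameters:
hash_block_size = int(labels["block_size"])    # could replace manual hashBlockSize
lru_capacity = int(labels["num_gpu_blocks"])   # assumed basis for lruCapacityPerServer
```

A real implementation would need to handle pods whose engines report different values, and refresh the config when an engine restarts with a new metric sample.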

Why is this needed:

Manually calculating lruCapacityPerServer for different combinations of models and GPUs is difficult and error-prone.

Deriving these values from the engine's own metrics would improve the accuracy of the KV-cache estimate and reduce maintenance effort.

Metadata

Assignees

No one assigned

    Labels

    needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
