What would you like to be added:
Currently, the prefix-cache-scorer plugin relies on manually set parameters: `hashBlockSize` and `lruCapacityPerServer`. These values are used to estimate the KV cache behavior of the downstream LLM serving engine.
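For context, a rough sketch of what the manual configuration looks like today. Only the two parameter names come from this issue; the `apiVersion`/`kind`, surrounding structure, and the example values are assumptions for illustration:

```yaml
# Hypothetical sketch of the current manual configuration for the
# prefix-cache-scorer plugin. Only hashBlockSize and lruCapacityPerServer
# are taken from this issue; the surrounding structure and values are assumed.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: prefix-cache-scorer
    parameters:
      hashBlockSize: 64            # currently kept in sync with the engine's block size by hand
      lruCapacityPerServer: 31250  # currently estimated per model/GPU by hand
```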
vLLM provides its cache config info in Prometheus metrics:
# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="None",gpu_memory_utilization="0.9",is_attention_free="False",num_cpu_blocks="9362",num_gpu_blocks="55477",num_gpu_blocks_override="None",prefix_caching_hash_algo="builtin",sliding_window="None",swap_space="4",swap_space_bytes="4294967296"} 1.0
So I wonder if it would be possible to scrape these parameters from the engine's metrics instead.
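A minimal Go sketch of what such scraping could look like, assuming the endpoint picker can reach each vLLM server's `/metrics` endpoint. The URL, the `scrapeCacheConfig` helper, and the direct mapping of `block_size` to `hashBlockSize` and `num_gpu_blocks` to `lruCapacityPerServer` are assumptions for illustration, not the plugin's actual logic:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"

	"github.com/prometheus/common/expfmt"
)

// cacheConfig holds the two values the prefix-cache-scorer currently takes
// as manual parameters, derived here from vLLM's metrics instead.
type cacheConfig struct {
	BlockSize    int // from the block_size label
	NumGPUBlocks int // from the num_gpu_blocks label
}

// scrapeCacheConfig reads vllm:cache_config_info from a vLLM /metrics
// endpoint and extracts block_size and num_gpu_blocks from its labels.
func scrapeCacheConfig(metricsURL string) (*cacheConfig, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return nil, err
	}

	mf, ok := families["vllm:cache_config_info"]
	if !ok || len(mf.GetMetric()) == 0 {
		return nil, fmt.Errorf("vllm:cache_config_info not found at %s", metricsURL)
	}

	cfg := &cacheConfig{}
	for _, lp := range mf.GetMetric()[0].GetLabel() {
		switch lp.GetName() {
		case "block_size":
			cfg.BlockSize, err = strconv.Atoi(lp.GetValue())
		case "num_gpu_blocks":
			cfg.NumGPUBlocks, err = strconv.Atoi(lp.GetValue())
		}
		if err != nil {
			return nil, fmt.Errorf("parsing label %s: %w", lp.GetName(), err)
		}
	}
	return cfg, nil
}

func main() {
	// Assumed endpoint; vLLM exposes Prometheus metrics on its API server port.
	cfg, err := scrapeCacheConfig("http://vllm-pod:8000/metrics")
	if err != nil {
		panic(err)
	}
	// Illustrative mapping only: use block_size as hashBlockSize and
	// num_gpu_blocks as the per-server LRU capacity.
	fmt.Printf("hashBlockSize=%d lruCapacityPerServer=%d\n", cfg.BlockSize, cfg.NumGPUBlocks)
}
```

Since the EPP presumably already scrapes per-pod engine metrics for other scorers, this could perhaps be folded into the existing scraping path rather than a separate fetch.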
Why is this needed:
It may be difficult to manually calculate `lruCapacityPerServer` for different models and GPUs: the number of KV cache blocks a server can hold (num_gpu_blocks in the sample above) depends on the model, the GPU, and the engine configuration. Scraping these values from the engine's metrics would improve accuracy and reduce the maintenance effort.