| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` | `nv_trt_llm_request_metrics{request_type=waiting}` |
| KVCacheUtilization | Gauge | The current KV cache utilization as a percentage. | `vllm:gpu_cache_usage_perc` | `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}` |
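For reference, a scrape of a vLLM server's Prometheus endpoint might surface these two gauges roughly as below. This is an illustrative sketch only: the `model_name` label value is a placeholder, and the HELP strings and label sets can vary across server versions.

```
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.42
```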
### LoRA Adapter Serving
The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint:
* `running_lora_adapters`: A comma-separated list of adapters that are currently loaded in GPU
  memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
* `waiting_lora_adapters`: A comma-separated list of adapters that are waiting to be served. Example: `"waiting_lora_adapters": "adapter1, adapter2"`
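As one concrete shape for these fields, vLLM exposes the adapter lists as labels on its `vllm:lora_requests_info` gauge. The sample below is illustrative only; details such as the `max_lora` label, the HELP text, and the timestamp-valued gauge are vLLM-specific and may differ for other model servers.

```
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter1, adapter2",waiting_lora_adapters=""} 1.711e+09
```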
### Prefix Cache Reuse
Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
To benefit from prefix-aware request scheduling, model servers SHOULD support prefix cache reuse,
such as the [vLLM automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
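As a sketch of how a model server can satisfy this, vLLM's automatic prefix caching can be enabled when launching its OpenAI-compatible server. The model name below is a placeholder, the flag is described in the vLLM documentation linked above, and recent vLLM releases may already enable it by default.

```bash
# Launch vLLM with automatic prefix caching enabled so the EPP's
# prefix-aware scheduling can benefit from reused prompt prefixes.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-prefix-caching
```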