
Commit 070cbfb

Update model server protocol with prefix cache reuse (#1077)
1 parent 33eb946 commit 070cbfb

File tree

1 file changed (+11, -4 lines)

  • docs/proposals/003-model-server-protocol


docs/proposals/003-model-server-protocol/README.md

Lines changed: 11 additions & 4 deletions
@@ -21,10 +21,10 @@ effort.
 The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
 into the reference endpoint picker implementation.

-| Metric | Type | Description | vLLM metric |
-| ----- | ---- | ---- | ---- |
-| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`|
-| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`|
+| Metric | Type | Description | vLLM metric | Triton TensorRT-LLM|
+| ----- | ---- | ---- | ---- | ---- |
+| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`| `nv_trt_llm_request_metrics{request_type=waiting}`|
+| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`| `nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}`|


 ### LoRA Adapter Serving
@@ -48,3 +48,10 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
 * `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
 memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
 * `waiting_lora_adapters`: A comma separated list of adapters that are waiting to be served. Example: `"waiting_lora_adapters": "adapter1, adapter2"`
+
+### Prefix Cache Reuse
+
+Starting from [v0.4.0](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.4.0),
+the EPP supports [prefix cache optimized request scheduling](https://gateway-api-inference-extension.sigs.k8s.io/guides/epp-configuration/prefix-aware/).
+To benefit from the optimal prefix aware request scheduling, model servers SHOULD support prefix
+cache reuse, such as the [vllm automatic prefix caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html) feature.
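
For context, here is a minimal sketch of how a client such as the endpoint picker could scrape the two required metrics from a model server's Prometheus endpoint. The address, port, and use of the `prometheus_client` parser are assumptions for illustration, not part of the protocol; the name mapping is the vLLM column of the table in this change, and the Triton TensorRT-LLM names could be mapped the same way:

```python
# Minimal sketch: scrape the two required gauges from a model server's
# Prometheus endpoint. The URL is a placeholder.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # hypothetical address

# Protocol metric name -> vLLM metric name.
VLLM_METRICS = {
    "TotalQueuedRequests": "vllm:num_requests_waiting",
    "KVCacheUtilization": "vllm:gpu_cache_usage_perc",
}


def scrape(url: str = METRICS_URL) -> dict[str, float]:
    """Return the protocol metrics found at the endpoint, keyed by protocol name."""
    text = urllib.request.urlopen(url).read().decode("utf-8")
    wanted = {v: k for k, v in VLLM_METRICS.items()}
    values: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        if family.name in wanted:
            for sample in family.samples:
                # If several labeled samples exist (e.g. per served model), the last wins here.
                values[wanted[family.name]] = sample.value
    return values


if __name__ == "__main__":
    print(scrape())
```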
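
Similarly, a sketch of reading the LoRA adapter lists from the LoRA Adapter Serving section. It assumes the server exposes them as label values on a single gauge, as vLLM does with `vllm:lora_requests_info`; the metric name is left as a parameter since other servers may differ:

```python
# Minimal sketch: read the running/waiting LoRA adapter lists from the same
# Prometheus endpoint, assuming they are exposed as label values on one gauge.
import urllib.request

from prometheus_client.parser import text_string_to_metric_families


def lora_adapters(url: str = "http://localhost:8000/metrics",
                  metric_name: str = "vllm:lora_requests_info"):
    """Return (running, waiting) adapter name sets parsed from the metric labels."""

    def split(csv: str) -> set[str]:
        # Label values are comma separated names, e.g. "adapter1, adapter2".
        return {name.strip() for name in csv.split(",") if name.strip()}

    text = urllib.request.urlopen(url).read().decode("utf-8")
    running: set[str] = set()
    waiting: set[str] = set()
    for family in text_string_to_metric_families(text):
        if family.name != metric_name:
            continue
        for sample in family.samples:
            running |= split(sample.labels.get("running_lora_adapters", ""))
            waiting |= split(sample.labels.get("waiting_lora_adapters", ""))
    return running, waiting
```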
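
Finally, a sketch of the prefix cache reuse the new section asks for, using vLLM's automatic prefix caching via its offline `LLM` API. The model name is a placeholder, and depending on the vLLM version the feature may already be enabled by default; the OpenAI-compatible server exposes the equivalent `--enable-prefix-caching` flag:

```python
# Minimal sketch: run vLLM with prefix cache reuse enabled so requests sharing
# a prompt prefix can reuse cached KV blocks instead of recomputing them.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

shared_prefix = "You are a helpful assistant. Answer the user's question.\n\n"
params = SamplingParams(max_tokens=64)

# The second prompt shares the prefix with the first, so the KV cache computed
# for the shared tokens can be reused.
outputs = llm.generate(
    [
        shared_prefix + "Question: What is a KV cache?",
        shared_prefix + "Question: What is prefix cache reuse?",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```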
