
Commit 4aa1a8b

Merge branch 'main' into pr/transcription-whisper
2 parents 10f1caa + f5205ce commit 4aa1a8b

File tree

2 files changed: +77 -3 lines

docs/source/use_cases/kv-cache-aware-routing.rst

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
 KV Cache Aware Routing
 ======================

-This tutorial demonstrates how to use KV cache aware routing in the vLLM Production Stack. KV cache aware routing ensures that subsequent requests with the same prompt prefix are routed to the same instance, maximizing KV cache utilization and improving performance.
+In this tutorial, you'll learn how to enable and use KV cache aware routing in the vLLM Production Stack. With KV cache aware routing, incoming requests are routed to the instance with the highest KV cache hit rate, which helps maximize cache efficiency and boost overall performance. Unlike prefix aware routing—which always sends requests with the same prefix to the same instance, even if the cache has been evicted—KV cache aware routing prioritizes cache hits to optimize resource usage.

 Table of Contents
 -----------------
@@ -78,7 +78,7 @@ Then, send another request with the same prompt prefix:
 "max_tokens": 100
 }'

-You should observe that the second request is routed to the same instance as the first request. This is because the KV cache aware router detects that the second request shares a prefix with the first request and routes it to the same instance to maximize KV cache utilization.
+You should observe that the second request is routed to the same instance as the first request. This is because the KV cache aware router detects that the second request has a higher KV cache hit rate in the instance of the first request and routes it to the same instance to maximize KV cache utilization.

 Step 4: Clean Up
 -----------------
@@ -98,4 +98,4 @@ In this tutorial, we've demonstrated how to:
 2. Set up port forwarding to access the router
 3. Test the KV cache aware routing functionality

-The KV cache aware routing feature helps improve performance by ensuring that requests with shared prefixes are routed to the same instance, maximizing KV cache utilization.
+The KV cache aware routing feature helps improve performance by ensuring that requests will be routed to the instance with the highest KV cache hit rate, maximizing KV cache utilization.
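
As a rough illustration of the routing behavior the updated doc text describes, the sketch below picks the endpoint with the highest estimated KV cache hit rate instead of keying on the prompt prefix. It is not the vllm-router implementation; the function name, endpoint labels, and hit-rate values are assumptions made up for the example.

# Hypothetical sketch, not the actual vllm-router routing logic.
from typing import Dict, List


def pick_endpoint(endpoints: List[str], kv_hit_rates: Dict[str, float]) -> str:
    """Route to the endpoint with the highest estimated KV cache hit rate.

    Unlike prefix-aware routing, the decision tracks current cache contents,
    so an instance whose cache entries were evicted scores lower and stops
    attracting requests it can no longer serve from cache.
    """
    return max(endpoints, key=lambda url: kv_hit_rates.get(url, 0.0))


# The second request scores highest on the instance that already holds the
# first request's KV cache, so it is routed back to the same instance.
assert pick_endpoint(["vllm-0", "vllm-1"], {"vllm-0": 0.92, "vllm-1": 0.10}) == "vllm-0"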

src/tests/test_roundrobin_router.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+import random
+from typing import Dict, List, Tuple
+
+from vllm_router.routers.routing_logic import RoundRobinRouter
+
+
+class EndpointInfo:
+    def __init__(self, url: str):
+        self.url = url
+
+
+class RequestStats:
+    def __init__(self, qps: float):
+        self.qps = qps
+
+
+class Request:
+    def __init__(self, headers: Dict[str, str]):
+        self.headers = headers
+
+
+class EngineStats:
+    def __init__(self):
+        return
+
+
+def generate_request_args(
+    num_endpoints: int, qps_range: int = 0
+) -> Tuple[List[EndpointInfo], Dict[str, EngineStats], Dict[str, RequestStats]]:
+    endpoints = [
+        EndpointInfo(
+            url=f"{endpoint_index}",
+        )
+        for endpoint_index in range(num_endpoints)
+    ]
+    engine_stats = {
+        f"{endpoint_index}": EngineStats() for endpoint_index in range(num_endpoints)
+    }
+    request_stats = {
+        f"{endpoint_index}": RequestStats(qps=random.uniform(0, qps_range))
+        for endpoint_index in range(num_endpoints)
+    }
+    return endpoints, engine_stats, request_stats
+
+
+def generate_request(request_type="http") -> Request:
+    return Request({"type": request_type})
+
+
+def test_roundrobin_logic(
+    dynamic_discoveries: int = 10, max_endpoints: int = 1000, max_requests: int = 10000
+):
+    """
+    Ensure that all active urls have roughly same number of requests (difference at most 1)
+    """
+    router = RoundRobinRouter()
+
+    def _fixed_router_check(num_endpoints: int, num_requests: int) -> bool:
+        # Make num_requests requests to the router and check even output distribution
+        endpoints, engine_stats, request_stats = generate_request_args(num_endpoints)
+        output_distribution = {}
+        for request_idx in range(num_requests):
+            request = generate_request()
+            url = router.route_request(endpoints, engine_stats, request_stats, request)
+            output_distribution[url] = output_distribution.get(url, 0) + 1
+        request_counts = output_distribution.values()
+        return max(request_counts) - min(request_counts) <= 1
+
+    for _ in range(dynamic_discoveries):
+        num_endpoints = random.randint(1, max_endpoints)
+        num_requests = random.randint(1, max_requests)
+        # Perform router check
+        res = _fixed_router_check(num_endpoints, num_requests)
+        assert res
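
The new test drives RoundRobinRouter only through its route_request(endpoints, engine_stats, request_stats, request) call and asserts that, for any fixed set of endpoints, per-endpoint request counts differ by at most one. As a minimal sketch of a router that would satisfy that invariant (a hypothetical stand-in, not the class from vllm_router.routers.routing_logic):

# Hypothetical stand-in used only to illustrate the invariant the test checks;
# the real router is vllm_router.routers.routing_logic.RoundRobinRouter.
from typing import Any, Dict, List


class SimpleRoundRobinRouter:
    """Cycle through endpoints so request counts differ by at most one."""

    def __init__(self) -> None:
        self._counter = 0

    def route_request(
        self,
        endpoints: List[Any],
        engine_stats: Dict[str, Any],
        request_stats: Dict[str, Any],
        request: Any,
    ) -> str:
        # Engine and request stats are ignored: round robin only needs a counter.
        url = endpoints[self._counter % len(endpoints)].url
        self._counter += 1
        return url

The test itself can be run directly with pytest, e.g. pytest src/tests/test_roundrobin_router.py.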
