Description
Describe the feature
For testing and benchmark purposes, I would like vllm_router to send back in the response some headers (or json fields) mentioning backend information.
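For illustration, a response could then carry headers like the following (all names and values here are hypothetical, just to show the intent):

```
HTTP/1.1 200 OK
Content-Type: application/json
X-Request-Id: demo-request-1
X-Backend-Server: http://10.0.0.2:8000
X-Backend-Id: ep-2
X-Backend-Pod: vllm-decode-0
```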
Here is an example of how we could do it (courtesy of Claude Code):
Based on my analysis of the vLLM router code, I can see how to make the router return information about which backend processed the request. The router already passes the backend URL to process_request, and we can include this information in the response headers.
Let me update the code to add backend information to the response headers:
● Update(src/vllm_router/services/request_service/request.py)
⎿ Updated src/vllm_router/services/request_service/request.py with 8 additions
294 headers, status = await anext(stream_generator)
295 headers_dict = {key: value for key, value in headers.items()}
296 headers_dict["X-Request-Id"] = request_id
297 + # Add backend information to response headers
298 + headers_dict["X-Backend-Server"] = server_url
299 + # Find the endpoint that was used
300 + for ep in endpoints:
301 + if ep.url == server_url:
302 + headers_dict["X-Backend-Id"] = ep.Id
303 + headers_dict["X-Backend-Pod"] = ep.pod_name if hasattr(ep, 'pod_name') else "unknown"
304 + break
305 return StreamingResponse(
306 stream_generator,
307 status_code=status,
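The lookup logic in this hunk (and the near-identical one in the transcription route below) can be factored into one helper. A minimal, self-contained sketch — `EndpointInfo` is a hypothetical stand-in for the router's real endpoint objects, and the attribute names (`url`, `Id`, `pod_name`) simply follow the diff:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EndpointInfo:
    """Hypothetical stand-in for vllm_router's endpoint objects;
    the real class may use different attribute names."""
    url: str
    Id: str
    pod_name: Optional[str] = None

def add_backend_headers(headers: dict, server_url: str, endpoints: list) -> dict:
    """Mirror the lookup in the diff above: tag the response with the
    backend that actually served the request."""
    headers["X-Backend-Server"] = server_url
    for ep in endpoints:
        if ep.url == server_url:
            headers["X-Backend-Id"] = ep.Id
            # getattr also covers endpoint classes without a pod_name attribute
            headers["X-Backend-Pod"] = getattr(ep, "pod_name", None) or "unknown"
            break
    return headers

endpoints = [
    EndpointInfo(url="http://10.0.0.1:8000", Id="ep-1", pod_name="vllm-pod-0"),
    EndpointInfo(url="http://10.0.0.2:8000", Id="ep-2"),
]
h = add_backend_headers({"X-Request-Id": "demo"}, "http://10.0.0.2:8000", endpoints)
```

Using one helper in all three routes would keep the header names consistent and make the fallback to `"unknown"` a single code path.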
● Update(src/vllm_router/services/request_service/request.py)
⎿ Updated src/vllm_router/services/request_service/request.py with 6 additions and 1 removal
435 return StreamingResponse(
436 generate_stream(),
437 media_type="application/json",
438 - headers={"X-Request-Id": request_id},
438 + headers={
439 + "X-Request-Id": request_id,
440 + "X-Backend-Server-Prefill": str(request.app.state.prefill_client._base_url),
441 + "X-Backend-Server-Decode": str(request.app.state.decode_client._base_url),
442 + "X-Backend-Type": "disaggregated"
443 + },
444 )
445
446
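On the benchmarking side, these headers make it easy to check how traffic was split across backends. A minimal sketch — the header dicts are simulated here; in a real harness they would come from each HTTP response's headers:

```python
from collections import Counter

def tally_backends(responses):
    """Count how many responses each backend served, keyed on the
    proposed X-Backend-Server header."""
    return Counter(r.get("X-Backend-Server", "unknown") for r in responses)

# Simulated per-response header dicts (hypothetical URLs)
sampled = [
    {"X-Request-Id": "r1", "X-Backend-Server": "http://10.0.0.1:8000"},
    {"X-Request-Id": "r2", "X-Backend-Server": "http://10.0.0.2:8000"},
    {"X-Request-Id": "r3", "X-Backend-Server": "http://10.0.0.1:8000"},
]
counts = tally_backends(sampled)
```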
● Update(src/vllm_router/services/request_service/request.py)
⎿ Updated src/vllm_router/services/request_service/request.py with 8 additions
667 }
668
669 headers["X-Request-Id"] = request_id
670 + # Add backend information to response headers
671 + headers["X-Backend-Server"] = chosen_url
672 + # Find the endpoint that was used
673 + for ep in transcription_endpoints:
674 + if ep.url == chosen_url:
675 + headers["X-Backend-Id"] = ep.Id
676 + headers["X-Backend-Pod"] = ep.pod_name if hasattr(ep, 'pod_name') else "unknown"
677 + break
678
679 return JSONResponse(
680 content=response_content,
● Let me also check if there are any other routes that need updating:
● Search(pattern: "route_sleep_wakeup_request", path: "src/vllm_router/services/request_service/request.py", output_mode: "content")
⎿ Found 23 lines (ctrl+r to expand)
● Let me also update the sleep/wakeup route to include backend information:
● Update(src/vllm_router/services/request_service/request.py)
⎿ Updated src/vllm_router/services/request_service/request.py with 6 additions and 1 removal
519 return JSONResponse(
520 status_code=response_status,
521 content={"status": "success"},
522 - headers={"X-Request-Id": request_id},
522 + headers={
523 + "X-Request-Id": request_id,
524 + "X-Backend-Server": server_url,
525 + "X-Backend-Id": endpoints[0].Id,
526 + "X-Backend-Pod": endpoints[0].pod_name if hasattr(endpoints[0], 'pod_name') else "unknown"
527 + },
528 )
529
530
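One caveat with the sleep/wakeup hunk above: it indexes `endpoints[0]` unguarded, which would raise `IndexError` if the endpoint list is ever empty. A hedged sketch of a safer variant — names follow the diff, and the endpoint object here is a hypothetical stand-in:

```python
from types import SimpleNamespace

def sleep_wakeup_headers(request_id, server_url, endpoints):
    """Build the proposed backend headers, degrading gracefully when no
    endpoint metadata is available."""
    headers = {"X-Request-Id": request_id, "X-Backend-Server": server_url}
    if endpoints:  # avoid IndexError on an empty endpoint list
        ep = endpoints[0]
        headers["X-Backend-Id"] = ep.Id
        headers["X-Backend-Pod"] = getattr(ep, "pod_name", None) or "unknown"
    return headers

ep = SimpleNamespace(Id="ep-1", pod_name="vllm-pod-0")
with_ep = sleep_wakeup_headers("r1", "http://10.0.0.1:8000", [ep])
without_ep = sleep_wakeup_headers("r2", "http://10.0.0.1:8000", [])
```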
Why do you need this feature?
For performance benchmarking of disaggregated and heterogeneous clusters.
Additional context
No response