
vllm_router : add debug headers in the response #683

@ghpu

Description

Describe the feature

For testing and benchmarking purposes, I would like vllm_router to include in the response some headers (or JSON fields) identifying the backend that served the request.

Here is an example of how we could do it (courtesy of Claude Code):

Based on my analysis of the vLLM router code, I can see how to make the router return information about which backend processed the request. The router already passes the backend URL to process_request, and we can include this information in the response headers.

Let me update the code to add backend information to the response headers:

● Update(src/vllm_router/services/request_service/request.py)
⎿  Updated src/vllm_router/services/request_service/request.py with 8 additions
294 headers, status = await anext(stream_generator)
295 headers_dict = {key: value for key, value in headers.items()}
296 headers_dict["X-Request-Id"] = request_id
297 + # Add backend information to response headers
298 + headers_dict["X-Backend-Server"] = server_url
299 + # Find the endpoint that was used
300 + for ep in endpoints:
301 + if ep.url == server_url:
302 + headers_dict["X-Backend-Id"] = ep.Id
303 + headers_dict["X-Backend-Pod"] = ep.pod_name if hasattr(ep, 'pod_name') else "unknown"
304 + break
305 return StreamingResponse(
306 stream_generator,
307 status_code=status,

● Update(src/vllm_router/services/request_service/request.py)
⎿  Updated src/vllm_router/services/request_service/request.py with 6 additions and 1 removal
435 return StreamingResponse(
436 generate_stream(),
437 media_type="application/json",
438 - headers={"X-Request-Id": request_id},
438 + headers={
439 + "X-Request-Id": request_id,
440 + "X-Backend-Server-Prefill": str(request.app.state.prefill_client._base_url),
441 + "X-Backend-Server-Decode": str(request.app.state.decode_client._base_url),
442 + "X-Backend-Type": "disaggregated"
443 + },
444 )
445
446

● Update(src/vllm_router/services/request_service/request.py)
⎿  Updated src/vllm_router/services/request_service/request.py with 8 additions
667 }
668
669 headers["X-Request-Id"] = request_id
670 + # Add backend information to response headers
671 + headers["X-Backend-Server"] = chosen_url
672 + # Find the endpoint that was used
673 + for ep in transcription_endpoints:
674 + if ep.url == chosen_url:
675 + headers["X-Backend-Id"] = ep.Id
676 + headers["X-Backend-Pod"] = ep.pod_name if hasattr(ep, 'pod_name') else "unknown"
677 + break
678
679 return JSONResponse(
680 content=response_content,
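The issue description also mentions JSON fields as an alternative to headers. For routes that return a JSON body, the backend info could be embedded in the payload instead; a hedged sketch (function and field names are hypothetical, not part of vllm_router):

```python
def add_backend_fields(response_content, server_url):
    """Embed backend debug info as a JSON field instead of a header.

    Hypothetical alternative sketched from the issue's mention of
    "headers (or json fields)"; copies the payload rather than
    mutating it, and namespaces the debug data under one key.
    """
    out = dict(response_content)
    out["x_backend_debug"] = {"server": server_url}
    return out
```

Headers have the advantage of working for streaming responses too, where rewriting the body would mean buffering or re-parsing each chunk.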

● Let me also check if there are any other routes that need updating:

● Search(pattern: "route_sleep_wakeup_request", path: "src/vllm_router/services/request_service/request.py", output_mode: "content")
⎿  Found 23 lines (ctrl+r to expand)

● Let me also update the sleep/wakeup route to include backend information:

● Update(src/vllm_router/services/request_service/request.py)
⎿  Updated src/vllm_router/services/request_service/request.py with 6 additions and 1 removal
519 return JSONResponse(
520 status_code=response_status,
521 content={"status": "success"},
522 - headers={"X-Request-Id": request_id},
522 + headers={
523 + "X-Request-Id": request_id,
524 + "X-Backend-Server": server_url,
525 + "X-Backend-Id": endpoints[0].Id,
526 + "X-Backend-Pod": endpoints[0].pod_name if hasattr(endpoints[0], 'pod_name') else "unknown"
527 + },
528 )
529
530

Why do you need this feature?

For performance benchmarking of disaggregated and heterogeneous clusters.
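With the headers above, a benchmark harness could attribute per-request latency to the backend that served it. A minimal sketch of the client-side aggregation (the function is hypothetical; it only assumes the `X-Backend-Server` header from the diffs):

```python
from collections import defaultdict
from statistics import mean

def latency_by_backend(samples):
    """Group benchmark latencies by the backend reported in headers.

    `samples` is a list of (headers_dict, latency_seconds) pairs
    collected during a run; the header name follows this issue.
    """
    buckets = defaultdict(list)
    for headers, latency in samples:
        buckets[headers.get("X-Backend-Server", "unknown")].append(latency)
    return {backend: mean(vals) for backend, vals in buckets.items()}
```

This makes it straightforward to spot a slow node in a heterogeneous cluster without correlating router logs.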

Additional context

No response
