
Conversation

@kobe0938 (Contributor) commented Jul 2, 2025

This PR

  1. verifies that the router correctly distributes requests using round-robin logic while handling high concurrency, without routing failures
  2. adds a GitHub Action to stress test the round-robin router with 20k concurrent requests
  3. records the logs and performance metrics of the round-robin router for future comparison with other routing algorithms

TODO: add stress tests for the prefix-aware and session-based routers later
TODO: add performance-comparison tests and thresholds later

This test uses MOCK backends and MOCK responses (no real vLLM servers):

  • Backend ports are dummy placeholders; no actual services run on them.
  • Model names are dummy placeholders; no real models are loaded.
  • When VLLM_ROUTER_STRESS_TEST_MODE=true, the router returns mock responses instead of forwarding requests to backends (see src/vllm_router/services/request_service/request.py).
  • This test validates ONLY the router's routing logic, load balancing, and performance under high concurrent loads, not actual inference capabilities.
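For reviewers who want to reproduce the run locally, a hypothetical invocation is sketched below; the flag values mirror the workflow snippet quoted later in this review, but the log directory and the explicit export of the environment variable are my assumptions, not part of the PR's documented interface.

```bash
# Hypothetical local run of the stress test; flags mirror the workflow snippet
# reviewed below, while the log directory path is illustrative.
export VLLM_ROUTER_STRESS_TEST_MODE=true
bash tests/e2e/router-stress-test.sh \
  --requests 100000 \
  --port 30080 \
  --log-dir /tmp/router-stress-logs \
  --model "facebook/opt-125m"
```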

[Screenshot: CleanShot 2025-07-02 at 00 30 37@2x]

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s with git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please classify PRs so the type of change is easy to understand: prefix the PR title appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of @YuhanLiu11, @Shaoting-Feng, or @ApostaC.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @kobe0938, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dedicated and robust end-to-end stress testing framework for the vLLM router. It enables comprehensive validation of the router's round-robin load balancing capabilities and performance under high concurrent loads by simulating requests and mock responses, thereby isolating the router's behavior from actual backend inference processes.

Highlights

  • Mock Response Mechanism for Stress Testing: Introduced a new environment variable, VLLM_ROUTER_STRESS_TEST_MODE. When this variable is set to true, the router's request processing (src/vllm_router/services/request_service/request.py lines 99-110) is short-circuited to return mock responses directly. This allows for isolated stress testing of the router's internal logic and performance without requiring actual vLLM backend servers, while still ensuring request_stats_monitor calls are made for metric collection.
  • Comprehensive Round-Robin Stress Test Script: Added a new end-to-end bash script (tests/e2e/router-stress-test.sh) designed to stress test the router's round-robin load balancing. This script automates starting the router in stress test mode, configuring it with static mock backends and round-robin logic, using Apache Bench (ab) to generate a high volume of concurrent requests (defaulting to 20,000 concurrent and 100,000 total requests), and, crucially, analyzing router logs to verify the correctness of the round-robin distribution, ensuring requests are evenly distributed within a 20% tolerance (a sketch of this check follows this list).
  • Router Performance and Correctness Validation: The primary objective of these changes is to verify that the router correctly distributes requests using round-robin logic and can handle high concurrency without routing failures. It also aims to record performance metrics under these conditions for future comparisons with other routing algorithms.
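For readers skimming the script, here is a minimal sketch of what such a tolerance check could look like; the log-line format, function name, and backend URLs are assumptions for illustration — the actual parsing lives in tests/e2e/router-stress-test.sh.

```python
# Minimal sketch of a 20%-tolerance round-robin check. Assumes the router
# log contains one "... to http://<backend>" line per routed request; the
# exact log format and URLs are illustrative, not the script's actual code.
import re
from collections import Counter

def check_round_robin(log_path, backend_urls, total_requests, tolerance=0.20):
    counts = Counter()
    pattern = re.compile(r"to (http://[\w.\-:]+)")
    with open(log_path) as f:
        for line in f:
            match = pattern.search(line)
            if match and match.group(1) in backend_urls:
                counts[match.group(1)] += 1
    # Round-robin over N backends should give each ~total/N requests.
    expected = total_requests / len(backend_urls)
    return all(
        abs(counts[url] - expected) <= tolerance * expected
        for url in backend_urls
    )

# Example: two mock backends, 100,000 total requests
# check_round_robin("router.log",
#                   ["http://localhost:9001", "http://localhost:9002"],
#                   100_000)
```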
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a stress test for the router's round-robin logic, adding a mock response mode to the request service and a new shell script for execution. The changes are generally well-implemented, but there are important considerations regarding production safety, code maintainability, and script correctness that are detailed in the review comments.

Comment on lines 100 to 110:

```python
if os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true":
    # Mock response for stress testing - skip backend calls
    mock_headers = {"content-type": "application/json", "x-request-id": request_id}
    mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'

    # Yield headers and mock response
    yield mock_headers, 200
    request.app.state.request_stats_monitor.on_request_response(backend_url, request_id, time.time())
    yield mock_response
    request.app.state.request_stats_monitor.on_request_complete(backend_url, request_id, time.time())
    return
```


high

Adding a test-only mode controlled by an environment variable directly in the request path introduces a potential risk. If VLLM_ROUTER_STRESS_TEST_MODE is accidentally set in a production environment, the router will serve only mock responses, causing a service outage. A safer approach is to check this environment variable only at application startup and log a prominent warning if it's enabled. This makes the application's mode of operation clear and avoids the per-request overhead of os.getenv. While changing app.py is outside this PR's scope, it's a critical consideration for production safety.
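A minimal sketch of the startup-time alternative described above, assuming a module-level flag is acceptable; the constant name, logger setup, and placement are illustrative rather than the router's actual code.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Evaluated once at import/startup rather than on every request.
STRESS_TEST_MODE = (
    os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true"
)

if STRESS_TEST_MODE:
    logger.warning(
        "VLLM_ROUTER_STRESS_TEST_MODE is enabled: the router will return "
        "mock responses and will NOT forward requests to any backend."
    )
```

The request path would then test the module-level STRESS_TEST_MODE constant, which is a cheap lookup and makes the operating mode visible in the startup logs.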

```python
if os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true":
    # Mock response for stress testing - skip backend calls
    mock_headers = {"content-type": "application/json", "x-request-id": request_id}
    mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'
```


medium

Hardcoding a JSON string makes it less readable and harder to maintain. It's better to define the response as a Python dictionary and serialize it to JSON. This improves clarity and simplifies future modifications.

Suggested change:

```python
mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'
```

```python
mock_response = json.dumps({
    "id": "test",
    "object": "chat.completion",
    "choices": [{
        "message": {"role": "assistant", "content": "Test"},
        "index": 0,
        "finish_reason": "stop"
    }]
}).encode("utf-8")
```
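A further refinement, not part of the suggestion above but worth considering: since the mock body never changes, it could be serialized once at module scope so the hot path reuses the pre-encoded bytes. The constant name here is hypothetical.

```python
import json

# Hypothetical: encode once at import time so a 20k-concurrent run doesn't
# pay a json.dumps call per request; the constant name is illustrative.
MOCK_CHAT_COMPLETION = json.dumps({
    "id": "test",
    "object": "chat.completion",
    "choices": [{
        "message": {"role": "assistant", "content": "Test"},
        "index": 0,
        "finish_reason": "stop",
    }],
}).encode("utf-8")
```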

```bash
print_status "Concurrent: $CONCURRENT, Total: $REQUESTS"

# Create payload file
local payload_file="/tmp/stress_payload.json"
```


medium

Using a fixed filename in /tmp for the payload file can cause conflicts if multiple instances of this script run concurrently. This can lead to race conditions and unpredictable behavior. It's safer to use mktemp to generate a unique temporary filename.

Suggested change:

```bash
local payload_file="/tmp/stress_payload.json"
```

```bash
local payload_file=$(mktemp /tmp/stress_payload.XXXXXX.json)
```
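As a follow-on to the mktemp suggestion (my addition, not part of the review comment): pairing it with a trap guarantees the unique file is removed however the script exits.

```bash
# Hypothetical cleanup pattern to pair with mktemp; the file is removed on
# any exit path, so repeated or aborted runs don't accumulate temp files.
payload_file=$(mktemp /tmp/stress_payload.XXXXXX.json)
trap 'rm -f "$payload_file"' EXIT
```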

kobe0938 and others added 2 commits July 2, 2025 00:54
Signed-off-by: Kobe Chen <[email protected]>
per gemini's suggestion

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kobe Chen <[email protected]>
@kobe0938 force-pushed the stress-test-router branch from 37fda5d to 0e79e3d on July 2, 2025 08:09
Signed-off-by: Kobe Chen <[email protected]>

```yaml
  router-stress-test:
    runs-on: self-hosted
    needs: e2e-test
```
Collaborator left a comment:

The stress test should mainly focus on performance rather than routing correctness, since the latter is covered in the discovery test. Therefore this action can be a separate workflow and can be further pipelined with other workflows.
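A hypothetical shape for that split, assuming the existing e2e workflow is the upstream trigger; the workflow names are assumptions, not files in this PR.

```yaml
# Hypothetical standalone workflow; the upstream workflow name is an assumption.
name: Router Stress Test
on:
  workflow_run:
    workflows: ["E2E tests"]
    types: [completed]

jobs:
  router-stress-test:
    # Only run after the e2e workflow has finished successfully.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: self-hosted
```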

```yaml
      with:
        name: router-stress-test-results-pr-${{ github.event.pull_request.number || 'main' }}
        path: |
          ${{ env.LOG_DIR }}/*
```
Collaborator left a comment:

A more concise result that better showcases the performance can be uploaded, as the current log file is too large.

```diff
@@ -231,3 +231,62 @@ jobs:
         pkill -f "python3 -m src.vllm_router.app" || true

     - run: echo "🍏 Static discovery e2e test job status is ${{ job.status }}."

+  router-stress-test:
+    runs-on: self-hosted
```
Collaborator left a comment:

Can a GitHub-hosted runner handle this task, since no real model is needed?

```bash
  --requests 100000 \
  --port 30080 \
  --log-dir "$LOG_DIR" \
  --model "facebook/opt-125m" \
```
Collaborator left a comment:

It would be better to deprecate these fields to improve clarity.

```python
request.app.state.request_stats_monitor.on_request_response(
    backend_url, request_id, time.time()
)
yield mock_response
```
Collaborator left a comment:

Can we use a stub HTTP server or dummy backend server instead? That way we would not need to modify the source code.
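To make the suggestion concrete, here is a minimal sketch of such a dummy backend, assuming FastAPI and uvicorn are available; the port and payload are illustrative. Pointing the router's static backend list at a few of these would exercise real HTTP hops without any change to request.py.

```python
# Hypothetical dummy backend for stress testing; returns the same canned
# chat completion for every request, so only the router's logic is measured.
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions() -> dict:
    return {
        "id": "test",
        "object": "chat.completion",
        "choices": [{
            "message": {"role": "assistant", "content": "Test"},
            "index": 0,
            "finish_reason": "stop",
        }],
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9001)
```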
