[CI/Build] Stress tests on router with round-robin #562
Conversation
… and logging paths accordingly. Signed-off-by: Kobe Chen <[email protected]>
Summary of Changes
Hello @kobe0938, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a dedicated and robust end-to-end stress testing framework for the vLLM router. It enables comprehensive validation of the router's round-robin load balancing capabilities and performance under high concurrent loads by simulating requests and mock responses, thereby isolating the router's behavior from actual backend inference processes.
Highlights
- Mock Response Mechanism for Stress Testing: Introduced a new environment variable, VLLM_ROUTER_STRESS_TEST_MODE. When this variable is set to true, the router's request processing (src/vllm_router/services/request_service/request.py, lines 99-110) is short-circuited to return mock responses directly. This allows for isolated stress testing of the router's internal logic and performance without requiring actual vLLM backend servers, while still ensuring request_stats_monitor calls are made for metric collection.
- Comprehensive Round-Robin Stress Test Script: Added a new end-to-end bash script (tests/e2e/router-stress-test.sh) designed to stress test the router's round-robin load balancing. This script automates starting the router in stress test mode, configuring it with static mock backends and round-robin logic, using Apache Bench (ab) to generate a high volume of concurrent requests (defaulting to 20,000 concurrent and 100,000 total requests), and, crucially, analyzing router logs to verify the correctness of the round-robin distribution, ensuring requests are evenly distributed within a 20% tolerance (a sketch of this check appears after this list).
- Router Performance and Correctness Validation: The primary objective of these changes is to verify that the router correctly distributes requests using round-robin logic and can handle high concurrency without routing failures. It also aims to record performance metrics under these conditions for future comparison with other routing algorithms.
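To make the log-analysis step concrete, here is a minimal sketch of how such a tolerance check could be written. This is illustrative only: the log pattern and the helper name are assumptions for the example, not the actual implementation in tests/e2e/router-stress-test.sh (which performs this check in bash).

```python
# Illustrative sketch only -- the real check lives in tests/e2e/router-stress-test.sh.
# The log line pattern and function name are assumptions for this example.
import re
from collections import Counter

def check_round_robin(log_path: str, tolerance: float = 0.20) -> bool:
    """Return True if requests are spread across backends within `tolerance` of the mean."""
    counts = Counter()
    backend_pattern = re.compile(r"to backend (\S+)")  # assumed log format
    with open(log_path) as log:
        for line in log:
            match = backend_pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    if not counts:
        return False
    mean = sum(counts.values()) / len(counts)
    # Round-robin should keep every backend's request count close to the mean.
    return all(abs(count - mean) <= tolerance * mean for count in counts.values())
```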
Code Review
This pull request introduces a stress test for the router's round-robin logic, adding a mock response mode to the request service and a new shell script for execution. The changes are generally well-implemented, but there are important considerations regarding production safety, code maintainability, and script correctness that are detailed in the review comments.
if os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true":
    # Mock response for stress testing - skip backend calls
    mock_headers = {"content-type": "application/json", "x-request-id": request_id}
    mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'

    # Yield headers and mock response
    yield mock_headers, 200
    request.app.state.request_stats_monitor.on_request_response(backend_url, request_id, time.time())
    yield mock_response
    request.app.state.request_stats_monitor.on_request_complete(backend_url, request_id, time.time())
    return
Adding a test-only mode controlled by an environment variable directly in the request path introduces a potential risk. If VLLM_ROUTER_STRESS_TEST_MODE is accidentally set in a production environment, the router will serve only mock responses, causing a service outage. A safer approach is to check this environment variable only at application startup and log a prominent warning if it's enabled. This makes the application's mode of operation clear and avoids the per-request overhead of os.getenv. While changing app.py is outside this PR's scope, it's a critical consideration for production safety.
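As a rough sketch of this suggestion (the module placement and logger setup are assumptions, not the project's actual app.py), the check could be hoisted to startup like so:

```python
# Hypothetical startup-time check -- assumed placement, not the actual app.py.
import logging
import os

logger = logging.getLogger(__name__)

# Read the environment variable once at startup instead of on every request.
STRESS_TEST_MODE = (
    os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true"
)

if STRESS_TEST_MODE:
    logger.warning(
        "VLLM_ROUTER_STRESS_TEST_MODE is enabled: all requests will receive "
        "MOCK responses and no backend will be contacted. "
        "Never enable this in production."
    )
```

The request path would then branch on the module-level STRESS_TEST_MODE constant, which both surfaces the mode prominently at startup and removes the per-request os.getenv call.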
if os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true":
    # Mock response for stress testing - skip backend calls
    mock_headers = {"content-type": "application/json", "x-request-id": request_id}
    mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'
Hardcoding a JSON string makes it less readable and harder to maintain. It's better to define the response as a Python dictionary and serialize it to JSON. This improves clarity and simplifies future modifications.
Suggested change:

mock_response = json.dumps({
    "id": "test",
    "object": "chat.completion",
    "choices": [{
        "message": {"role": "assistant", "content": "Test"},
        "index": 0,
        "finish_reason": "stop"
    }]
}).encode("utf-8")
print_status "Concurrent: $CONCURRENT, Total: $REQUESTS" | ||
|
||
# Create payload file | ||
local payload_file="/tmp/stress_payload.json" |
Using a fixed filename in /tmp for the payload file can cause conflicts if multiple instances of this script run concurrently. This can lead to race conditions and unpredictable behavior. It's safer to use mktemp to generate a unique temporary filename.
local payload_file="/tmp/stress_payload.json" | |
local payload_file=$(mktemp /tmp/stress_payload.XXXXXX.json) |
per gemini's suggestion
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kobe Chen <[email protected]>

Force-pushed from 37fda5d to 0e79e3d.
router-stress-test:
  runs-on: self-hosted
  needs: e2e-test
The stress test should mainly focus on performance rather than routing correctness, since the latter is already covered by the discovery test. Therefore this action could be a separate workflow, which could then be pipelined with other workflows.
with:
  name: router-stress-test-results-pr-${{ github.event.pull_request.number || 'main' }}
  path: |
    ${{ env.LOG_DIR }}/*
A more concise result that better showcases the performance can be uploaded, as the current log file is too large.
@@ -231,3 +231,62 @@ jobs:
        pkill -f "python3 -m src.vllm_router.app" || true

    - run: echo "🍏 Static discovery e2e test job status is ${{ job.status }}."

  router-stress-test:
    runs-on: self-hosted
Could a GitHub-hosted runner handle this task, since no real model is needed?
--requests 100000 \
--port 30080 \
--log-dir "$LOG_DIR" \
--model "facebook/opt-125m" \
It would be better to deprecate these fields to improve clarity.
request.app.state.request_stats_monitor.on_request_response(
    backend_url, request_id, time.time()
)
yield mock_response
Can we use a stub HTTP server or a dummy backend server instead? That way we would not need to modify the source code.
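For illustration, a stub backend along the lines of this suggestion could be quite small; the port and payload below are assumptions for the sketch, not the project's actual test setup:

```python
# Hypothetical stub backend for stress testing -- assumed port and payload.
# It returns a canned chat.completion response so the router can be
# stress-tested without modifying the router's source code.
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MOCK_RESPONSE = json.dumps({
    "id": "test",
    "object": "chat.completion",
    "choices": [{
        "message": {"role": "assistant", "content": "Test"},
        "index": 0,
        "finish_reason": "stop",
    }],
}).encode("utf-8")

class StubHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body, then reply with the canned response.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(MOCK_RESPONSE)))
        self.end_headers()
        self.wfile.write(MOCK_RESPONSE)

    def log_message(self, *args):
        pass  # keep stress-test output quiet

if __name__ == "__main__":
    # Assumed port; one such stub would run per mock backend.
    ThreadingHTTPServer(("0.0.0.0", 9001), StubHandler).serve_forever()
```

One trade-off: the router may also probe health or discovery endpoints on its backends, so a stub would need to serve those too, which may be why this PR mocks inside the router instead.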
This PR

TODO: will add stress tests for prefix-aware and session-based routers later
TODO: will add performance comparison tests & thresholds later

- This test uses MOCK backends and MOCK responses (no real vLLM servers).
- Backend ports are dummy placeholders - no actual services run on them.
- Model names are dummy placeholders - no real models are loaded.
- When VLLM_ROUTER_STRESS_TEST_MODE=true, the router returns mock responses instead of forwarding requests to backends (see src/vllm_router/services/request_service/request.py).
- This test validates ONLY the router's routing logic, load balancing, and performance under high concurrent loads, not actual inference capabilities.
BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

- Sign off your commits by using -s when doing git commit.
- Classify the PR with a prefix such as [Bugfix], [Feat], and [CI].

Detailed Checklist (Click to Expand)
Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:

- Use pre-commit to format your code. See README.md for installation.

DCO and Signed-off-by
When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO. Using -s with git commit will automatically add this header.

What to Expect for the Reviews
We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng, or ApostaC.