
Conversation

@kobe0938 (Contributor) commented Jul 2, 2025

This PR

  1. verifies that the router correctly distributes requests using round-robin logic while handling high concurrency, without routing failures
  2. adds a GitHub Action to stress test the round-robin router with 20k concurrent requests
  3. records the logs and performance metrics of the round-robin router for future comparison with other routing algorithms

TODO: add stress tests for the prefix-aware and session-based routers later
TODO: add performance-comparison tests and thresholds later

This test uses MOCK backends and MOCK responses (no real vLLM servers):

  • Backend ports are dummy placeholders; no actual services run on them.
  • Model names are dummy placeholders; no real models are loaded.
  • When VLLM_ROUTER_STRESS_TEST_MODE=true, the router returns mock responses instead of forwarding requests to backends (see src/vllm_router/services/request_service/request.py).
  • This test validates ONLY the router's routing logic, load balancing, and performance under high concurrent loads, not actual inference capabilities.
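For reviewers who want to reproduce the run locally, a hypothetical invocation is sketched below; the flag values mirror the workflow snippet quoted later in this review, but the log directory and the explicit export of the environment variable are my assumptions, not part of the PR's documented interface.

```bash
# Hypothetical local run of the stress test; flags mirror the workflow snippet
# reviewed below, while the log directory path is illustrative.
export VLLM_ROUTER_STRESS_TEST_MODE=true
bash tests/e2e/router-stress-test.sh \
  --requests 100000 \
  --port 30080 \
  --log-dir /tmp/router-stress-logs \
  --model "facebook/opt-125m"
```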

[Screenshot: CleanShot 2025-07-02 at 00 30 37@2x]

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign off your commit by using -s with git commit.
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please classify PRs so the type of change is easy to understand: prefix the PR title appropriately to indicate the type of change, using one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of @YuhanLiu11, @Shaoting-Feng, or @ApostaC.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @kobe0938, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dedicated and robust end-to-end stress testing framework for the vLLM router. It enables comprehensive validation of the router's round-robin load balancing capabilities and performance under high concurrent loads by simulating requests and mock responses, thereby isolating the router's behavior from actual backend inference processes.

Highlights

  • Mock Response Mechanism for Stress Testing: Introduced a new environment variable, VLLM_ROUTER_STRESS_TEST_MODE. When this variable is set to true, the router's request processing (src/vllm_router/services/request_service/request.py lines 99-110) is short-circuited to return mock responses directly. This allows for isolated stress testing of the router's internal logic and performance without requiring actual vLLM backend servers, while still ensuring request_stats_monitor calls are made for metric collection.
  • Comprehensive Round-Robin Stress Test Script: Added a new end-to-end bash script (tests/e2e/router-stress-test.sh) designed to stress test the router's round-robin load balancing. This script automates starting the router in stress test mode, configuring it with static mock backends and round-robin logic, using Apache Bench (ab) to generate a high volume of concurrent requests (defaulting to 20,000 concurrent and 100,000 total requests), and, crucially, analyzing router logs to verify the correctness of the round-robin distribution, ensuring requests are evenly distributed within a 20% tolerance (a sketch of this check follows this list).
  • Router Performance and Correctness Validation: The primary objective of these changes is to verify that the router correctly distributes requests using round-robin logic and can handle high concurrency without routing failures. It also aims to record performance metrics under these conditions for future comparisons with other routing algorithms.
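For readers skimming the script, here is a minimal sketch of what such a tolerance check could look like; the log-line format, function name, and backend URLs are assumptions for illustration — the actual parsing lives in tests/e2e/router-stress-test.sh.

```python
# Minimal sketch of a 20%-tolerance round-robin check. Assumes the router
# log contains one "... to http://<backend>" line per routed request; the
# exact log format and URLs are illustrative, not the script's actual code.
import re
from collections import Counter

def check_round_robin(log_path, backend_urls, total_requests, tolerance=0.20):
    counts = Counter()
    pattern = re.compile(r"to (http://[\w.\-:]+)")
    with open(log_path) as f:
        for line in f:
            match = pattern.search(line)
            if match and match.group(1) in backend_urls:
                counts[match.group(1)] += 1
    # Round-robin over N backends should give each ~total/N requests.
    expected = total_requests / len(backend_urls)
    return all(
        abs(counts[url] - expected) <= tolerance * expected
        for url in backend_urls
    )

# Example: two mock backends, 100,000 total requests
# check_round_robin("router.log",
#                   ["http://localhost:9001", "http://localhost:9002"],
#                   100_000)
```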
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a stress test for the router's round-robin logic, adding a mock response mode to the request service and a new shell script for execution. The changes are generally well-implemented, but there are important considerations regarding production safety, code maintainability, and script correctness that are detailed in the review comments.

Comment on lines 100 to 110:

```python
if os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true":
    # Mock response for stress testing - skip backend calls
    mock_headers = {"content-type": "application/json", "x-request-id": request_id}
    mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'

    # Yield headers and mock response
    yield mock_headers, 200
    request.app.state.request_stats_monitor.on_request_response(backend_url, request_id, time.time())
    yield mock_response
    request.app.state.request_stats_monitor.on_request_complete(backend_url, request_id, time.time())
    return
```


high

Adding a test-only mode controlled by an environment variable directly in the request path introduces a potential risk. If VLLM_ROUTER_STRESS_TEST_MODE is accidentally set in a production environment, the router will serve only mock responses, causing a service outage. A safer approach is to check this environment variable only at application startup and log a prominent warning if it's enabled. This makes the application's mode of operation clear and avoids the per-request overhead of os.getenv. While changing app.py is outside this PR's scope, it's a critical consideration for production safety.
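A minimal sketch of the startup-time alternative described above, assuming a module-level flag is acceptable; the constant name, logger setup, and placement are illustrative rather than the router's actual code.

```python
import logging
import os

logger = logging.getLogger(__name__)

# Evaluated once at import/startup rather than on every request.
STRESS_TEST_MODE = (
    os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true"
)

if STRESS_TEST_MODE:
    logger.warning(
        "VLLM_ROUTER_STRESS_TEST_MODE is enabled: the router will return "
        "mock responses and will NOT forward requests to any backend."
    )
```

The request path would then test the module-level STRESS_TEST_MODE constant, which is a cheap lookup and makes the operating mode visible in the startup logs.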

```python
if os.getenv("VLLM_ROUTER_STRESS_TEST_MODE", "false").lower() == "true":
    # Mock response for stress testing - skip backend calls
    mock_headers = {"content-type": "application/json", "x-request-id": request_id}
    mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'
```


medium

Hardcoding a JSON string makes it less readable and harder to maintain. It's better to define the response as a Python dictionary and serialize it to JSON. This improves clarity and simplifies future modifications.

Suggested change:

```python
mock_response = b'{"id":"test","object":"chat.completion","choices":[{"message":{"role":"assistant","content":"Test"},"index":0,"finish_reason":"stop"}]}'
```

```python
mock_response = json.dumps({
    "id": "test",
    "object": "chat.completion",
    "choices": [{
        "message": {"role": "assistant", "content": "Test"},
        "index": 0,
        "finish_reason": "stop"
    }]
}).encode("utf-8")
```
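A further refinement, not part of the suggestion above but worth considering: since the mock body never changes, it could be serialized once at module scope so the hot path reuses the pre-encoded bytes. The constant name here is hypothetical.

```python
import json

# Hypothetical: encode once at import time so a 20k-concurrent run doesn't
# pay a json.dumps call per request; the constant name is illustrative.
MOCK_CHAT_COMPLETION = json.dumps({
    "id": "test",
    "object": "chat.completion",
    "choices": [{
        "message": {"role": "assistant", "content": "Test"},
        "index": 0,
        "finish_reason": "stop",
    }],
}).encode("utf-8")
```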

```bash
print_status "Concurrent: $CONCURRENT, Total: $REQUESTS"

# Create payload file
local payload_file="/tmp/stress_payload.json"
```


medium

Using a fixed filename in /tmp for the payload file can cause conflicts if multiple instances of this script run concurrently. This can lead to race conditions and unpredictable behavior. It's safer to use mktemp to generate a unique temporary filename.

Suggested change:

```bash
local payload_file="/tmp/stress_payload.json"
```

```bash
local payload_file=$(mktemp /tmp/stress_payload.XXXXXX.json)
```
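As a follow-on to the mktemp suggestion (my addition, not part of the review comment): pairing it with a trap guarantees the unique file is removed however the script exits.

```bash
# Hypothetical cleanup pattern to pair with mktemp; the file is removed on
# any exit path, so repeated or aborted runs don't accumulate temp files.
payload_file=$(mktemp /tmp/stress_payload.XXXXXX.json)
trap 'rm -f "$payload_file"' EXIT
```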

kobe0938 and others added 2 commits July 2, 2025 00:54
Signed-off-by: Kobe Chen <[email protected]>
per gemini's suggestion

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Kobe Chen <[email protected]>
@kobe0938 force-pushed the stress-test-router branch from 37fda5d to 0e79e3d on July 2, 2025 08:09
Signed-off-by: Kobe Chen <[email protected]>

```yaml
  router-stress-test:
    runs-on: self-hosted
    needs: e2e-test
```
Collaborator left a comment:

The stress test should mainly focus on performance rather than routing correctness, since the latter is covered in the discovery test. Therefore this action can be a separate workflow and can be further pipelined with other workflows.
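A hypothetical shape for that split, assuming the existing e2e workflow is the upstream trigger; the workflow names are assumptions, not files in this PR.

```yaml
# Hypothetical standalone workflow; the upstream workflow name is an assumption.
name: Router Stress Test
on:
  workflow_run:
    workflows: ["E2E tests"]
    types: [completed]

jobs:
  router-stress-test:
    # Only run after the e2e workflow has finished successfully.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: self-hosted
```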

```yaml
      with:
        name: router-stress-test-results-pr-${{ github.event.pull_request.number || 'main' }}
        path: |
          ${{ env.LOG_DIR }}/*
```
Collaborator left a comment:

A more concise result that better showcases the performance can be uploaded, as the current log file is too large.

```diff
@@ -231,3 +231,62 @@ jobs:
         pkill -f "python3 -m src.vllm_router.app" || true

     - run: echo "🍏 Static discovery e2e test job status is ${{ job.status }}."

+  router-stress-test:
+    runs-on: self-hosted
```
Collaborator left a comment:

Can a GitHub-hosted runner handle this task, since no real model is needed?

```bash
  --requests 100000 \
  --port 30080 \
  --log-dir "$LOG_DIR" \
  --model "facebook/opt-125m" \
```
Collaborator left a comment:

It would be better to deprecate these fields to improve clarity.

```python
request.app.state.request_stats_monitor.on_request_response(
    backend_url, request_id, time.time()
)
yield mock_response
```
Collaborator left a comment:

Can we use a stub HTTP server or dummy backend server instead? That way we would not need to modify the source code.
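To make the suggestion concrete, here is a minimal sketch of such a dummy backend, assuming FastAPI and uvicorn are available; the port and payload are illustrative. Pointing the router's static backend list at a few of these would exercise real HTTP hops without any change to request.py.

```python
# Hypothetical dummy backend for stress testing; returns the same canned
# chat completion for every request, so only the router's logic is measured.
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions() -> dict:
    return {
        "id": "test",
        "object": "chat.completion",
        "choices": [{
            "message": {"role": "assistant", "content": "Test"},
            "index": 0,
            "finish_reason": "stop",
        }],
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=9001)
```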
