[TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf #12337

reasonsolo wants to merge 7 commits into NVIDIA:main from
Conversation
Force-pushed ed09381 to 0652b35 (Compare)
📝 Walkthrough

This pull request introduces aiperf-based disaggregated serving benchmarks via a new Bash script and updates benchmark submission logic to support router configuration and aiperf selection. The router component is enhanced with tokenizer caching and ChatCompletionRequest tokenization support.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as SLURM Benchmark Script
    participant Env as Environment/System
    participant Server as HTTP Server
    participant Aiperf as aiperf Tool
    participant Metrics as Metrics Endpoint
    Script->>Env: Validate 10+ arguments
    Script->>Env: Force-reinstall aiperf
    alt ucx_warmup_requests > 0
        Script->>Server: Run warmup with random inputs (non-streaming)
        Server-->>Script: Warmup complete
    end
    Script->>Env: Export HF_HUB_TRUST_REMOTE_CODE=1
    loop For each concurrency value
        Script->>Env: Create artifact directory
        Script->>Aiperf: Run profile with concurrency, mooncake_trace dataset
        Aiperf->>Server: Send profiling requests to http://hostname:port
        Server-->>Aiperf: Return responses
        Aiperf-->>Script: Benchmark complete, artifacts saved
    end
    Script->>Metrics: Fetch /perf_metrics
    Metrics-->>Script: Return performance metrics JSON
    Script->>Env: Save metrics to log_path/perf_metrics.json
```
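The per-concurrency loop in the diagram can be sketched as a dry-run shell script. The flag names (`--concurrency`, `--artifact-dir`) and variable names below are assumptions for illustration, not the actual contents of run_benchmark_aiperf.sh; commands are echoed rather than executed so the sketch runs anywhere:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the per-concurrency benchmark loop. Flag and
# variable names are illustrative assumptions, not the real script.
set -euo pipefail

hostname=localhost
port=8000
log_path=./bench_logs
concurrency_list="1 4 16"

export HF_HUB_TRUST_REMOTE_CODE=1

for concurrency in $concurrency_list; do
    artifact_dir="${log_path}/aiperf_c${concurrency}"
    mkdir -p "$artifact_dir"
    # Echo the command instead of executing it, so no server is needed.
    echo aiperf profile \
        --url "http://${hostname}:${port}" \
        --concurrency "$concurrency" \
        --streaming \
        --artifact-dir "$artifact_dir"
done

# Afterwards the script would fetch server-side metrics:
echo curl "http://${hostname}:${port}/perf_metrics" -o "${log_path}/perf_metrics.json"
```

Each concurrency value gets its own artifact directory, matching the "Create artifact directory" step in the diagram.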
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 2
🧹 Nitpick comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh (1)
16-26: Unused variables: num_gen_servers and streaming.

These variables are assigned but never used. Notably, the aiperf profile command always uses --streaming unconditionally (line 76), ignoring the $streaming argument.

If --streaming should be conditional based on the argument, consider:

```diff
-    --streaming \
+    $( [ "$streaming" = "true" ] && echo "--streaming" ) \
```

Otherwise, if they're intentionally unused (kept for argument consistency with other scripts), prefix them with _ to suppress warnings:

```diff
-num_gen_servers=$4
+_num_gen_servers=$4
 ...
-streaming=$6
+_streaming=$6
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh` around lines 16 - 26, The script defines unused variables num_gen_servers and streaming and always passes --streaming to the aiperf profile invocation; either make streaming drive the flag or mark unused vars to avoid warnings: update the aiperf profile call (the line that constructs the command invoking "aiperf profile") to include --streaming only when the variable streaming is truthy, or alternatively rename num_gen_servers and streaming to _num_gen_servers and _streaming to signal they are intentionally unused; ensure you only touch the variable names and the conditional around the aiperf profile call (or the renames) so no other behavior changes.
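As a general Bash pattern, a conditional flag is often safer to express with an argument array than with command substitution, since the array avoids word-splitting surprises. This is a standalone sketch of that pattern, not the script's actual code; the echo stands in for the real aiperf invocation:

```shell
#!/usr/bin/env bash
# Sketch: include --streaming only when $streaming is "true", using an
# argument array instead of command substitution.
set -euo pipefail

streaming="true"   # would come from a positional argument ($6) in the real script

args=(profile --url "http://localhost:8000")
if [ "$streaming" = "true" ]; then
    args+=(--streaming)
fi

# In the real script this would be: aiperf "${args[@]}"
echo "aiperf ${args[*]}"
```

Quoting `"${args[@]}"` at the call site keeps arguments with spaces intact, which the `$( ... && echo ... )` form does not guarantee.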
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh`:
- Around line 10-14: The error message in run_benchmark_aiperf.sh mixes "$@"
with other strings which triggers shellcheck SC2145; update the echo that prints
arguments to use "$*" (or echo the prefix and then "$@" as a separate argument)
so argument expansion is safe—locate the if block checking "$#" and modify the
line echo "Error: Missing required arguments, got $# arguments, args: $@" to use
"$*" (or split into two echo args) to avoid unexpected expansion.
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Around line 116-117: The function signature for
convert_allocations_to_server_config is not formatted to project style; reflow
its parameters so yapf accepts it (e.g., put each parameter on its own indented
line or otherwise match the repo's function-signature wrapping rules) — ensure
server_port and router_config appear on separate wrapped lines and the
opening/closing parens align with project style, then run yapf (or the project's
formatter) to verify the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ca144201-c106-4f9e-ae98-4fcbf84a6cf7
📒 Files selected for processing (3)

- examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh
- examples/disaggregated/slurm/benchmark/submit.py
- tensorrt_llm/serve/router.py
/bot run --disable-fail-fast

PR_Github #39591 [ run ] triggered by Bot. Commit:
PR_Github #39591 [ run ] completed with state
Force-pushed 64a6258 to d5c1d1d (Compare)
/bot run --disable-fail-fast

PR_Github #39730 [ run ] triggered by Bot. Commit:
PR_Github #39730 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #39867 [ run ] triggered by Bot. Commit:
PR_Github #39867 [ run ] completed with state
Force-pushed a47d932 to 54f56ac (Compare)
/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #40228 [ run ] triggered by Bot. Commit:
PR_Github #40228 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40329 [ run ] triggered by Bot. Commit:
PR_Github #40329 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40476 [ run ] triggered by Bot. Commit:
PR_Github #40476 [ run ] completed with state
Force-pushed bfda312 to 41888e8 (Compare)
/bot run --disable-fail-fast

PR_Github #40480 [ run ] triggered by Bot. Commit:
PR_Github #40480 [ run ] completed with state
… support to disagg benchmark

- Fix KvCacheAwareRouter._tokenize() to handle ChatCompletionRequest with models whose apply_chat_template returns a string (e.g. Kimi-K2.5). Use two-step tokenize: apply_chat_template(tokenize=False) then encode().
- Extract _get_tokenizer() helper to avoid duplication.
- Add use_aiperf benchmark mode to submit.py for mooncake_trace datasets.
- Add router_config support to convert_allocations_to_server_config() so kv_cache_aware routing is included in the generated server_config.yaml.
- Add server_config_extra merge for perf_metrics and other server options.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
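The two-step tokenize described in this commit can be sketched as follows. The tokenizer here is a stub so the sketch runs standalone; a real implementation would call a Hugging Face tokenizer, whose `apply_chat_template(..., tokenize=False)` renders messages to a prompt string that is then passed to `encode()`:

```python
# Sketch of the two-step tokenize path: render the chat template to text,
# then encode the text. StubTokenizer is a stand-in for a real HF tokenizer.

class StubTokenizer:
    def apply_chat_template(self, messages, tokenize=False,
                            add_generation_prompt=True):
        # Real chat templates render messages into a single prompt string.
        return "".join(f"<{m['role']}>{m['content']}" for m in messages)

    def encode(self, text):
        # Stand-in for real subword tokenization.
        return [ord(c) % 256 for c in text]


def tokenize_chat_request(tokenizer, messages):
    """Two-step tokenize: render the chat template, then encode the string."""
    rendered = tokenizer.apply_chat_template(messages, tokenize=False,
                                             add_generation_prompt=True)
    if isinstance(rendered, str):
        return tokenizer.encode(rendered)
    return rendered  # some tokenizers may already return token IDs


messages = [{"role": "user", "content": "hi"}]
token_ids = tokenize_chat_request(StubTokenizer(), messages)
print(len(token_ids) > 0)  # → True
```

The `isinstance` check covers models (like the Kimi-K2.5 case mentioned above) whose template application returns a string rather than token IDs.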
…marking

Script used by submit.py when use_aiperf=true. Installs the correct aiperf version with trust_remote_code support and runs mooncake_trace dataset benchmarks against the disagg serving endpoint.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…id re-tokenization

KvCacheAwareRouter now sets prompt_token_ids (ChatCompletionRequest) or replaces prompt with token IDs (CompletionRequest) after tokenizing, so the downstream worker server skips redundant tokenization. Also adds proper ChatCompletionRequest handling via apply_chat_template.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…eRouter

Test that consecutive turns of a multi-turn conversation are routed to the same server due to KV cache prefix hits. Simulates two concurrent sessions with disjoint token sequences inspired by agentic_data traces, verifying that accumulated context prefixes correctly match cached blocks across 3 turns.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
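The routing property being tested can be illustrated with a minimal prefix-matching model: pick the server whose cached block hashes share the longest prefix with the request's token blocks. The block hashing and server bookkeeping below are illustrative only, not the actual KvCacheAwareRouter implementation:

```python
# Minimal model of KV-cache-aware routing by longest cached block prefix.
TOKENS_PER_BLOCK = 4


def block_hashes(token_ids):
    """Split token IDs into fixed-size blocks; chain each hash to its prefix."""
    hashes, prev = [], 0
    full = len(token_ids) - len(token_ids) % TOKENS_PER_BLOCK
    for i in range(0, full, TOKENS_PER_BLOCK):
        prev = hash((prev, tuple(token_ids[i:i + TOKENS_PER_BLOCK])))
        hashes.append(prev)
    return hashes


def route(servers, token_ids):
    """Return the server with the longest matching cached block prefix."""
    blocks = block_hashes(token_ids)

    def match_len(cached):
        n = 0
        for a, b in zip(blocks, cached):
            if a != b:
                break
            n += 1
        return n

    best = max(servers, key=lambda s: match_len(servers[s]))
    servers[best] = blocks  # simulate this server's cache holding the sequence
    return best


servers = {"gen0": [], "gen1": []}
turn1 = list(range(16))
first = route(servers, turn1)
# The next turn extends the same context, so its block prefix matches:
second = route(servers, turn1 + list(range(16, 24)))
print(first == second)  # → True
```

Because each turn's token sequence is a strict extension of the previous one, the accumulated prefix keeps matching the same server's cached blocks, which is what the multi-turn test verifies.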
…hat APIs
Add api_type parametrize ("completion", "chat") to
test_kv_cache_aware_router_multi_turn_conversation so it exercises both
CompletionRequest (prompt with token IDs) and ChatCompletionRequest
(prompt_token_ids) code paths through the KvCacheAwareRouter.
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…v var

Allow overriding the KvCacheAwareRouter tokens_per_block via environment variable. When set, the env var takes precedence over the constructor default (32) and the YAML config value. The effective value is logged at router initialization.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
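The precedence order described above (env var > YAML value > constructor default of 32) can be sketched as follows. The variable name TRTLLM_ROUTER_TOKENS_PER_BLOCK is hypothetical; the commit message does not state the actual env var name:

```python
# Sketch of env-var precedence for tokens_per_block. The env var name
# below is a hypothetical placeholder, not the real one.
import os

DEFAULT_TOKENS_PER_BLOCK = 32


def resolve_tokens_per_block(yaml_value=None):
    """Env var (if set) > YAML config value > constructor default."""
    env = os.environ.get("TRTLLM_ROUTER_TOKENS_PER_BLOCK")
    if env is not None:
        return int(env)
    if yaml_value is not None:
        return yaml_value
    return DEFAULT_TOKENS_PER_BLOCK


os.environ.pop("TRTLLM_ROUTER_TOKENS_PER_BLOCK", None)
print(resolve_tokens_per_block())               # → 32 (default)
print(resolve_tokens_per_block(yaml_value=16))  # → 16 (YAML wins over default)
os.environ["TRTLLM_ROUTER_TOKENS_PER_BLOCK"] = "64"
print(resolve_tokens_per_block(yaml_value=16))  # → 64 (env var wins)
```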
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Force-pushed 41888e8 to 2d0e8d1 (Compare)
/bot run --disable-fail-fast

PR_Github #40504 [ run ] triggered by Bot. Commit:
PR_Github #40504 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40521 [ run ] triggered by Bot. Commit:
PR_Github #40521 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40539 [ run ] triggered by Bot. Commit:
PR_Github #40539 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.