
[TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf#12337

Open
reasonsolo wants to merge 7 commits into NVIDIA:main from reasonsolo:lizhiz/disagg-kvcache-router-fix

Conversation

@reasonsolo
Collaborator

@reasonsolo reasonsolo commented Mar 19, 2026

Summary by CodeRabbit

  • New Features
    • Added aiperf-based benchmarking capability for disaggregated serving with configurable concurrency levels, streaming support, and performance metrics collection.
    • Enhanced chat completion request handling with optimized tokenizer caching to improve performance.
    • Added router configuration support in benchmark submission workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from ed09381 to 0652b35 on March 19, 2026 04:50
@reasonsolo reasonsolo marked this pull request as ready for review March 19, 2026 04:56
@reasonsolo reasonsolo requested review from a team as code owners March 19, 2026 04:56
@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API" to "[None][fix] fix disagg kvcache router for chat API; add disagg benchmark for ai_perf" on Mar 19, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

This pull request introduces aiperf-based disaggregated serving benchmarks via a new Bash script and updates benchmark submission logic to support router configuration and aiperf selection. The router component is enhanced with tokenizer caching and ChatCompletionRequest tokenization support.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **aiperf Benchmark Script**<br>`examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh` | New executable script orchestrating aiperf benchmarks with argument validation, aiperf installation, optional warmup, concurrency-based benchmark loops, and metrics collection from the server. |
| **Benchmark Submission Updates**<br>`examples/disaggregated/slurm/benchmark/submit.py` | Added a `router_config` parameter to `convert_allocations_to_server_config`, integrated aiperf benchmark selection logic behind a `use_aiperf` flag, and added `server_config_extra` merging support. |
| **Router Tokenization Enhancements**<br>`tensorrt_llm/serve/router.py` | Added tokenizer caching via `_get_tokenizer()`, extended `_tokenize()` to handle `ChatCompletionRequest` with pre-tokenized prompt support and an `apply_chat_template()` fallback, and updated `CompletionRequest` tokenization to use cached tokenizers. |
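The tokenizer caching and request dispatch described for `tensorrt_llm/serve/router.py` can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `StubTokenizer` replaces a real HuggingFace tokenizer, and requests are plain dicts rather than `ChatCompletionRequest`/`CompletionRequest` objects.

```python
class StubTokenizer:
    """Minimal stand-in for a HuggingFace tokenizer (illustration only)."""

    def __init__(self, model: str):
        self.model = model

    def apply_chat_template(self, messages, tokenize=False):
        # Some models (e.g. Kimi-K2.5) return a rendered *string* here,
        # which is why the router encodes in a separate second step.
        return "\n".join(m["content"] for m in messages)

    def encode(self, text: str) -> list:
        return [ord(c) % 256 for c in text]


class RouterSketch:
    """Illustrates the caching + dispatch described in the summary above."""

    def __init__(self):
        self._tokenizers = {}  # model name -> cached tokenizer

    def _get_tokenizer(self, model: str) -> StubTokenizer:
        # Cache per model so repeated requests skip tokenizer construction.
        if model not in self._tokenizers:
            self._tokenizers[model] = StubTokenizer(model)
        return self._tokenizers[model]

    def _tokenize(self, request: dict) -> list:
        tok = self._get_tokenizer(request["model"])
        if request.get("prompt_token_ids"):
            return request["prompt_token_ids"]  # pre-tokenized: reuse as-is
        if "messages" in request:  # chat-style request
            text = tok.apply_chat_template(request["messages"], tokenize=False)
            return tok.encode(text)  # two-step tokenize
        return tok.encode(request["prompt"])  # completion-style request
```

The two-step path (`apply_chat_template(tokenize=False)` then `encode()`) matters for models whose chat template returns a string rather than token IDs.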

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as SLURM Benchmark Script
    participant Env as Environment/System
    participant Server as HTTP Server
    participant Aiperf as aiperf Tool
    participant Metrics as Metrics Endpoint

    Script->>Env: Validate 10+ arguments
    Script->>Env: Force-reinstall aiperf

    alt ucx_warmup_requests > 0
        Script->>Server: Run warmup with random inputs (non-streaming)
        Server-->>Script: Warmup complete
    end

    Script->>Env: Export HF_HUB_TRUST_REMOTE_CODE=1

    loop For each concurrency value
        Script->>Env: Create artifact directory
        Script->>Aiperf: Run profile with concurrency, mooncake_trace dataset
        Aiperf->>Server: Send profiling requests to http://hostname:port
        Server-->>Aiperf: Return responses
        Aiperf-->>Script: Benchmark complete, artifacts saved
    end

    Script->>Metrics: Fetch /perf_metrics
    Metrics-->>Script: Return performance metrics JSON
    Script->>Env: Save metrics to log_path/perf_metrics.json
```
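The flow above can be sketched as a dry-run shell function that prints the commands it would execute. Flag names beyond `--streaming`, the mooncake_trace dataset, and the `/perf_metrics` endpoint mentioned in the PR are placeholders, not the script's actual interface.

```shell
# Dry-run sketch of the benchmark loop; prints commands instead of running them.
run_benchmark_sketch() {
    local hostname=$1 port=$2 log_path=$3
    shift 3
    export HF_HUB_TRUST_REMOTE_CODE=1
    local concurrency artifact_dir
    for concurrency in "$@"; do
        artifact_dir="${log_path}/aiperf_c${concurrency}"
        echo "mkdir -p ${artifact_dir}"
        # Exact aiperf flags are assumptions sketched from the walkthrough.
        echo "aiperf profile --url http://${hostname}:${port} --concurrency ${concurrency} --streaming --artifact-dir ${artifact_dir}"
    done
    # Collect server-side metrics after all runs complete.
    echo "curl -s http://${hostname}:${port}/perf_metrics > ${log_path}/perf_metrics.json"
}

run_benchmark_sketch node0 8000 /tmp/logs 1 8 32
```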

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description is largely empty: only the template structure is present, with no substantive explanation of changes, rationale, test coverage, or checklist completion beyond a single unchecked item. | Fill in the Description section explaining the issue and solution, document test coverage, and complete the PR Checklist items appropriately for the changes made. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing the disaggregated KV-cache router for chat API support and adding disaggregated benchmark support for ai_perf. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can approve the review once all CodeRabbit's comments are resolved.

Enable the reviews.request_changes_workflow setting to automatically approve the review once all CodeRabbit's comments are resolved.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh (1)

16-26: Unused variables: `num_gen_servers` and `streaming`.

These variables are assigned but never used. Notably, the aiperf profile command always passes `--streaming` unconditionally (line 76), ignoring the `$streaming` argument.

If `--streaming` should be conditional on the argument, consider:

```diff
-        --streaming \
+        $( [ "$streaming" = "true" ] && echo "--streaming" ) \
```

Otherwise, if they're intentionally unused (kept for argument consistency with other scripts), prefix them with `_` to suppress warnings:

```diff
-num_gen_servers=$4
+_num_gen_servers=$4
 ...
-streaming=$6
+_streaming=$6
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh` around lines
16 - 26, The script defines unused variables num_gen_servers and streaming and
always passes --streaming to the aiperf profile invocation; either make
streaming drive the flag or mark unused vars to avoid warnings: update the
aiperf profile call (the line that constructs the command invoking "aiperf
profile") to include --streaming only when the variable streaming is truthy, or
alternatively rename num_gen_servers and streaming to _num_gen_servers and
_streaming to signal they are intentionally unused; ensure you only touch the
variable names and the conditional around the aiperf profile call (or the
renames) so no other behavior changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh`:
- Around line 10-14: The error message in run_benchmark_aiperf.sh mixes "$@"
with other strings which triggers shellcheck SC2145; update the echo that prints
arguments to use "$*" (or echo the prefix and then "$@" as a separate argument)
so argument expansion is safe—locate the if block checking "$#" and modify the
line echo "Error: Missing required arguments, got $# arguments, args: $@" to use
"$*" (or split into two echo args) to avoid unexpected expansion.

In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Around line 116-117: The function signature for
convert_allocations_to_server_config is not formatted to project style; reflow
its parameters so yapf accepts it (e.g., put each parameter on its own indented
line or otherwise match the repo's function-signature wrapping rules) — ensure
server_port and router_config appear on separate wrapped lines and the
opening/closing parens align with project style, then run yapf (or the project's
formatter) to verify the change.

---

Nitpick comments:
In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh`:
- Around line 16-26: The script defines unused variables num_gen_servers and
streaming and always passes --streaming to the aiperf profile invocation; either
make streaming drive the flag or mark unused vars to avoid warnings: update the
aiperf profile call (the line that constructs the command invoking "aiperf
profile") to include --streaming only when the variable streaming is truthy, or
alternatively rename num_gen_servers and streaming to _num_gen_servers and
_streaming to signal they are intentionally unused; ensure you only touch the
variable names and the conditional around the aiperf profile call (or the
renames) so no other behavior changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ca144201-c106-4f9e-ae98-4fcbf84a6cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 2fb5e80 and 0652b35.

📒 Files selected for processing (3)
  • examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh
  • examples/disaggregated/slurm/benchmark/submit.py
  • tensorrt_llm/serve/router.py

@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API; add disagg benchmark for ai_perf" to "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" on Mar 19, 2026
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39591 [ run ] triggered by Bot. Commit: 0652b35 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39591 [ run ] completed with state SUCCESS. Commit: 0652b35
/LLM/main/L0_MergeRequest_PR pipeline #30804 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" to "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" on Mar 20, 2026
@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from 64a6258 to d5c1d1d on March 20, 2026 09:55
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" to "[TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" on Mar 20, 2026
@tensorrt-cicd
Collaborator

PR_Github #39730 [ run ] triggered by Bot. Commit: d5c1d1d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39730 [ run ] completed with state SUCCESS. Commit: d5c1d1d
/LLM/main/L0_MergeRequest_PR pipeline #30925 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39867 [ run ] triggered by Bot. Commit: a47d932 Link to invocation

@reasonsolo reasonsolo enabled auto-merge (squash) March 23, 2026 03:37
@tensorrt-cicd
Collaborator

PR_Github #39867 [ run ] completed with state SUCCESS. Commit: a47d932
/LLM/main/L0_MergeRequest_PR pipeline #31039 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from a47d932 to 54f56ac on March 23, 2026 11:53
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@pcastonguay
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40329 [ run ] triggered by Bot. Commit: bfda312 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40329 [ run ] completed with state SUCCESS. Commit: bfda312
/LLM/main/L0_MergeRequest_PR pipeline #31438 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40476 [ run ] triggered by Bot. Commit: bfda312 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40476 [ run ] completed with state SUCCESS. Commit: bfda312
/LLM/main/L0_MergeRequest_PR pipeline #31563 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from bfda312 to 41888e8 on March 27, 2026 04:24
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40480 [ run ] triggered by Bot. Commit: 41888e8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40480 [ run ] completed with state SUCCESS. Commit: 41888e8
/LLM/main/L0_MergeRequest_PR pipeline #31568 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

… support to disagg benchmark

- Fix KvCacheAwareRouter._tokenize() to handle ChatCompletionRequest
  with models whose apply_chat_template returns a string (e.g. Kimi-K2.5).
  Use two-step tokenize: apply_chat_template(tokenize=False) then encode().
- Extract _get_tokenizer() helper to avoid duplication.
- Add use_aiperf benchmark mode to submit.py for mooncake_trace datasets.
- Add router_config support to convert_allocations_to_server_config() so
  kv_cache_aware routing is included in the generated server_config.yaml.
- Add server_config_extra merge for perf_metrics and other server options.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
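The `server_config_extra` merge mentioned in the commit above can be illustrated with a recursive dict merge. The function name and config shapes here are assumptions for illustration, not the actual `submit.py` code:

```python
import copy

def merge_server_config(base: dict, extra: dict) -> dict:
    """Recursively merge `extra` into a copy of `base`; `extra` wins on leaves."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_server_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# A router_config block and extra server options merged into a base config:
base = {"router": {"type": "round_robin"}, "port": 8000}
extra = {"router": {"type": "kv_cache_aware"}, "perf_metrics": True}
merged = merge_server_config(base, extra)
```

Deep-copying the base keeps the merge side-effect free, so repeated submissions reusing the same base config are not silently mutated.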
…marking

Script used by submit.py when use_aiperf=true. Installs the correct
aiperf version with trust_remote_code support and runs mooncake_trace
dataset benchmarks against the disagg serving endpoint.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…id re-tokenization

KvCacheAwareRouter now sets prompt_token_ids (ChatCompletionRequest) or
replaces prompt with token IDs (CompletionRequest) after tokenizing,
so the downstream worker server skips redundant tokenization.

Also adds proper ChatCompletionRequest handling via apply_chat_template.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
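The write-back described in the commit above (router stores token IDs on the request so the worker skips re-tokenizing) can be sketched with stand-in request types. The two mechanisms, a dedicated `prompt_token_ids` field for chat requests versus replacing `prompt` for completion requests, follow the commit message; the class and function names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ChatRequestStub:
    messages: list
    prompt_token_ids: Optional[list] = None

@dataclass
class CompletionRequestStub:
    prompt: Union[str, list]  # str before routing, token IDs after

def attach_token_ids(request, token_ids):
    # Write the router's tokenization result back onto the request so the
    # downstream worker server can skip redundant tokenization.
    if isinstance(request, ChatRequestStub):
        request.prompt_token_ids = token_ids  # chat: dedicated field
    else:
        request.prompt = token_ids  # completion: replace the prompt
    return request
```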
…eRouter

Test that consecutive turns of a multi-turn conversation are routed to
the same server due to KV cache prefix hits. Simulates two concurrent
sessions with disjoint token sequences inspired by agentic_data traces,
verifying that accumulated context prefixes correctly match cached blocks
across 3 turns.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…hat APIs

Add api_type parametrize ("completion", "chat") to
test_kv_cache_aware_router_multi_turn_conversation so it exercises both
CompletionRequest (prompt with token IDs) and ChatCompletionRequest
(prompt_token_ids) code paths through the KvCacheAwareRouter.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…v var

Allow overriding the KvCacheAwareRouter tokens_per_block via environment
variable. When set, the env var takes precedence over the constructor
default (32) and the YAML config value. The effective value is logged
at router initialization.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
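The precedence described in the env-var commit above (environment variable over YAML config over the constructor default of 32) can be sketched as a small resolver. The variable name `TRTLLM_ROUTER_TOKENS_PER_BLOCK` is an assumption for illustration; the commit message does not spell out the actual name:

```python
import os

DEFAULT_TOKENS_PER_BLOCK = 32  # constructor default mentioned in the commit

def resolve_tokens_per_block(yaml_value=None,
                             env_var="TRTLLM_ROUTER_TOKENS_PER_BLOCK"):
    # Precedence: environment variable > YAML config value > default.
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return int(env_value)
    if yaml_value is not None:
        return int(yaml_value)
    return DEFAULT_TOKENS_PER_BLOCK
```

Logging the effective value at router initialization, as the commit does, makes it easy to confirm which source won.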
@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from 41888e8 to 2d0e8d1 on March 27, 2026 14:26
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40504 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40504 [ run ] completed with state SUCCESS. Commit: 2d0e8d1
/LLM/main/L0_MergeRequest_PR pipeline #31592 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40521 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40521 [ run ] completed with state SUCCESS. Commit: 2d0e8d1
/LLM/main/L0_MergeRequest_PR pipeline #31607 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40539 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40539 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40550 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40550 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

