
[TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf#12337

Open
reasonsolo wants to merge 7 commits into NVIDIA:main from reasonsolo:lizhiz/disagg-kvcache-router-fix

Conversation

@reasonsolo
Collaborator

@reasonsolo reasonsolo commented Mar 19, 2026

Summary by CodeRabbit

  • New Features
    • Added aiperf-based benchmarking capability for disaggregated serving with configurable concurrency levels, streaming support, and performance metrics collection.
    • Enhanced chat completion request handling with optimized tokenizer caching to improve performance.
    • Added router configuration support in benchmark submission workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from ed09381 to 0652b35 on March 19, 2026 04:50
@reasonsolo reasonsolo marked this pull request as ready for review March 19, 2026 04:56
@reasonsolo reasonsolo requested review from a team as code owners March 19, 2026 04:56
@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API" to "[None][fix] fix disagg kvcache router for chat API; add disagg benchmark for ai_perf" on Mar 19, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

This pull request introduces aiperf-based disaggregated serving benchmarks via a new Bash script and updates benchmark submission logic to support router configuration and aiperf selection. The router component is enhanced with tokenizer caching and ChatCompletionRequest tokenization support.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **aiperf Benchmark Script**<br>`examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh` | New executable script orchestrating aiperf benchmarks with argument validation, aiperf installation, optional warmup, concurrency-based benchmark loops, and metrics collection from the server. |
| **Benchmark Submission Updates**<br>`examples/disaggregated/slurm/benchmark/submit.py` | Added a `router_config` parameter to `convert_allocations_to_server_config`, integrated aiperf benchmark selection logic behind a `use_aiperf` flag, and added `server_config_extra` merging support. |
| **Router Tokenization Enhancements**<br>`tensorrt_llm/serve/router.py` | Added tokenizer caching via `_get_tokenizer()`, extended `_tokenize()` to handle `ChatCompletionRequest` with pre-tokenized prompt support and an `apply_chat_template()` fallback, and updated `CompletionRequest` tokenization to use cached tokenizers. |
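The tokenizer caching and request dispatch described for `tensorrt_llm/serve/router.py` can be sketched as follows. This is an illustrative stand-in, not the PR's actual code: `StubTokenizer` replaces a real HuggingFace tokenizer, and requests are plain dicts rather than `ChatCompletionRequest`/`CompletionRequest` objects.

```python
class StubTokenizer:
    """Minimal stand-in for a HuggingFace tokenizer (illustration only)."""

    def __init__(self, model: str):
        self.model = model

    def apply_chat_template(self, messages, tokenize=False):
        # Some models (e.g. Kimi-K2.5) return a rendered *string* here,
        # which is why the router encodes in a separate second step.
        return "\n".join(m["content"] for m in messages)

    def encode(self, text: str) -> list:
        return [ord(c) % 256 for c in text]


class RouterSketch:
    """Illustrates the caching + dispatch described in the summary above."""

    def __init__(self):
        self._tokenizers = {}  # model name -> cached tokenizer

    def _get_tokenizer(self, model: str) -> StubTokenizer:
        # Cache per model so repeated requests skip tokenizer construction.
        if model not in self._tokenizers:
            self._tokenizers[model] = StubTokenizer(model)
        return self._tokenizers[model]

    def _tokenize(self, request: dict) -> list:
        tok = self._get_tokenizer(request["model"])
        if request.get("prompt_token_ids"):
            return request["prompt_token_ids"]  # pre-tokenized: reuse as-is
        if "messages" in request:  # chat-style request
            text = tok.apply_chat_template(request["messages"], tokenize=False)
            return tok.encode(text)  # two-step tokenize
        return tok.encode(request["prompt"])  # completion-style request
```

The two-step path (`apply_chat_template(tokenize=False)` then `encode()`) matters for models whose chat template returns a string rather than token IDs.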

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Script as SLURM Benchmark Script
    participant Env as Environment/System
    participant Server as HTTP Server
    participant Aiperf as aiperf Tool
    participant Metrics as Metrics Endpoint

    Script->>Env: Validate 10+ arguments
    Script->>Env: Force-reinstall aiperf

    alt ucx_warmup_requests > 0
        Script->>Server: Run warmup with random inputs (non-streaming)
        Server-->>Script: Warmup complete
    end

    Script->>Env: Export HF_HUB_TRUST_REMOTE_CODE=1

    loop For each concurrency value
        Script->>Env: Create artifact directory
        Script->>Aiperf: Run profile with concurrency, mooncake_trace dataset
        Aiperf->>Server: Send profiling requests to http://hostname:port
        Server-->>Aiperf: Return responses
        Aiperf-->>Script: Benchmark complete, artifacts saved
    end

    Script->>Metrics: Fetch /perf_metrics
    Metrics-->>Script: Return performance metrics JSON
    Script->>Env: Save metrics to log_path/perf_metrics.json
```
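The flow above can be sketched as a dry-run shell function that prints the commands it would execute. Flag names beyond `--streaming`, the mooncake_trace dataset, and the `/perf_metrics` endpoint mentioned in the PR are placeholders, not the script's actual interface.

```shell
# Dry-run sketch of the benchmark loop; prints commands instead of running them.
run_benchmark_sketch() {
    local hostname=$1 port=$2 log_path=$3
    shift 3
    export HF_HUB_TRUST_REMOTE_CODE=1
    local concurrency artifact_dir
    for concurrency in "$@"; do
        artifact_dir="${log_path}/aiperf_c${concurrency}"
        echo "mkdir -p ${artifact_dir}"
        # Exact aiperf flags are assumptions sketched from the walkthrough.
        echo "aiperf profile --url http://${hostname}:${port} --concurrency ${concurrency} --streaming --artifact-dir ${artifact_dir}"
    done
    # Collect server-side metrics after all runs complete.
    echo "curl -s http://${hostname}:${port}/perf_metrics > ${log_path}/perf_metrics.json"
}

run_benchmark_sketch node0 8000 /tmp/logs 1 8 32
```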

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description is largely empty: only the template structure is present, with no substantive explanation of changes, rationale, test coverage, or checklist completion beyond a single unchecked item. | Fill in the Description section explaining the issue and solution, document test coverage, and complete the PR Checklist items appropriately for the changes made. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: fixing the disaggregated KV-cache router for chat API support and adding disaggregated benchmark support for ai_perf. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can approve the review once all CodeRabbit's comments are resolved.

Enable the reviews.request_changes_workflow setting to automatically approve the review once all CodeRabbit's comments are resolved.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh (1)

16-26: Unused variables: `num_gen_servers` and `streaming`.

These variables are assigned but never used. Notably, the aiperf profile command always passes `--streaming` unconditionally (line 76), ignoring the `$streaming` argument.

If `--streaming` should be conditional on the argument, consider:

```diff
-        --streaming \
+        $( [ "$streaming" = "true" ] && echo "--streaming" ) \
```

Otherwise, if they're intentionally unused (kept for argument consistency with other scripts), prefix them with `_` to suppress warnings:

```diff
-num_gen_servers=$4
+_num_gen_servers=$4
 ...
-streaming=$6
+_streaming=$6
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh` around lines
16 - 26, The script defines unused variables num_gen_servers and streaming and
always passes --streaming to the aiperf profile invocation; either make
streaming drive the flag or mark unused vars to avoid warnings: update the
aiperf profile call (the line that constructs the command invoking "aiperf
profile") to include --streaming only when the variable streaming is truthy, or
alternatively rename num_gen_servers and streaming to _num_gen_servers and
_streaming to signal they are intentionally unused; ensure you only touch the
variable names and the conditional around the aiperf profile call (or the
renames) so no other behavior changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh`:
- Around line 10-14: The error message in run_benchmark_aiperf.sh mixes "$@"
with other strings which triggers shellcheck SC2145; update the echo that prints
arguments to use "$*" (or echo the prefix and then "$@" as a separate argument)
so argument expansion is safe—locate the if block checking "$#" and modify the
line echo "Error: Missing required arguments, got $# arguments, args: $@" to use
"$*" (or split into two echo args) to avoid unexpected expansion.

In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Around line 116-117: The function signature for
convert_allocations_to_server_config is not formatted to project style; reflow
its parameters so yapf accepts it (e.g., put each parameter on its own indented
line or otherwise match the repo's function-signature wrapping rules) — ensure
server_port and router_config appear on separate wrapped lines and the
opening/closing parens align with project style, then run yapf (or the project's
formatter) to verify the change.

---

Nitpick comments:
In `@examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh`:
- Around line 16-26: The script defines unused variables num_gen_servers and
streaming and always passes --streaming to the aiperf profile invocation; either
make streaming drive the flag or mark unused vars to avoid warnings: update the
aiperf profile call (the line that constructs the command invoking "aiperf
profile") to include --streaming only when the variable streaming is truthy, or
alternatively rename num_gen_servers and streaming to _num_gen_servers and
_streaming to signal they are intentionally unused; ensure you only touch the
variable names and the conditional around the aiperf profile call (or the
renames) so no other behavior changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ca144201-c106-4f9e-ae98-4fcbf84a6cf7

📥 Commits

Reviewing files that changed from the base of the PR and between 2fb5e80 and 0652b35.

📒 Files selected for processing (3)
  • examples/disaggregated/slurm/benchmark/run_benchmark_aiperf.sh
  • examples/disaggregated/slurm/benchmark/submit.py
  • tensorrt_llm/serve/router.py

@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API; add disagg benchmark for ai_perf" to "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" on Mar 19, 2026
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39591 [ run ] triggered by Bot. Commit: 0652b35 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39591 [ run ] completed with state SUCCESS. Commit: 0652b35
/LLM/main/L0_MergeRequest_PR pipeline #30804 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" to "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" on Mar 20, 2026
@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from 64a6258 to d5c1d1d on March 20, 2026 09:55
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@reasonsolo reasonsolo changed the title from "[None][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" to "[TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf" on Mar 20, 2026
@tensorrt-cicd
Collaborator

PR_Github #39730 [ run ] triggered by Bot. Commit: d5c1d1d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39730 [ run ] completed with state SUCCESS. Commit: d5c1d1d
/LLM/main/L0_MergeRequest_PR pipeline #30925 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39867 [ run ] triggered by Bot. Commit: a47d932 Link to invocation

@reasonsolo reasonsolo enabled auto-merge (squash) March 23, 2026 03:37
@tensorrt-cicd
Collaborator

PR_Github #39867 [ run ] completed with state SUCCESS. Commit: a47d932
/LLM/main/L0_MergeRequest_PR pipeline #31039 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from a47d932 to 54f56ac on March 23, 2026 11:53
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@pcastonguay
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40329 [ run ] triggered by Bot. Commit: bfda312 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40329 [ run ] completed with state SUCCESS. Commit: bfda312
/LLM/main/L0_MergeRequest_PR pipeline #31438 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40476 [ run ] triggered by Bot. Commit: bfda312 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40476 [ run ] completed with state SUCCESS. Commit: bfda312
/LLM/main/L0_MergeRequest_PR pipeline #31563 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from bfda312 to 41888e8 on March 27, 2026 04:24
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40480 [ run ] triggered by Bot. Commit: 41888e8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40480 [ run ] completed with state SUCCESS. Commit: 41888e8
/LLM/main/L0_MergeRequest_PR pipeline #31568 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

… support to disagg benchmark

- Fix KvCacheAwareRouter._tokenize() to handle ChatCompletionRequest
  with models whose apply_chat_template returns a string (e.g. Kimi-K2.5).
  Use two-step tokenize: apply_chat_template(tokenize=False) then encode().
- Extract _get_tokenizer() helper to avoid duplication.
- Add use_aiperf benchmark mode to submit.py for mooncake_trace datasets.
- Add router_config support to convert_allocations_to_server_config() so
  kv_cache_aware routing is included in the generated server_config.yaml.
- Add server_config_extra merge for perf_metrics and other server options.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
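The `server_config_extra` merge mentioned in the commit above can be illustrated with a recursive dict merge. The function name and config shapes here are assumptions for illustration, not the actual `submit.py` code:

```python
import copy

def merge_server_config(base: dict, extra: dict) -> dict:
    """Recursively merge `extra` into a copy of `base`; `extra` wins on leaves."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_server_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# A router_config block and extra server options merged into a base config:
base = {"router": {"type": "round_robin"}, "port": 8000}
extra = {"router": {"type": "kv_cache_aware"}, "perf_metrics": True}
merged = merge_server_config(base, extra)
```

Deep-copying the base keeps the merge side-effect free, so repeated submissions reusing the same base config are not silently mutated.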
…marking

Script used by submit.py when use_aiperf=true. Installs the correct
aiperf version with trust_remote_code support and runs mooncake_trace
dataset benchmarks against the disagg serving endpoint.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…id re-tokenization

KvCacheAwareRouter now sets prompt_token_ids (ChatCompletionRequest) or
replaces prompt with token IDs (CompletionRequest) after tokenizing,
so the downstream worker server skips redundant tokenization.

Also adds proper ChatCompletionRequest handling via apply_chat_template.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
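The write-back described in the commit above (router stores token IDs on the request so the worker skips re-tokenizing) can be sketched with stand-in request types. The two mechanisms, a dedicated `prompt_token_ids` field for chat requests versus replacing `prompt` for completion requests, follow the commit message; the class and function names are illustrative:

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class ChatRequestStub:
    messages: list
    prompt_token_ids: Optional[list] = None

@dataclass
class CompletionRequestStub:
    prompt: Union[str, list]  # str before routing, token IDs after

def attach_token_ids(request, token_ids):
    # Write the router's tokenization result back onto the request so the
    # downstream worker server can skip redundant tokenization.
    if isinstance(request, ChatRequestStub):
        request.prompt_token_ids = token_ids  # chat: dedicated field
    else:
        request.prompt = token_ids  # completion: replace the prompt
    return request
```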
…eRouter

Test that consecutive turns of a multi-turn conversation are routed to
the same server due to KV cache prefix hits. Simulates two concurrent
sessions with disjoint token sequences inspired by agentic_data traces,
verifying that accumulated context prefixes correctly match cached blocks
across 3 turns.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…hat APIs

Add api_type parametrize ("completion", "chat") to
test_kv_cache_aware_router_multi_turn_conversation so it exercises both
CompletionRequest (prompt with token IDs) and ChatCompletionRequest
(prompt_token_ids) code paths through the KvCacheAwareRouter.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…v var

Allow overriding the KvCacheAwareRouter tokens_per_block via environment
variable. When set, the env var takes precedence over the constructor
default (32) and the YAML config value. The effective value is logged
at router initialization.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
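The precedence described in the env-var commit above (environment variable over YAML config over the constructor default of 32) can be sketched as a small resolver. The variable name `TRTLLM_ROUTER_TOKENS_PER_BLOCK` is an assumption for illustration; the commit message does not spell out the actual name:

```python
import os

DEFAULT_TOKENS_PER_BLOCK = 32  # constructor default mentioned in the commit

def resolve_tokens_per_block(yaml_value=None,
                             env_var="TRTLLM_ROUTER_TOKENS_PER_BLOCK"):
    # Precedence: environment variable > YAML config value > default.
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return int(env_value)
    if yaml_value is not None:
        return int(yaml_value)
    return DEFAULT_TOKENS_PER_BLOCK
```

Logging the effective value at router initialization, as the commit does, makes it easy to confirm which source won.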
@reasonsolo reasonsolo force-pushed the lizhiz/disagg-kvcache-router-fix branch from 41888e8 to 2d0e8d1 on March 27, 2026 14:26
@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40504 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40504 [ run ] completed with state SUCCESS. Commit: 2d0e8d1
/LLM/main/L0_MergeRequest_PR pipeline #31592 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40521 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40521 [ run ] completed with state SUCCESS. Commit: 2d0e8d1
/LLM/main/L0_MergeRequest_PR pipeline #31607 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40539 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40539 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

@reasonsolo
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40550 [ run ] triggered by Bot. Commit: 2d0e8d1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40550 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

