[None][fix] Fix _waiting_requests to use compute tokens with KV cache reuse #12521
lancelly wants to merge 8 commits into NVIDIA:main from
Conversation
📝 Walkthrough
Updated token accounting in
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Inline comments:
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Line 223: The long boolean assignment to `should_waiting` exceeds line-length limits; split the expression across lines using parentheses, and intermediate variables if helpful. Locate the assignment to `should_waiting` (which uses `self.batch_wait_iters_count`, `self.batch_wait_timeout_iters`, `num_scheduled_tokens`, `self.batch_wait_max_tokens_ratio`, and `self.max_num_tokens`) and reformat it so each comparison and the multiplication are on separate lines (or extract `num_scheduled_tokens < ...` into a named variable) to keep each line under 120 characters while preserving the same logic and operands.
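A sketch of the kind of reformatting the comment suggests. The variable names are taken from the comment itself, but the concrete values and the exact shape of the original expression are illustrative assumptions, not the actual test code:

```python
# Illustrative values standing in for the executor's real state.
batch_wait_iters_count = 2
batch_wait_timeout_iters = 4
num_scheduled_tokens = 3000
batch_wait_max_tokens_ratio = 0.5
max_num_tokens = 8192

# Extracting the token comparison into a named variable keeps each line
# well under 120 characters while preserving the operands and logic.
below_token_threshold = (
    num_scheduled_tokens < batch_wait_max_tokens_ratio * max_num_tokens
)
should_waiting = (
    batch_wait_iters_count < batch_wait_timeout_iters
    and below_token_threshold
)
print(should_waiting)  # True with these illustrative values
```

Parenthesized continuation lines (rather than backslashes) are the idiomatic way to break long boolean expressions in Python.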
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_py_executor.py
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #40247 [ run ] triggered by Bot. Commit:
… reuse

Cherry-pick from PR NVIDIA#12521. _waiting_requests() was using the full input sequence length (get_tokens(0)), which always exceeded the batch_wait threshold when KV cache reuse is enabled. It now subtracts estimated_reusable_tokens to get the actual compute tokens.

Signed-off-by: Lance Liao <laliao@login-preos02.a51.clusters.nvidia.com>
Made-with: Cursor
PR_Github #40247 [ run ] completed with state
SimengLiu-nv left a comment:
The changes in tensorrt_llm/_torch/pyexecutor/py_executor.py look good to me. Requesting changes in the testing code.
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #40416 [ run ] triggered by Bot. Commit:
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
PR_Github #40416 [ run ] completed with state

/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #40446 [ run ] triggered by Bot. Commit:

PR_Github #40447 [ run ] triggered by Bot. Commit:

PR_Github #40447 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40475 [ run ] triggered by Bot. Commit:

PR_Github #40475 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40490 [ run ] triggered by Bot. Commit:

PR_Github #40490 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40496 [ run ] triggered by Bot. Commit:

PR_Github #40496 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40524 [ run ] triggered by Bot. Commit:

PR_Github #40524 [ run ] completed with state
The mock class calls PyExecutor._waiting_requests with self as the mock instance, but _waiting_requests internally calls self._compute_scheduled_tokens, which is a @staticmethod on PyExecutor. Without exposing it on the mock, Python's Mock auto-generates a Mock-returning attribute, causing numeric comparison failures.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
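The failure mode described in that commit can be reproduced in isolation. The class below is an illustrative stand-in, not the real PyExecutor; only the calling pattern (a plain Mock passed as `self`) mirrors the test setup:

```python
from unittest.mock import Mock

class Executor:
    """Illustrative stand-in for PyExecutor (not the real class)."""

    @staticmethod
    def _compute_scheduled_tokens(prompt_len, reusable_tokens):
        # Actual compute tokens: full prompt minus reusable KV-cache tokens.
        return prompt_len - reusable_tokens

    def _waiting_requests(self, prompt_len, reusable_tokens, threshold):
        # Looked up via self, so a mock instance must supply the staticmethod.
        return self._compute_scheduled_tokens(prompt_len, reusable_tokens) < threshold

mock = Mock()
try:
    # mock._compute_scheduled_tokens is auto-generated and returns a Mock,
    # so the "< threshold" comparison is no longer numeric and fails.
    Executor._waiting_requests(mock, 100, 80, 64)
except TypeError as err:
    print(f"comparison failed: {err}")

# Fix: expose the real staticmethod on the mock instance.
mock._compute_scheduled_tokens = Executor._compute_scheduled_tokens
print(Executor._waiting_requests(mock, 100, 80, 64))  # 20 < 64 -> True
```

Assigning the real staticmethod onto the mock restores numeric behavior while leaving every other attribute mocked.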
/bot run --disable-fail-fast

PR_Github #40540 [ run ] triggered by Bot. Commit:

PR_Github #40540 [ run ] completed with state
Description
`_waiting_requests()` in `py_executor.py` uses `get_tokens(0)` to compute the scheduled context token count for the batch_wait decision. `get_tokens(0)` returns the full input sequence length (e.g., ~34K tokens for agentic workloads), which always exceeds the `batch_wait_max_tokens_ratio * max_num_tokens` threshold. This makes `batch_wait_timeout_iters` a complete no-op when KV cache block reuse is enabled.
Fix: Subtract `estimated_reusable_tokens` (already set by the capacity scheduler's radix tree lookup) to get the actual compute tokens, matching how the micro batch scheduler itself accounts for cache reuse.
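The accounting change can be sketched as follows. The names come from the description above, but the function itself and the concrete numbers (prompt size, threshold) are illustrative assumptions, not the executor's actual code:

```python
def scheduled_context_tokens(input_len, estimated_reusable_tokens, reuse_enabled):
    """Sketch of the corrected accounting: with KV cache reuse, only the
    non-reused suffix of the prompt actually has to be computed."""
    if reuse_enabled:
        return input_len - estimated_reusable_tokens
    return input_len

# Illustrative numbers: a ~34K-token agentic prompt with a large radix-tree
# prefix hit, against a batch_wait threshold of ratio * max_num_tokens.
input_len = 34_000
estimated_reusable_tokens = 33_000   # hypothetical reusable prefix
threshold = 0.5 * 8192               # batch_wait_max_tokens_ratio * max_num_tokens

# Before: get_tokens(0) counted the full prompt, so the check never passed.
print(input_len < threshold)  # False
# After: subtracting reusable tokens compares actual compute work.
print(scheduled_context_tokens(input_len, estimated_reusable_tokens, True) < threshold)  # True
```

With the full prompt length the wait condition is always false, so `batch_wait_timeout_iters` never takes effect; with compute tokens the threshold behaves as intended.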
Before (always False → never waits):