[None][fix] Fix _waiting_requests to use compute tokens with KV cache reuse #12521
lancelly wants to merge 8 commits into NVIDIA:main from
Conversation
📝 Walkthrough
Updated token accounting in
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Inline comments:
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Line 223: The long boolean assignment to `should_waiting` exceeds line-length limits; split the expression across lines using parentheses, and intermediate variables if helpful. Locate the assignment to `should_waiting` (which uses `self.batch_wait_iters_count`, `self.batch_wait_timeout_iters`, `num_scheduled_tokens`, `self.batch_wait_max_tokens_ratio`, and `self.max_num_tokens`) and reformat it so each comparison and the multiplication are on separate lines (or extract `num_scheduled_tokens < ...` into a named variable) to keep each line under 120 characters while preserving the same logic and operands.
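A sketch of the kind of reformatting the comment suggests. The variable names are taken from the comment itself, but the concrete values and the exact shape of the original expression are illustrative assumptions, not the actual test code:

```python
# Illustrative values standing in for the executor's real state.
batch_wait_iters_count = 2
batch_wait_timeout_iters = 4
num_scheduled_tokens = 3000
batch_wait_max_tokens_ratio = 0.5
max_num_tokens = 8192

# Extracting the token comparison into a named variable keeps each line
# well under 120 characters while preserving the operands and logic.
below_token_threshold = (
    num_scheduled_tokens < batch_wait_max_tokens_ratio * max_num_tokens
)
should_waiting = (
    batch_wait_iters_count < batch_wait_timeout_iters
    and below_token_threshold
)
print(should_waiting)  # True with these illustrative values
```

Parenthesized continuation lines (rather than backslashes) are the idiomatic way to break long boolean expressions in Python.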
📒 Files selected for processing (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_py_executor.py
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #40247 [ run ] triggered by Bot. Commit:
… reuse

Cherry-pick from PR NVIDIA#12521. _waiting_requests() was using the full input sequence length (get_tokens(0)), which always exceeded the batch_wait threshold when KV cache reuse is enabled. It now subtracts estimated_reusable_tokens to get the actual compute tokens.

Signed-off-by: Lance Liao <laliao@login-preos02.a51.clusters.nvidia.com>
Made-with: Cursor
PR_Github #40247 [ run ] completed with state
SimengLiu-nv left a comment:
The changes in tensorrt_llm/_torch/pyexecutor/py_executor.py look good to me. Requesting changes in the testing code.
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #40416 [ run ] triggered by Bot. Commit:
Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
PR_Github #40416 [ run ] completed with state

/bot run --disable-fail-fast

/bot run --disable-fail-fast

PR_Github #40446 [ run ] triggered by Bot. Commit:

PR_Github #40447 [ run ] triggered by Bot. Commit:

PR_Github #40447 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40475 [ run ] triggered by Bot. Commit:

PR_Github #40475 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40490 [ run ] triggered by Bot. Commit:

PR_Github #40490 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40496 [ run ] triggered by Bot. Commit:

PR_Github #40496 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #40524 [ run ] triggered by Bot. Commit:

PR_Github #40524 [ run ] completed with state
The mock class calls PyExecutor._waiting_requests with self as the mock instance, but _waiting_requests internally calls self._compute_scheduled_tokens, which is a @staticmethod on PyExecutor. Without exposing it on the mock, Python's Mock auto-generates a Mock-returning attribute, causing numeric comparison failures.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
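The failure mode described in that commit can be reproduced in isolation. The class below is an illustrative stand-in, not the real PyExecutor; only the calling pattern (a plain Mock passed as `self`) mirrors the test setup:

```python
from unittest.mock import Mock

class Executor:
    """Illustrative stand-in for PyExecutor (not the real class)."""

    @staticmethod
    def _compute_scheduled_tokens(prompt_len, reusable_tokens):
        # Actual compute tokens: full prompt minus reusable KV-cache tokens.
        return prompt_len - reusable_tokens

    def _waiting_requests(self, prompt_len, reusable_tokens, threshold):
        # Looked up via self, so a mock instance must supply the staticmethod.
        return self._compute_scheduled_tokens(prompt_len, reusable_tokens) < threshold

mock = Mock()
try:
    # mock._compute_scheduled_tokens is auto-generated and returns a Mock,
    # so the "< threshold" comparison is no longer numeric and fails.
    Executor._waiting_requests(mock, 100, 80, 64)
except TypeError as err:
    print(f"comparison failed: {err}")

# Fix: expose the real staticmethod on the mock instance.
mock._compute_scheduled_tokens = Executor._compute_scheduled_tokens
print(Executor._waiting_requests(mock, 100, 80, 64))  # 20 < 64 -> True
```

Assigning the real staticmethod onto the mock restores numeric behavior while leaving every other attribute mocked.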
/bot run --disable-fail-fast

PR_Github #40540 [ run ] triggered by Bot. Commit:

PR_Github #40540 [ run ] completed with state
Description
`_waiting_requests()` in `py_executor.py` uses `get_tokens(0)` to compute the scheduled context token count for the batch_wait decision. `get_tokens(0)` returns the full input sequence length (e.g., ~34K tokens for agentic workloads), which always exceeds the `batch_wait_max_tokens_ratio * max_num_tokens` threshold. This makes `batch_wait_timeout_iters` a complete no-op when KV cache block reuse is enabled.
Fix: Subtract `estimated_reusable_tokens` (already set by the capacity scheduler's radix tree lookup) to get the actual compute tokens, matching how the micro batch scheduler itself accounts for cache reuse.
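The accounting change can be sketched as follows. The names come from the description above, but the function itself and the concrete numbers (prompt size, threshold) are illustrative assumptions, not the executor's actual code:

```python
def scheduled_context_tokens(input_len, estimated_reusable_tokens, reuse_enabled):
    """Sketch of the corrected accounting: with KV cache reuse, only the
    non-reused suffix of the prompt actually has to be computed."""
    if reuse_enabled:
        return input_len - estimated_reusable_tokens
    return input_len

# Illustrative numbers: a ~34K-token agentic prompt with a large radix-tree
# prefix hit, against a batch_wait threshold of ratio * max_num_tokens.
input_len = 34_000
estimated_reusable_tokens = 33_000   # hypothetical reusable prefix
threshold = 0.5 * 8192               # batch_wait_max_tokens_ratio * max_num_tokens

# Before: get_tokens(0) counted the full prompt, so the check never passed.
print(input_len < threshold)  # False
# After: subtracting reusable tokens compares actual compute work.
print(scheduled_context_tokens(input_len, estimated_reusable_tokens, True) < threshold)  # True
```

With the full prompt length the wait condition is always false, so `batch_wait_timeout_iters` never takes effect; with compute tokens the threshold behaves as intended.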
Before (always False → never waits):