Commit 99c68da

Author: Lance Liao
[None][fix] Fix _waiting_requests to use compute tokens with KV cache reuse

Cherry-pick from PR NVIDIA#12521. _waiting_requests() was using the full input sequence length (get_tokens(0)), which always exceeded the batch_wait threshold when KV cache reuse is enabled. It now subtracts estimated_reusable_tokens to get the actual number of compute tokens.

Signed-off-by: Lance Liao <laliao@login-preos02.a51.clusters.nvidia.com>
Made-with: Cursor
Parent: da0c004 · Commit: 99c68da

File tree

1 file changed (+5, -2 lines)

tensorrt_llm/_torch/pyexecutor/py_executor.py

Lines changed: 5 additions & 2 deletions
@@ -2730,8 +2730,11 @@ def _waiting_requests(self, context_requests: list[LlmRequest],
         - The number of waiting iterations is smaller than `self.batch_wait_timeout_iters`.
         """
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
+        num_scheduled_ctx_tokens = 0
+        for ctx_req in context_requests:
+            req_tokens = len(ctx_req.get_tokens(0))
+            reusable = ctx_req.estimated_reusable_tokens if ctx_req.is_first_context_chunk else 0
+            num_scheduled_ctx_tokens += max(1, req_tokens - reusable)
         num_scheduled_gen_tokens = sum(1 + gen_req.num_draft_tokens
                                        for gen_req in generation_requests)
         num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
