Commit 99c68da
Lance Liao
[None][fix] Fix _waiting_requests to use compute tokens with KV cache reuse
Cherry-pick from PR NVIDIA#12521. _waiting_requests() was using full input
sequence length (get_tokens(0)) which always exceeded the batch_wait
threshold when KV cache reuse is enabled. Now subtracts
estimated_reusable_tokens to get actual compute tokens.
Signed-off-by: Lance Liao <laliao@login-preos02.a51.clusters.nvidia.com>
Made-with: Cursor
1 parent da0c004 commit 99c68da
1 file changed, +5 -2 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 2730 | 2730 | |
| 2731 | 2731 | |
| 2732 | 2732 | |
| 2733 | | - |
| 2734 | | - |
| | 2733 | + |
| | 2734 | + |
| | 2735 | + |
| | 2736 | + |
| | 2737 | + |
| 2735 | 2738 | |
| 2736 | 2739 | |
| 2737 | 2740 | |
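
The changed lines themselves are not shown in the table above, so the following is only a minimal Python sketch of the idea stated in the commit message: count compute tokens as the full input length minus `estimated_reusable_tokens`. `SketchRequest`, `full_tokens`, `compute_tokens`, and the threshold value are hypothetical names chosen for illustration. Only `get_tokens(0)` and `estimated_reusable_tokens` come from the commit message, and their types and signatures are assumed here rather than taken from the real source file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a context request; not the real class."""
    tokens: List[int]
    # Assumed to be an integer count of prompt tokens already in the KV cache.
    estimated_reusable_tokens: int = 0

    def get_tokens(self, beam: int) -> List[int]:
        # Assumed to return the full input token sequence for the given beam.
        return self.tokens


def full_tokens(requests: List[SketchRequest]) -> int:
    """Old-style count: full input sequence length of every request."""
    return sum(len(r.get_tokens(0)) for r in requests)


def compute_tokens(requests: List[SketchRequest]) -> int:
    """New-style count: only tokens that still need to be computed."""
    return sum(
        max(0, len(r.get_tokens(0)) - r.estimated_reusable_tokens)
        for r in requests
    )


# One request with a 1000-token prompt, 900 of which are reusable from cache.
reqs = [SketchRequest(tokens=list(range(1000)), estimated_reusable_tokens=900)]
threshold = 512  # hypothetical batch_wait token threshold
```

With these illustrative numbers, `full_tokens(reqs)` is 1000 and exceeds the threshold, while `compute_tokens(reqs)` is 100 and stays below it, which is the mismatch the commit message describes. The `max(0, ...)` clamp is a defensive choice of this sketch, not something the commit states.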