[TRTLLM-7836][feat] Implement dynamic quota resize for KVCacheManager v2 #11503
base: main
Conversation
Signed-off-by: Yao Yao <[email protected]>
/bot run --disable-fail-fast
PR_Github #35878 [ run ] triggered by Bot. Commit:
📝 Walkthrough
Comprehensive refactoring of the KVCacheManagerV2 system from a static to a dynamic, eviction-aware controller. Key changes include: removing BufferSlice from the public API in favor of BufferId; introducing moving-average statistics (Average, MovingAverage); expanding KVCacheManager with lifecycle tracking, shutdown, and ratio-based adjustment logic; refactoring the storage layer to use TypedIndexList typed collections; implementing mmap-based memory allocation; adding eviction-policy iteration support; and updating test infrastructure for per-head token handling.
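The review does not reproduce the implementation of these statistics, so here is a minimal sketch of what an exponentially decayed moving average with bias correction could look like. The field names mirror the `__slots__` quoted later in this review; the update rule itself is an assumption, not the PR's code.

```python
class MovingAverageSketch:
    """Hypothetical illustration only -- not the class added by this PR."""

    __slots__ = ("decay", "avg", "weight", "num_updates")

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.avg = 0.0      # decayed, bias-corrected running average
        self.weight = 0.0   # total weight accumulated so far
        self.num_updates = 0

    def update(self, value: float) -> None:
        # Exponentially weighted average with bias correction: newer samples
        # dominate, and early estimates are not biased toward zero.
        self.weight = self.decay * self.weight + 1.0
        self.avg += (value - self.avg) / self.weight
        self.num_updates += 1

    @property
    def value(self) -> float:
        return self.avg


ma = MovingAverageSketch(decay=0.5)
for sample in (10.0, 20.0, 30.0):
    ma.update(sample)
print(ma.value)  # weighted toward the most recent samples
```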
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant App as Application
    participant KVM as KVCacheManager
    participant SM as StorageManager
    participant EC as EvictionController
    participant PG as PoolGroup
    App->>KVM: need_adjustment()
    KVM->>KVM: Compare current vs target<br/>ratios per level
    alt Adjustment needed
        App->>KVM: adjust()
        KVM->>KVM: _try_update_target_ratios()
        KVM->>KVM: Get living cache stats<br/>(history, capacity)
        KVM->>KVM: _adjust_level(GPU_LEVEL)
        KVM->>EC: page_iterator(pool_group_idx)
        EC-->>KVM: Iterator[EvictablePage]
        KVM->>KVM: _gather_persistent_pages()
        KVM->>SM: shrink_pool_group() or<br/>expand_pool_group()
        SM->>PG: Update pool capacity
        SM->>PG: Allocate/deallocate slots
        PG-->>SM: Confirmation
        SM-->>KVM: Complete
        KVM->>KVM: Repeat for HOST_LEVEL,<br/>DISK_LEVEL as needed
    end
    KVM-->>App: Adjustment complete
```
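As a rough caller-side sketch of the flow in the diagram (method names taken from the diagram; the exact signatures and the integration point are assumptions, not part of this PR page):

```python
# Hypothetical driver illustrating the diagrammed interaction; the real caller
# (scheduler step, background thread, etc.) is not shown in this review.
def maybe_rebalance(kv_cache_manager) -> None:
    # Cheap check comparing current vs. target per-level ratios.
    if kv_cache_manager.need_adjustment():
        # Re-derives target ratios from living-cache statistics, then walks the
        # GPU/HOST/DISK levels, shrinking or expanding each pool group.
        kv_cache_manager.adjust()
```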
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py (1)
752-795: ⚠️ Potential issue | 🟡 Minor
`_ratio_to_slot_count_list`: potential division by zero if remaining ratios sum to zero. Line 774: `pct = ratio_list[pg] / sum(ratio_list[j] for j in pg_idx_lst[i:])` will raise `ZeroDivisionError` if all remaining ratios are 0. While unlikely in practice, callers could pass degenerate ratio lists. Consider adding a guard or an upfront assertion that all ratios are positive.
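A self-contained sketch of the kind of guard suggested here; the helper name and the even-split fallback are illustrative choices, not code from the PR:

```python
def split_ratio(ratio_list: list[float], idx: int) -> float:
    """Share of ratio_list[idx] among the remaining entries ratio_list[idx:],
    guarding the degenerate case where the remaining ratios sum to zero."""
    remaining = sum(ratio_list[idx:])
    if remaining <= 0.0:
        # All remaining ratios are zero: fall back to an even split instead of
        # raising ZeroDivisionError.
        return 1.0 / len(ratio_list[idx:])
    return ratio_list[idx] / remaining


assert split_ratio([0.5, 0.3, 0.2], 1) == 0.3 / 0.5
assert split_ratio([0.5, 0.0, 0.0], 1) == 0.5  # even split across the two remaining entries
```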
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py (1)
355-361: ⚠️ Potential issue | 🟡 Minor
Missing f-prefix on error message strings — `{pg_idx}` and `{goal}` will appear literally. These strings use `{pg_idx}` and `{goal}` but are not f-strings, so the variable values won't be interpolated. While these are pre-existing lines, they're in a method modified by this PR.
Proposed fix
```diff
 raise OutOfPagesError(
-    "Too many held pages are being evicted to the last-level cache for group {pg_idx}"
+    f"Too many held pages are being evicted to the last-level cache for group {pg_idx}"
 )
 if old_free_cnt + evictable_cnt - fallen_held_cnt < goal:
     raise OutOfPagesError(
-        "Impossible to meet the goal ({goal} free slots) for group {pg_idx}"
+        f"Impossible to meet the goal ({goal} free slots) for group {pg_idx}"
     )
```
🤖 Fix all issues with AI agents
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py`:
- Around line 457-470: In ratio_from_length, protect against total == 0 before
computing x / total to avoid ZeroDivisionError: after computing total =
sum(num_bytes) add a guard that if total == 0 return a TypedIndexList where each
pool group gets an even share (e.g., 1.0 / num_pool_groups) or explicit zeros as
appropriate for downstream code; implement this using typed_fill/filled_list or
typed_map to produce the fallback list and otherwise proceed with return
typed_map(num_bytes, lambda x: x / total). Ensure you reference
ratio_from_length, num_bytes, total, num_pool_groups, and typed_map when making
the change.
- Around line 407-425: _gather_persistent_pages incorrectly asserts all living
caches are SUSPENDED which fails when _adjust_level is invoked from resize()
while caches may be ACTIVE; change the behavior in _gather_persistent_pages
(refer to function _gather_persistent_pages and symbol _living_kv_caches) to not
assert kv_cache.status == _KVCache.Status.SUSPENDED—either skip entries where
kv_cache.status != _KVCache.Status.SUSPENDED (continue) or explicitly handle
ACTIVE caches (e.g., suspend them safely before iterating) so the function no
longer raises on resize() calls that run this path.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py`:
- Around line 1-6: This file is missing the required NVIDIA Apache-2.0 copyright
header; add the standard NVIDIA copyright and Apache-2.0 license header (with
the year of latest meaningful modification) at the top of the file before the
class MovingAverage definition so the file contains the required license text
and copyright attribution.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py`:
- Around line 509-520: The destroy() method currently returns early when
allocator._capacity == 0 and therefore never destroys the pools or sets
self._destroyed; change the logic in destroy() (method name: destroy, fields:
_slot_allocator, _pools, _destroyed, allocator._capacity) so that when capacity
== 0 you skip allocator-specific calls
(synchronize/prepare_for_shrink/finish_shrink) but still iterate over
self._pools and call pool.destroy() and then set self._destroyed = True before
returning; ensure the normal path still does allocator._synchronize() and
allocator.prepare_for_shrink/finish_shrink and sets _destroyed at the end.
- Around line 681-691: The ratio_list method can divide by zero when total == 0;
update ratio_list (in _core.py) to check if total is zero before the division
loop and handle that case (for example, return the zero-filled list `ret`
immediately or otherwise avoid the division) so no ZeroDivisionError occurs;
keep the rest of the logic the same and reference the existing symbols:
ratio_list, num_pool_groups, self._pool_groups, total, and ret.
- Around line 395-402: In expand (method expand) add a guard to ensure you don't
discard an in-progress shrink: before resizing and setting
self._target_capacity, assert or raise unless self._target_capacity ==
self._capacity (the same check used in prepare_for_shrink), so expand
refuses/alerts if a shrink is active and thus avoids silently orphaning
self._overflow_slots; locate the symbols expand, _target_capacity, _capacity,
_overflow_slots and prepare_for_shrink and implement the check (or explicit
handling) to preserve or explicitly cancel any active shrink instead of
unconditionally resetting _target_capacity.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py`:
- Line 346: The CDLL load is missing use_errno=True so ctypes.get_errno()
returns stale/zero values; update the _libc initialization (the ctypes.CDLL call
that assigns _libc) to pass use_errno=True so errno is captured for subsequent
calls in _mmap, _munmap, _mremap, and _madvise, ensuring those functions read
the correct error codes via ctypes.get_errno().
In `@tensorrt_llm/runtime/kv_cache_manager_v2/rawref/rawrefmodule.c`:
- Around line 94-103: ReferenceType_hash currently returns self->object_id
directly which can be -1, but tp_hash uses -1 to signal errors; update
ReferenceType_hash to detect when (Py_hash_t)self->object_id == -1 and remap
that value to -2 before returning. Locate the ReferenceType_hash function
operating on ReferenceTypeObject and ensure it still raises a RuntimeError when
self->object_id == 0, but after computing the hash cast to Py_hash_t, replace -1
with -2 to comply with the CPython hashing protocol.
🧹 Nitpick comments (13)
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py (3)
264-265: Variable shadowing: loop variable `Index` shadows the module-level `TypeVar`.
The list comprehension `[generator(Index) for Index in typed_range(count)]` uses `Index` as the loop variable, which shadows the `TypeVar("Index")` defined at line 60. This works at runtime but violates the coding guideline to "avoid shadowing variables declared in an outer scope" and is confusing to readers.
Proposed fix
```diff
 def make_typed(generator: Callable[[Index], T], count: Index) -> TypedIndexList[Index, T]:
-    return cast(TypedIndexList[Index, T], [generator(Index) for Index in typed_range(count)])
+    return cast(TypedIndexList[Index, T], [generator(i) for i in typed_range(count)])
```
As per coding guidelines: "Avoid shadowing variables declared in an outer scope".
374-377: Missing `stacklevel` in `warnings.warn`.
Per Ruff B028, `warnings.warn` without `stacklevel` defaults to 1, which points to this helper function rather than the caller. Use `stacklevel=2` so the warning points to the actual call site.
Proposed fix
```diff
 warnings.warn(
     f"madvise failed with errno {error_code}: {errno.errorcode.get(error_code, 'Unknown error')}"
+    , stacklevel=2
 )
```
406-410: Missing `stacklevel=2` in `_munmap` warning.
Same issue as `_madvise` — the warning will attribute to this function rather than the caller.
Proposed fix
```diff
-warnings.warn(f"munmap failed with errno {error_code}")
+warnings.warn(f"munmap failed with errno {error_code}", stacklevel=2)
```
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py (1)
38-40: `Average.value` will raise `ZeroDivisionError` if called before any `update()`.
Current usage always calls `update()` before accessing `value`, so this is safe in practice. However, a guard would make the class more robust.
Proposed defensive fix
```diff
 @property
 def value(self) -> float:
+    if self.count == 0:
+        return 0.0
     return self.sum / self.count
```
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py (1)
944-944: Address the FIXME: unused `_` parameter in `get_num_matched_tokens`.
The `#@fixme: remove the _ parameter` comment suggests this is a known workaround. The function is called with `matched` as an argument (lines 978, 1001, 1026) but ignores it — it captures `matched` from the enclosing scope via closure instead. Would you like me to open an issue to track cleaning up this parameter?
tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py (1)
1097-1098: Local `GPU_LEVEL` shadows the module-level import.
`GPU_LEVEL` is already imported at module scope (line 52/93). The local redefinition `GPU_LEVEL = CacheLevel(0)` creates the same value but shadows the import, which the coding guidelines recommend avoiding. `HOST_LEVEL` is fine since it's not imported.
Proposed fix
```diff
-GPU_LEVEL = CacheLevel(0)
 HOST_LEVEL = CacheLevel(1)
 # Shrink the gpu quota
```
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (1)
384-395: `zip()` without `strict=True` on ratio comparison.
Line 390 zips two `TypedIndexList[PoolGroupIndex, float]` without `strict=True`. These should always have equal length, so adding `strict=True` would catch bugs early (as Ruff B905 also flags).
Proposed fix
```diff
-return any(not (1 / thres < x / y < thres) for x, y in zip(a, b))
+return any(not (1 / thres < x / y < thres) for x, y in zip(a, b, strict=True))
```
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py (3)
647-650: Unused `__init__` parameters `total_quota` and `ratio_list`.
`CacheLevelStorage.__init__` accepts `total_quota` and `ratio_list` but never uses them. All subclasses pass these values via `super().__init__()`, creating a false impression that the base class stores or validates them. Consider either removing these parameters or adding validation logic here (e.g., asserting `total_quota > 0`, `sum(ratio_list) ≈ 1.0`).
Option A: Remove unused params
```diff
-def __init__(self, total_quota: int, ratio_list: TypedIndexList[PoolGroupIndex, float]) -> None:
+def __init__(self) -> None:
     if not hasattr(self.__class__, "TIER"):
         raise ValueError(f"{self.__class__.__name__} must define 'TIER' as a class variable")
```
This would require updating all `super().__init__(total_quota, init_ratio)` calls in subclasses.
429-451: `finish_shrink` return type is misleading — it returns `True` or raises, never `False`.
The `-> bool` return type suggests the caller can check success/failure, but the method always returns `True` on success and raises `RuntimeError` on failure. This can mislead callers into writing `if not allocator.finish_shrink(): ...`, which is dead code. Consider changing the return type to `-> None` (and removing `return True`), or to `-> Literal[True]`, to accurately signal the contract.
Proposed fix (return None)
```diff
-def finish_shrink(self) -> bool:
+def finish_shrink(self) -> None:
     assert NDEBUG or self._check()
     if (
         self.shrink_in_progress
         and self._target_capacity + len(self._overflow_slots) == self._num_active_slots
     ):
         ...
         self._scrub_events()
         assert NDEBUG or self._check()
-        return True
+        return
     raise RuntimeError("shrink can't be finished")
```
489-493: Potential infinite loop in `_synchronize`.
`_synchronize` spins until all recycled slots are ready, but `_scrub_events` only advances `_num_ready_recycled_slots` by scanning forward from the current ready frontier and stopping at the first non-ready event. If a slot in the middle has a long-running or stuck CUDA event, this will busy-wait indefinitely with no timeout and no backoff. This may be intentional for a synchronization primitive, but consider adding a safety timeout or at minimum a `time.sleep` to avoid CPU burn; a bounded-wait sketch is shown below.
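A minimal illustration of the bounded wait suggested above; the helper name, timeout value, and poll interval are assumptions, not code from this PR:

```python
import time


def wait_until(condition, timeout_s: float = 30.0, poll_interval_s: float = 0.001) -> None:
    """Spin on `condition()` with a sleep between polls and a hard timeout.

    Hypothetical helper; the real _synchronize loop would call its own
    readiness check instead of a generic callable.
    """
    deadline = time.monotonic() + timeout_s
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("timed out waiting for recycled slots to become ready")
        time.sleep(poll_interval_s)  # yield the CPU instead of busy-waiting


# Trivial usage example:
wait_until(lambda: True)
```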
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py (3)
253-266: Exception handler in `new_slots_for_pool_group` has no cleanup — unlike `new_slots`.
In `new_slots` (line 244), the except block releases all partially allocated slots. Here (line 264), the except block only warns and re-raises. Since `allocate_multiple` is all-or-none and the assertion on line 261 should guarantee success, this is likely safe. But if the intent is a safety net, it should match the pattern in `new_slots`. Also, per Ruff B028, `warnings.warn` should include `stacklevel=2` so the warning points to the caller.
Proposed fix
```diff
 try:
     return storage.allocate_multiple(pg_idx, num_slots)
 except Exception:
-    warnings.warn("Exception not expected here. Please report a bug.")
+    warnings.warn("Exception not expected here. Please report a bug.", stacklevel=2)
     raise
```
573-621: `shrink_pool_group` — complex logic with subtle eviction/defrag interplay; a few observations.
Lines 600-603: The while-loop condition `len(overflow_slots) + num_overflow_persistent > min(new_num_slots, overflow_slots[0][0] + allocator.num_free_slots)` relies on the eviction controller's iterator index (`overflow_slots[0][0]`) corresponding to the eviction order used by `force_evict`. If these orderings diverge, the computed `min_num_evicted` will be wrong, potentially leaving the shrink unable to complete. This coupling is fragile and deserves a comment or assertion.
Lines 616-620: The assertion `len(allocator._overflow_slots) == allocator._num_active_slots - allocator._target_capacity` directly accesses allocator internals. Consider exposing a method on `SlotAllocator` (e.g., `is_ready_to_finish_shrink()`) to encapsulate this invariant check.
Suggestion: Add a clarifying comment for the eviction loop
```diff
 allocator.prepare_for_shrink(new_num_slots)
 min_num_evicted = 0
+# Evict pages from the front of the eviction order until all remaining
+# overflow pages (evictable + persistent) can be relocated into [0, new_num_slots).
 while overflow_slots and len(overflow_slots) + num_overflow_persistent > min(
     new_num_slots, overflow_slots[0][0] + allocator.num_free_slots
 ):
     min_num_evicted = overflow_slots.popleft()[0] + 1
```
631-663: `adjust_cache_level` shrinks before expanding — good memory hygiene.
Shrinking first frees resources that may be needed for expanding other pool groups. The round-up of `new_quota` on line 645 ensures alignment with pool size granularity. One note: if `shrink_pool_group` fails (raises) for one pool group, subsequent groups won't be processed, leaving the cache level in a partially resized state. The caller should be aware of this non-atomic behavior. A brief docstring note would help.
Docstring enhancement
```diff
 def adjust_cache_level(
     self,
     level: CacheLevel,
     new_quota: int | None,
     new_ratio_list: TypedIndexList[PoolGroupIndex, float],
     persistent_pages: TypedIndexList[PoolGroupIndex, list[Page]] | None = None,
 ) -> None:
-    """Adapt the cache level by adjusting the ratio list. Persistent pages are those held and not evictable."""
+    """Adapt the cache level by adjusting the ratio list. Persistent pages are those held and not evictable.
+
+    Note: This operation is not atomic. If a shrink or expand fails for one pool group,
+    the cache level may be left in a partially resized state.
+    """
```
```python
def _gather_persistent_pages(self) -> TypedIndexList[PoolGroupIndex, list[Page]]:
    last_level = self._storage.num_cache_levels - 1
    lc2pg = self._storage._life_cycle_grouping
    ret = make_typed(lambda _: list[Page](), self._storage.num_pool_groups)
    for r in self._living_kv_caches:
        kv_cache = unwrap_rawref(r)
        assert kv_cache.status == _KVCache.Status.SUSPENDED
        for block in kv_cache._blocks:
            for beam in block.pages:
                for lc, holder in typed_enumerate(beam):
                    if holder is None:
                        continue
                    assert type(holder) is _PageHolder
                    page = holder.page
                    assert page.status == PageStatus.HELD
                    assert page.scheduled_for_eviction == (page.cache_level != last_level)
                    if not page.scheduled_for_eviction:
                        ret[lc2pg[lc]].append(holder.page)
    return ret
```
_gather_persistent_pages asserts all living caches are SUSPENDED.
This is correct for the adjust() path (line 438-439 also asserts this), but _adjust_level is also called from resize() (line 242) where caches may be ACTIVE. If level == num_cache_levels - 1 in a resize() call, _gather_persistent_pages would fire and could hit the assertion on line 413. A sketch of the skip-based alternative follows the verification script below.
```bash
#!/bin/bash
# Verify whether resize() can be called when caches are active and whether
# the last-level path in _adjust_level is reachable from resize()
rg -n '_adjust_level\|_gather_persistent_pages\|def resize' --type=py -g '*_kv_cache_manager*'
```
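In the same style as the other proposed fixes, a sketch of the skip-based option mentioned above; whether skipping or suspending ACTIVE caches is the right behavior is a design decision for the authors:

```diff
 for r in self._living_kv_caches:
     kv_cache = unwrap_rawref(r)
-    assert kv_cache.status == _KVCache.Status.SUSPENDED
+    if kv_cache.status != _KVCache.Status.SUSPENDED:
+        # resize() can reach this path while caches are still ACTIVE; skip them
+        # so only suspended caches contribute persistent pages.
+        continue
     for block in kv_cache._blocks:
```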
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py` around
lines 407 - 425, _gather_persistent_pages incorrectly asserts all living caches
are SUSPENDED which fails when _adjust_level is invoked from resize() while
caches may be ACTIVE; change the behavior in _gather_persistent_pages (refer to
function _gather_persistent_pages and symbol _living_kv_caches) to not assert
kv_cache.status == _KVCache.Status.SUSPENDED—either skip entries where
kv_cache.status != _KVCache.Status.SUSPENDED (continue) or explicitly handle
ACTIVE caches (e.g., suspend them safely before iterating) so the function no
longer raises on resize() calls that run this path.
```python
def ratio_from_length(
    history_length: int, capacity: int
) -> TypedIndexList[PoolGroupIndex, float]:
    num_blocks = div_up(capacity, tokens_per_blocks)
    num_bytes = filled_list(0.0, num_pool_groups)
    for lc_idx, lc in typed_enumerate(life_cycles):
        stale_beg, stale_end = _KVCache._get_stale_range(
            tokens_per_blocks, history_length, lc
        )
        pg_idx = lc2pg[lc_idx]
        slot_size = storage.slot_size(pg_idx)
        num_bytes[pg_idx] += (num_blocks - (stale_end - stale_beg)) * sum(slot_size)
    total = sum(num_bytes)
    return typed_map(num_bytes, lambda x: x / total)
```
Potential ZeroDivisionError in ratio_from_length when total == 0.
On line 470, x / total will raise ZeroDivisionError if all pool groups end up with zero effective bytes (e.g., all blocks are stale). While this is unlikely in practice for avg_capacity > 0, it's worth adding a guard.
Proposed guard
```diff
 total = sum(num_bytes)
+if total == 0:
+    return filled_list(1.0 / num_pool_groups, num_pool_groups)
 return typed_map(num_bytes, lambda x: x / total)
```
```python
class MovingAverage:
    __slots__ = ("decay", "avg", "weight", "num_updates")
    decay: float
    avg: float
    weight: float
    num_updates: int
```
Missing NVIDIA copyright header.
This is a new file and must include the NVIDIA Apache 2.0 copyright header. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
Proposed fix
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 class MovingAverage:
```
🧰 Tools
🪛 Ruff (0.15.0)
[warning] 2-2: MovingAverage.__slots__ is not sorted
Apply a natural sort to MovingAverage.__slots__
(RUF023)
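For reference, the naturally sorted form Ruff (RUF023) is asking for would be:

```python
__slots__ = ("avg", "decay", "num_updates", "weight")
```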
```diff
 def expand(self, new_num_slots: int) -> None:
     assert NDEBUG or self._check()
-    old_num_slots = self.num_slots
-    if new_num_slots < self.num_slots and self._occupied_mask.any_set(
-        new_num_slots, self.num_slots
-    ):
-        raise ResourceBusyError("resize cannot remove occupied slots")
+    old_num_slots = self._capacity
+    assert new_num_slots > old_num_slots
     self._occupied_mask.resize(new_num_slots)
     self._capacity = new_num_slots
-    self._num_active_slots = min(self._num_active_slots, new_num_slots)
-    if new_num_slots < old_num_slots:
-        new_recycled_slots = deque[Slot]()
-        new_num_ready_recycled_slots = 0
-        for idx_recycled, slot in enumerate(self._recycled_slots):
-            assert type(slot) is Slot and slot.has_valid_slot
-            if slot.slot_id >= new_num_slots:
-                slot.ready_event.synchronize()
-                slot._slot_id = None
-                slot.ready_event = CachedCudaEvent.NULL
-            else:
-                new_recycled_slots.append(slot)
-                if idx_recycled < self._num_ready_recycled_slots:
-                    new_num_ready_recycled_slots += 1
-        self._recycled_slots = new_recycled_slots
-        self._num_ready_recycled_slots = new_num_ready_recycled_slots
-        self._scrub_events()
+    self._target_capacity = self._capacity
     assert NDEBUG or self._check()
```
expand resets _target_capacity — verify no in-progress shrink is active.
expand unconditionally sets _target_capacity = _capacity on line 401, but there's no assertion that _target_capacity == _capacity before expanding (unlike prepare_for_shrink which asserts it on line 406). If expand is called while a shrink is in progress, it would silently discard the shrink state and orphan any _overflow_slots.
Proposed fix: add a guard
```diff
 def expand(self, new_num_slots: int) -> None:
     assert NDEBUG or self._check()
     old_num_slots = self._capacity
     assert new_num_slots > old_num_slots
+    assert not self.shrink_in_progress, "Cannot expand while a shrink is in progress"
     self._occupied_mask.resize(new_num_slots)
     self._capacity = new_num_slots
     self._target_capacity = self._capacity
     assert NDEBUG or self._check()
```
```diff
 def destroy(self) -> None:
-    if self._slot_allocator._capacity == 0:
+    if self._destroyed:
         return
+    allocator = self._slot_allocator
+    if allocator._capacity == 0:
+        return
-    self._slot_allocator._synchronize()
+    allocator._synchronize()
     for pool in self._pools:
         pool.destroy()
-    self._slot_allocator.resize(0)
+    allocator.prepare_for_shrink(0)
+    allocator.finish_shrink()
+    self._destroyed = True
```
destroy() skips pool cleanup when _capacity == 0.
Line 513-514: if allocator._capacity == 0, destroy() returns before destroying pools or marking _destroyed = True. On a subsequent call, _destroyed is still False, so it enters again and hits the same early return — pools are never destroyed and _destroyed is never set. This can leak host/disk resources in edge cases where a PoolGroupBase was constructed with num_slots=0.
Proposed fix
```diff
 def destroy(self) -> None:
     if self._destroyed:
         return
     allocator = self._slot_allocator
     if allocator._capacity == 0:
+        for pool in self._pools:
+            pool.destroy()
+        self._destroyed = True
         return
     allocator._synchronize()
     for pool in self._pools:
         pool.destroy()
     allocator.prepare_for_shrink(0)
     allocator.finish_shrink()
     self._destroyed = True
```
```python
def ratio_list(self) -> TypedIndexList[PoolGroupIndex, float]:
    num_pool_groups = self.num_pool_groups
    ret = filled_list(0.0, num_pool_groups)
    total = 0
    for i, pg in typed_enumerate(self._pool_groups):
        size = pg.num_bytes
        total += size
        ret[i] = size
    for i in typed_range(num_pool_groups):
        ret[i] /= total
    return ret
```
Division by zero when all pools have zero bytes.
If total is 0 (e.g., after a destroy() or when all pool groups are empty), ret[i] /= total on line 690 will raise ZeroDivisionError.
Proposed fix
```diff
 for i, pg in typed_enumerate(self._pool_groups):
     size = pg.num_bytes
     total += size
     ret[i] = size
-for i in typed_range(num_pool_groups):
-    ret[i] /= total
+if total > 0:
+    for i in typed_range(num_pool_groups):
+        ret[i] /= total
+else:
+    # Equal ratios when no bytes are allocated
+    uniform = 1.0 / int(num_pool_groups) if int(num_pool_groups) > 0 else 0.0
+    for i in typed_range(num_pool_groups):
+        ret[i] = uniform
 return ret
```
```python
MREMAP_MAYMOVE: Final[int] = 1
MAP_FAILED: Final[int] = -1


_libc = ctypes.CDLL(find_library("c"))
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's examine the file and find the context around line 346
head -400 tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py | tail -100
```
🏁 Script executed:
```bash
# Search for all uses of ctypes.get_errno() in the file
rg "ctypes\.get_errno|get_errno" tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py -n
```
🏁 Script executed:
```bash
# Check the entire file structure to understand the functions mentioned
rg "_mmap|_munmap|_mremap|_madvise" tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py -n -A 5
```
🌐 Web query:
Python ctypes CDLL use_errno parameter get_errno behavior
💡 Result:
`ctypes.CDLL(..., use_errno=True)` (and similarly `CFUNCTYPE(..., use_errno=True)`) does not make `ctypes.get_errno()` read the process's live C `errno` directly. Instead, it enables an errno-swapping mechanism around that specific ctypes call:
- ctypes keeps a thread-local "private copy" of `errno`.
- For a foreign function invoked via ctypes that was created with `use_errno=True`:
  - Before the call: the OS/C `errno` is swapped with ctypes' private copy.
  - After the call: they're swapped again, so ctypes' private copy now holds the `errno` value produced by the C call. `ctypes.get_errno()` then returns that private copy. [1]
So the practical behavior is:
- If you call a ctypes function without `use_errno=True`, ctypes will typically not capture/update its private errno copy for that call, and `ctypes.get_errno()` may show an old value. [1]
- `ctypes.get_errno()` will not reflect `errno` set by non-ctypes operations (e.g., Python's `os.open()`), because it reads ctypes' private copy, not the live `errno`. [1][2]
- Setting `use_errno=True` on the library (CDLL) tends to apply as the default for functions you access as attributes from that library object (and adds overhead to each call), whereas you can also enable it only for specific callables via function prototypes. [1][2]
Sources: Python ctypes docs on use_errno, get_errno(), and the swap mechanism [1]; explanation/usage notes illustrating that get_errno() reads ctypes' private copy (and needs use_errno=True on the ctypes call) [2].
Citations:
- 1: https://docs.python.org/3/library/ctypes.html?utm_source=openai
- 2: https://stackoverflow.com/questions/78966115/how-to-correctly-use-ctypes-get-errno?utm_source=openai
Critical bug: ctypes.CDLL missing use_errno=True — ctypes.get_errno() will always return stale/zero values.
Without use_errno=True, ctypes does not capture errno values from C function calls. Since the code calls ctypes.get_errno() in _mmap, _munmap, _mremap, and _madvise to retrieve error codes, all error diagnostics will be unreliable and display incorrect values.
Proposed fix
```diff
-_libc = ctypes.CDLL(find_library("c"))
+_libc = ctypes.CDLL(find_library("c"), use_errno=True)
```
```c
/* ReferenceType.__hash__ */
static Py_hash_t ReferenceType_hash(ReferenceTypeObject* self)
{
    if (self->object_id == 0)
    {
        PyErr_SetString(PyExc_RuntimeError, "Reference is invalid");
        return -1;
    }
    return (Py_hash_t) self->object_id;
}
```
🧩 Analysis chain
🌐 Web query:
CPython tp_hash protocol -1 return value error convention
💡 Result:
In CPython, a type’s tp_hash slot has the same contract as PyObject_Hash():
- `-1` is reserved to mean "error". A `tp_hash` implementation must not return `-1` as a normal hash value; if hashing fails, it must set an exception and return `-1`. [1]
- Correspondingly, `PyObject_Hash(o)` returns `-1` on failure (with an exception set), and `PyObject_HashNotImplemented(o)` sets `TypeError` ("unhashable type") and returns `-1` (and is specially recognized when stored in `tp_hash`). [2]
Because -1 is the error sentinel, CPython ensures a “real” hash is never -1: if the computed hash would be -1, CPython remaps it (commonly to -2) in core hash implementations (e.g., the integer hash code path shown here). [3]
Practical rule when writing tp_hash: if your computed hash equals -1, return -2 (or any other non--1 value) instead; reserve -1 strictly for signaling an exception. [1][3]
🌐 Web query:
CPython long_hash unicode_hash -1 handling implementation
💡 Result:
In CPython, Py_hash_t returning -1 is reserved to mean “error occurred” in the tp_hash / PyObject_Hash() protocol, so CPython must ensure no successful hash computation ever returns -1.
long_hash (hash of int)
Objects/longobject.c:long_hash() computes an integer hash and then rewrites -1 to -2 before returning: if (x == (Py_uhash_t)-1) x = (Py_uhash_t)-2;. This is why hash(-1) == -2 on CPython. [1]
unicode_hash (hash of str)
Objects/unicodeobject.c:unicode_hash() uses the cached per-string hash value; the cache sentinel for “not yet computed” is -1 (if (hash != -1) return hash;). When the hash is computed, it calls the core bytes/string hashing routine (Py_HashBuffer / _Py_HashBytes) and stores the result. [2]
Where the “never return -1” rule is enforced for string/bytes hashing
The core helper _Py_HashBytes() explicitly ensures a computed hash value of -1 is converted to -2 (it returns -2 in that case). [3]
If a Python-level __hash__ returns -1
Even for user-defined classes, the slot wrapper in Objects/typeobject.c applies the same rule: “-1 is reserved for errors” and is rewritten to -2. [4]
Sources:
[1] Objects/longobject.c (long_hash) showing -1 → -2 remap. (chromium.googlesource.com)
[2] Objects/unicodeobject.c (unicode_hash) showing cached-hash sentinel -1 and computing via Py_HashBuffer / _Py_HashBytes. (chromium.googlesource.com)
[3] Python/pyhash.c (_Py_HashBytes) showing -1 → -2 handling. (sources.debian.org)
[4] Objects/typeobject.c (slot_tp_hash) rewriting h == -1 to -2. (chromium.googlesource.com)
Citations:
- 1: https://chromium.googlesource.com/external/github.com/python/cpython/%2B/refs/tags/v3.7.17/Objects/longobject.c
- 2: https://chromium.googlesource.com/external/github.com/python/cpython/%2B/refs/heads/main/Objects/unicodeobject.c?utm_source=openai
- 3: https://sources.debian.org/src/python3.7/3.7.3-2%2Bdeb10u3/Python/pyhash.c/?utm_source=openai
- 4: https://chromium.googlesource.com/external/github.com/python/cpython/%2B/refs/tags/v3.10.0a2/Objects/typeobject.c?utm_source=openai
🏁 Script executed:
```bash
# Search for real examples of how CPython hash functions handle -1
# Let's check if there are any relevant Python documentation or examples
python3 <<'EOF'
import sys
print(f"Python version: {sys.version}")

# Test if -1 is special in hash
try:
    # Create a custom object with hash returning -1
    class BadHash:
        def __hash__(self):
            return -1

    obj = BadHash()
    try:
        h = hash(obj)
        print(f"hash() with -1 return: {h}")
    except Exception as e:
        print(f"hash() with -1 return raises: {type(e).__name__}: {e}")
except Exception as e:
    print(f"Error in test: {e}")

# Check if -1 is actually reserved
print(f"\nChecking -1 as hash value:")
d = {}
try:
    d[-1] = "test"
    print(f"Can use -1 as dict key: yes")
    print(f"Value: {d[-1]}")
except Exception as e:
    print(f"Cannot use -1 as dict key: {e}")
EOF
```
Hash function doesn't handle the reserved −1 return value.
In CPython, tp_hash returning -1 is reserved to signal an error. If object_id equals -1 (a valid pointer cast in some scenarios), this function will violate the CPython protocol — Python will interpret the return as an error, causing silent failures. CPython's hash implementations (long_hash, unicode_hash, _Py_HashBytes) handle this by remapping -1 to -2.
Proposed fix
```diff
 static Py_hash_t ReferenceType_hash(ReferenceTypeObject* self)
 {
     if (self->object_id == 0)
     {
         PyErr_SetString(PyExc_RuntimeError, "Reference is invalid");
         return -1;
     }
-    return (Py_hash_t) self->object_id;
+    Py_hash_t h = (Py_hash_t) self->object_id;
+    if (h == -1)
+        h = -2;
+    return h;
 }
```
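The same convention is easy to confirm from Python itself; this is known CPython behavior, shown here only to motivate the remap:

```python
# CPython remaps a computed hash of -1 to -2, because -1 is the error sentinel
# in the tp_hash protocol.
assert hash(-1) == -2
assert hash(-2) == -2  # so -1 and -2 collide by design, which is harmless


class FromDunder:
    def __hash__(self):
        return -1  # the slot wrapper rewrites this to -2 as well


assert hash(FromDunder()) == -2
```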
PR_Github #35878 [ run ] completed with state
With this change, users can resize the storage quota for every cache level of the Python KVCacheManager. The change also fixes some pre-existing bugs uncovered during development.
Summary by CodeRabbit
Release Notes
New Features
Refactor
Tests