
Conversation

lowsfer (Member) commented Feb 13, 2026

With this change, users can resize the storage quota for every cache level of the Python KVCacheManager. It also fixes some pre-existing bugs uncovered during development.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced KV cache management with dynamic quota adjustment based on historical usage patterns
    • Improved resource lifecycle tracking and graceful shutdown mechanism
    • Optimized page aggregation and memory allocation using memory-mapped I/O
  • Refactor

    • Streamlined internal cache manager architecture for better performance and resource efficiency
    • Reorganized eviction policy iteration for improved memory management
  • Tests

    • Expanded test coverage for multi-head configurations and quota resizing scenarios

lowsfer self-assigned this on Feb 13, 2026
lowsfer requested review from eopXD and yizhang-nv on February 13, 2026 at 05:54
lowsfer (Member, Author) commented Feb 13, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #35878 [ run ] triggered by Bot. Commit: 50c5697


coderabbitai bot commented Feb 13, 2026

📝 Walkthrough

Comprehensive refactoring of KVCacheManagerV2 system from static to dynamic eviction-aware controller. Key changes include: removing BufferSlice from public API in favor of BufferId; introducing moving average statistics (Average, MovingAverage); expanding KVCacheManager with lifecycle tracking, shutdown, and ratio-based adjustment logic; refactoring storage layer to use TypedIndexList typed collections; implementing mmap-based memory allocation; adding eviction policy iteration support; and updating test infrastructure for per-head token handling.
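
For orientation, below is a minimal sketch of what the two statistics helpers could look like. This is not the PR's actual _moving_average.py; the attribute names (decay, avg, weight, num_updates; sum, count, value) appear in snippets quoted later in this review, while the update logic shown here is only an assumption.

class MovingAverage:
    """Decay-based running average: older samples fade geometrically (sketch)."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.avg = 0.0          # decayed weighted sum of samples
        self.weight = 0.0       # decayed total weight
        self.num_updates = 0

    def update(self, x: float) -> None:
        self.avg = self.avg * self.decay + x
        self.weight = self.weight * self.decay + 1.0
        self.num_updates += 1

    @property
    def value(self) -> float:
        return self.avg / self.weight if self.weight else 0.0


class Average:
    """Simple accumulator: arithmetic mean of all samples seen so far (sketch)."""

    def __init__(self) -> None:
        self.sum = 0.0
        self.count = 0

    def update(self, x: float) -> None:
        self.sum += x
        self.count += 1

    @property
    def value(self) -> float:
        return self.sum / self.count  # callers must update() at least once first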

Changes

Cohort / File(s) Summary
BufferSlice Removal
tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py, tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi, tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py
Removed BufferSlice from public exports across module hierarchy; AggregatedPageDesc.buffers changed from Sequence[BufferSlice] to Sequence[BufferId].
Moving Averages & Statistics
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py
New public utility classes: MovingAverage (decay-based running average) and Average (simple accumulator) for statistics tracking.
KVCacheManager Core Enhancement
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py, tensorrt_llm/runtime/kv_cache_manager_v2/__init__.pyi
Major expansion with lifecycle tracking, living cache management, moving averages, shutdown() method, and ratio-driven adjustment logic (need_adjustment, adjust, _adjust_level, _gather_persistent_pages); get_aggregated_pages refactored to use BufferId and emit larger contiguous buffers.
KVCache Statistics Integration
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py
Added Average-based tracking fields (_avg_history_length, _avg_capacity) and lifecycle management (registration with manager, updating averages on resize/close).
Storage Layer Refactoring
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py, tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py
TypedIndexList migration replacing homogeneous tuple collections; refactored slot allocation with expand/prepare_for_shrink/finish_shrink flow; generalized pool group construction with typed helpers; updated pool management to compute quotas dynamically.
Storage Manager API Expansion
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
Added destroy() lifecycle method, and new public methods: new_slots_for_pool_group, get_ratio_list, shrink_pool_group, expand_pool_group, adjust_cache_level; updated signatures to use TypedIndexList for slot configurations; extended _batched_migrate with defrag parameter.
Memory Management & Utilities
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py
Replaced aligned allocator with mmap-based allocation (_mmap, _mremap, _munmap); added mmap constants (MAP_PRIVATE, MAP_ANONYMOUS, etc.); changed HostMem.ALIGNMENT from 2MB to 4KB; added typed_range and updated make_typed signature to accept Callable[[Index], T].
Eviction Controller & Page Handling
tensorrt_llm/runtime/kv_cache_manager_v2/_eviction_controller/_eviction_controller.py, tensorrt_llm/runtime/kv_cache_manager_v2/_page.py
Added __iter__ methods to eviction policies (LRUEvictionPolicy, PrioritizedEvictionPolicy) enabling page iteration; introduced page_iterator(pg_idx) method in PerLevelEvictionController; refactored _page.py to use life-cycle-to-pool-group mapping.
RawRef Enhancement
tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.pyi, tensorrt_llm/runtime/kv_cache_manager_v2/rawref/rawrefmodule.c
Added __hash__ method to ReferenceType for hashability; replaced internal validity tracking from int flag to object_id (Py_ssize_t) with 0 sentinel for invalid state.
Miscellaneous Updates
tensorrt_llm/runtime/kv_cache_manager_v2/_cuda_virt_mem.py
Added docstring documenting phys_mem_size parameter in PooledPhysMemAllocator.
Test Infrastructure & Kernels
tests/unittest/kv_cache_manager_v2_tests/fake_engine.py, tests/unittest/kv_cache_manager_v2_tests/kernels.py, tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py
Updated FakeEngine to accept num_heads parameter; refactored CUDA kernels (fillValues, checkValues) for per-head token indexing; removed sleep_time parameter; removed BufferSlice usage; added shutdown() calls in test teardown; introduced TestResizeQuota test class for quota resizing scenarios.
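
To make the mmap-based allocation row above concrete, here is an illustrative sketch of anonymous mapping through libc with ctypes on Linux. It is not the PR's _utils.py code; the function names alloc_anon and free_anon are hypothetical, and use_errno=True is included because a review comment below flags its absence.

import ctypes
import ctypes.util
import mmap  # used only for the PROT_*/MAP_* constants

# Load libc with use_errno=True so ctypes.get_errno() reflects the last libc call.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_long]
_libc.munmap.restype = ctypes.c_int
_libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value  # (void *)-1


def alloc_anon(size: int) -> int:
    """Map `size` bytes of private anonymous memory and return its address."""
    addr = _libc.mmap(None, size,
                      mmap.PROT_READ | mmap.PROT_WRITE,
                      mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                      -1, 0)
    if addr == MAP_FAILED:
        err = ctypes.get_errno()
        raise OSError(err, f"mmap failed with errno {err}")
    return addr


def free_anon(addr: int, size: int) -> None:
    """Unmap a region previously returned by alloc_anon."""
    if _libc.munmap(addr, size) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"munmap failed with errno {err}")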

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant KVM as KVCacheManager
    participant SM as StorageManager
    participant EC as EvictionController
    participant PG as PoolGroup

    App->>KVM: need_adjustment()
    KVM->>KVM: Compare current vs target<br/>ratios per level
    
    alt Adjustment needed
        App->>KVM: adjust()
        KVM->>KVM: _try_update_target_ratios()
        KVM->>KVM: Get living cache stats<br/>(history, capacity)
        
        KVM->>KVM: _adjust_level(GPU_LEVEL)
        KVM->>EC: page_iterator(pool_group_idx)
        EC-->>KVM: Iterator[EvictablePage]
        
        KVM->>KVM: _gather_persistent_pages()
        KVM->>SM: shrink_pool_group() or<br/>expand_pool_group()
        SM->>PG: Update pool capacity
        SM->>PG: Allocate/deallocate slots
        PG-->>SM: Confirmation
        SM-->>KVM: Complete
        
        KVM->>KVM: Repeat for HOST_LEVEL,<br/>DISK_LEVEL as needed
    end
    
    KVM-->>App: Adjustment complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • jiaganc
  • yizhang-nv
  • chuangz0
  • lfr-0531
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 17.49%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Description check (❓ Inconclusive): the PR description is minimal but present; it states the main objective (enabling quota resize) and mentions bug fixes, but lacks detail on testing, specific changes, and the template checklist items. Resolution: expand the description to cover which bugs were fixed, which tests validate the changes, and the PR checklist items (guidelines, dependencies, CODEOWNERS, documentation updates).
✅ Passed checks (2 passed)
  • Title check (✅ Passed): the PR title clearly identifies the JIRA ticket, feature type, and main change: implementing dynamic quota resize for KVCacheManager v2.
  • Merge Conflict Detection (✅ Passed): no merge conflicts detected when merging into main.


coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py (1)

752-795: ⚠️ Potential issue | 🟡 Minor

_ratio_to_slot_count_list: potential division by zero if remaining ratios sum to zero.

Line 774: pct = ratio_list[pg] / sum(ratio_list[j] for j in pg_idx_lst[i:]) will raise ZeroDivisionError if all remaining ratios are 0. While unlikely in practice, callers could pass degenerate ratio lists. Consider adding a guard or an upfront assertion that all ratios are positive.
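
For illustration, the guard could be as simple as the following sketch; ratio_list, pg_idx_lst, i, and pg are the names from the quoted line, everything else is hypothetical.

# Hedged sketch: fail loudly instead of dividing by a zero remaining-ratio sum.
remaining = sum(ratio_list[j] for j in pg_idx_lst[i:])
if remaining <= 0:
    raise ValueError("remaining pool-group ratios sum to zero; all ratios must be positive")
pct = ratio_list[pg] / remaining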

tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py (1)

355-361: ⚠️ Potential issue | 🟡 Minor

Missing f-prefix on error message strings — {pg_idx} and {goal} will appear literally.

These strings use {pg_idx} and {goal} but are not f-strings, so the variable values won't be interpolated. While these are pre-existing lines, they're in a method modified by this PR.

Proposed fix
                     raise OutOfPagesError(
-                        "Too many held pages are being evicted to the last-level cache for group {pg_idx}"
+                        f"Too many held pages are being evicted to the last-level cache for group {pg_idx}"
                     )
             if old_free_cnt + evictable_cnt - fallen_held_cnt < goal:
                 raise OutOfPagesError(
-                    "Impossible to meet the goal ({goal} free slots) for group {pg_idx}"
+                    f"Impossible to meet the goal ({goal} free slots) for group {pg_idx}"
                 )
🤖 Fix all issues with AI agents
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py`:
- Around line 457-470: In ratio_from_length, protect against total == 0 before
computing x / total to avoid ZeroDivisionError: after computing total =
sum(num_bytes) add a guard that if total == 0 return a TypedIndexList where each
pool group gets an even share (e.g., 1.0 / num_pool_groups) or explicit zeros as
appropriate for downstream code; implement this using typed_fill/filled_list or
typed_map to produce the fallback list and otherwise proceed with return
typed_map(num_bytes, lambda x: x / total). Ensure you reference
ratio_from_length, num_bytes, total, num_pool_groups, and typed_map when making
the change.
- Around line 407-425: _gather_persistent_pages incorrectly asserts all living
caches are SUSPENDED which fails when _adjust_level is invoked from resize()
while caches may be ACTIVE; change the behavior in _gather_persistent_pages
(refer to function _gather_persistent_pages and symbol _living_kv_caches) to not
assert kv_cache.status == _KVCache.Status.SUSPENDED—either skip entries where
kv_cache.status != _KVCache.Status.SUSPENDED (continue) or explicitly handle
ACTIVE caches (e.g., suspend them safely before iterating) so the function no
longer raises on resize() calls that run this path.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py`:
- Around line 1-6: This file is missing the required NVIDIA Apache-2.0 copyright
header; add the standard NVIDIA copyright and Apache-2.0 license header (with
the year of latest meaningful modification) at the top of the file before the
class MovingAverage definition so the file contains the required license text
and copyright attribution.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py`:
- Around line 509-520: The destroy() method currently returns early when
allocator._capacity == 0 and therefore never destroys the pools or sets
self._destroyed; change the logic in destroy() (method name: destroy, fields:
_slot_allocator, _pools, _destroyed, allocator._capacity) so that when capacity
== 0 you skip allocator-specific calls
(synchronize/prepare_for_shrink/finish_shrink) but still iterate over
self._pools and call pool.destroy() and then set self._destroyed = True before
returning; ensure the normal path still does allocator._synchronize() and
allocator.prepare_for_shrink/finish_shrink and sets _destroyed at the end.
- Around line 681-691: The ratio_list method can divide by zero when total == 0;
update ratio_list (in _core.py) to check if total is zero before the division
loop and handle that case (for example, return the zero-filled list `ret`
immediately or otherwise avoid the division) so no ZeroDivisionError occurs;
keep the rest of the logic the same and reference the existing symbols:
ratio_list, num_pool_groups, self._pool_groups, total, and ret.
- Around line 395-402: In expand (method expand) add a guard to ensure you don't
discard an in-progress shrink: before resizing and setting
self._target_capacity, assert or raise unless self._target_capacity ==
self._capacity (the same check used in prepare_for_shrink), so expand
refuses/alerts if a shrink is active and thus avoids silently orphaning
self._overflow_slots; locate the symbols expand, _target_capacity, _capacity,
_overflow_slots and prepare_for_shrink and implement the check (or explicit
handling) to preserve or explicitly cancel any active shrink instead of
unconditionally resetting _target_capacity.

In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py`:
- Line 346: The CDLL load is missing use_errno=True so ctypes.get_errno()
returns stale/zero values; update the _libc initialization (the ctypes.CDLL call
that assigns _libc) to pass use_errno=True so errno is captured for subsequent
calls in _mmap, _munmap, _mremap, and _madvise, ensuring those functions read
the correct error codes via ctypes.get_errno().

In `@tensorrt_llm/runtime/kv_cache_manager_v2/rawref/rawrefmodule.c`:
- Around line 94-103: ReferenceType_hash currently returns self->object_id
directly which can be -1, but tp_hash uses -1 to signal errors; update
ReferenceType_hash to detect when (Py_hash_t)self->object_id == -1 and remap
that value to -2 before returning. Locate the ReferenceType_hash function
operating on ReferenceTypeObject and ensure it still raises a RuntimeError when
self->object_id == 0, but after computing the hash cast to Py_hash_t, replace -1
with -2 to comply with the CPython hashing protocol.
🧹 Nitpick comments (13)
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py (3)

264-265: Variable shadowing: loop variable Index shadows the module-level TypeVar.

The list comprehension [generator(Index) for Index in typed_range(count)] uses Index as the loop variable, which shadows the TypeVar("Index") defined at line 60. This works at runtime but violates the coding guideline to "avoid shadowing variables declared in an outer scope" and is confusing to readers.

Proposed fix
 def make_typed(generator: Callable[[Index], T], count: Index) -> TypedIndexList[Index, T]:
-    return cast(TypedIndexList[Index, T], [generator(Index) for Index in typed_range(count)])
+    return cast(TypedIndexList[Index, T], [generator(i) for i in typed_range(count)])

As per coding guidelines: "Avoid shadowing variables declared in an outer scope".


374-377: Missing stacklevel in warnings.warn.

Per Ruff B028, warnings.warn without stacklevel defaults to 1, which points to this helper function rather than the caller. Use stacklevel=2 so the warning points to the actual call site.

Proposed fix
         warnings.warn(
             f"madvise failed with errno {error_code}: {errno.errorcode.get(error_code, 'Unknown error')}"
+            , stacklevel=2
         )

406-410: Missing stacklevel=2 in _munmap warning.

Same issue as _madvise — the warning will attribute to this function rather than the caller.

Proposed fix
-        warnings.warn(f"munmap failed with errno {error_code}")
+        warnings.warn(f"munmap failed with errno {error_code}", stacklevel=2)
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py (1)

38-40: Average.value will raise ZeroDivisionError if called before any update().

Current usage always calls update() before accessing value, so this is safe in practice. However, a guard would make the class more robust.

Proposed defensive fix
     `@property`
     def value(self) -> float:
+        if self.count == 0:
+            return 0.0
         return self.sum / self.count
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py (1)

944-944: Address the FIXME: unused _ parameter in get_num_matched_tokens.

The # @fixme: remove the _ parameter comment suggests this is a known workaround. The function is called with matched as an argument (lines 978, 1001, 1026) but ignores it — it captures matched from the enclosing scope via closure instead.

Would you like me to open an issue to track cleaning up this parameter?

tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py (1)

1097-1098: Local GPU_LEVEL shadows the module-level import.

GPU_LEVEL is already imported at module scope (line 52/93). The local redefinition GPU_LEVEL = CacheLevel(0) creates the same value but shadows the import, which the coding guidelines recommend avoiding. HOST_LEVEL is fine since it's not imported.

Proposed fix
-        GPU_LEVEL = CacheLevel(0)
         HOST_LEVEL = CacheLevel(1)
         # Shrink the gpu quota
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (1)

384-395: zip() without strict=True on ratio comparison.

Line 390 zips two TypedIndexList[PoolGroupIndex, float] without strict=True. These should always have equal length, so adding strict=True would catch bugs early (as Ruff B905 also flags).

Proposed fix
-            return any(not (1 / thres < x / y < thres) for x, y in zip(a, b))
+            return any(not (1 / thres < x / y < thres) for x, y in zip(a, b, strict=True))
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py (3)

647-650: Unused __init__ parameters total_quota and ratio_list.

CacheLevelStorage.__init__ accepts total_quota and ratio_list but never uses them. All subclasses pass these values via super().__init__(), creating a false impression that the base class stores or validates them. Consider either removing these parameters or adding validation logic here (e.g., asserting total_quota > 0, sum(ratio_list) ≈ 1.0).

Option A: Remove unused params
-    def __init__(self, total_quota: int, ratio_list: TypedIndexList[PoolGroupIndex, float]) -> None:
+    def __init__(self) -> None:
         if not hasattr(self.__class__, "TIER"):
             raise ValueError(f"{self.__class__.__name__} must define 'TIER' as a class variable")

This would require updating all super().__init__(total_quota, init_ratio) calls in subclasses.


429-451: finish_shrink return type is misleading — it returns True or raises, never False.

The -> bool return type suggests the caller can check success/failure, but the method always returns True on success and raises RuntimeError on failure. This can mislead callers into writing if not allocator.finish_shrink(): ... which is dead code.

Consider changing the return type to -> None (and removing return True), or to -> Literal[True], to accurately signal the contract.

Proposed fix (return None)
-    def finish_shrink(self) -> bool:
+    def finish_shrink(self) -> None:
         assert NDEBUG or self._check()
         if (
             self.shrink_in_progress
             and self._target_capacity + len(self._overflow_slots) == self._num_active_slots
         ):
             ...
             self._scrub_events()
             assert NDEBUG or self._check()
-            return True
+            return
         raise RuntimeError("shrink can't be finished")

489-493: Potential infinite loop in _synchronize.

_synchronize spins until all recycled slots are ready, but _scrub_events only advances _num_ready_recycled_slots by scanning forward from the current ready frontier and stopping at the first non-ready event. If a slot in the middle has a long-running or stuck CUDA event, this will busy-wait indefinitely with no timeout and no backoff.

This may be intentional for a synchronization primitive, but consider adding a safety timeout or at minimum a time.sleep to avoid CPU burn.
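
A hedged sketch of what such a bounded wait could look like; the predicate name all_recycled_ready and the timeout values are purely illustrative, not part of the PR.

import time

def wait_until(all_recycled_ready, timeout_s: float = 30.0, poll_s: float = 1e-4) -> None:
    # Bounded busy-wait with a light sleep instead of spinning hot.
    deadline = time.monotonic() + timeout_s
    while not all_recycled_ready():
        if time.monotonic() > deadline:
            raise TimeoutError("recycled slots did not become ready before the timeout")
        time.sleep(poll_s)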

tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py (3)

253-266: Exception handler in new_slots_for_pool_group has no cleanup — unlike new_slots.

In new_slots (line 244), the except block releases all partially allocated slots. Here (line 264), the except block only warns and re-raises. Since allocate_multiple is all-or-none and the assertion on line 261 should guarantee success, this is likely safe. But if the intent is a safety net, it should match the pattern in new_slots.

Also, per Ruff B028, warnings.warn should include stacklevel=2 so the warning points to the caller.

Proposed fix
         try:
             return storage.allocate_multiple(pg_idx, num_slots)
         except Exception:
-            warnings.warn("Exception not expected here. Please report a bug.")
+            warnings.warn("Exception not expected here. Please report a bug.", stacklevel=2)
             raise

573-621: shrink_pool_group — complex logic with subtle eviction/defrag interplay; a few observations.

  1. Line 600-603: The while-loop condition len(overflow_slots) + num_overflow_persistent > min(new_num_slots, overflow_slots[0][0] + allocator.num_free_slots) relies on the eviction controller's iterator index (overflow_slots[0][0]) corresponding to the eviction order used by force_evict. If these orderings diverge, the computed min_num_evicted will be wrong, potentially leaving the shrink unable to complete. This coupling is fragile and deserves a comment or assertion.

  2. Line 616-620: The assertion len(allocator._overflow_slots) == allocator._num_active_slots - allocator._target_capacity directly accesses allocator internals. Consider exposing a method on SlotAllocator (e.g., is_ready_to_finish_shrink()) to encapsulate this invariant check.

Suggestion: Add a clarifying comment for the eviction loop
         allocator.prepare_for_shrink(new_num_slots)
         min_num_evicted = 0
+        # Evict pages from the front of the eviction order until all remaining
+        # overflow pages (evictable + persistent) can be relocated into [0, new_num_slots).
         while overflow_slots and len(overflow_slots) + num_overflow_persistent > min(
             new_num_slots, overflow_slots[0][0] + allocator.num_free_slots
         ):
             min_num_evicted = overflow_slots.popleft()[0] + 1
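
For point 2, a hedged sketch of the encapsulating helper suggested above; the method name is hypothetical and the expression simply mirrors the assertion quoted in that point.

# On SlotAllocator (hypothetical method; mirrors the assertion referenced above):
def is_ready_to_finish_shrink(self) -> bool:
    return len(self._overflow_slots) == self._num_active_slots - self._target_capacity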

631-663: adjust_cache_level shrinks before expanding — good memory hygiene.

Shrinking first frees resources that may be needed for expanding other pool groups. The round-up of new_quota on line 645 ensures alignment with pool size granularity.

One note: if shrink_pool_group fails (raises) for one pool group, subsequent groups won't be processed, leaving the cache level in a partially resized state. The caller should be aware of this non-atomic behavior. A brief docstring note would help.

Docstring enhancement
     def adjust_cache_level(
         self,
         level: CacheLevel,
         new_quota: int | None,
         new_ratio_list: TypedIndexList[PoolGroupIndex, float],
         persistent_pages: TypedIndexList[PoolGroupIndex, list[Page]] | None = None,
     ) -> None:
-        """Adapt the cache level by adjusting the ratio list. Persistent pages are those held and not evictable."""
+        """Adapt the cache level by adjusting the ratio list. Persistent pages are those held and not evictable.
+
+        Note: This operation is not atomic. If a shrink or expand fails for one pool group,
+        the cache level may be left in a partially resized state.
+        """

Comment on lines +407 to +425
def _gather_persistent_pages(self) -> TypedIndexList[PoolGroupIndex, list[Page]]:
    last_level = self._storage.num_cache_levels - 1
    lc2pg = self._storage._life_cycle_grouping
    ret = make_typed(lambda _: list[Page](), self._storage.num_pool_groups)
    for r in self._living_kv_caches:
        kv_cache = unwrap_rawref(r)
        assert kv_cache.status == _KVCache.Status.SUSPENDED
        for block in kv_cache._blocks:
            for beam in block.pages:
                for lc, holder in typed_enumerate(beam):
                    if holder is None:
                        continue
                    assert type(holder) is _PageHolder
                    page = holder.page
                    assert page.status == PageStatus.HELD
                    assert page.scheduled_for_eviction == (page.cache_level != last_level)
                    if not page.scheduled_for_eviction:
                        ret[lc2pg[lc]].append(holder.page)
    return ret

⚠️ Potential issue | 🟠 Major

_gather_persistent_pages asserts all living caches are SUSPENDED.

This is correct for the adjust() path (line 438-439 also asserts this), but _adjust_level is also called from resize() (line 242) where caches may be ACTIVE. If level == num_cache_levels - 1 in a resize() call, _gather_persistent_pages would fire and could hit the assertion on line 413.

#!/bin/bash
# Verify whether resize() can be called when caches are active and whether
# the last-level path in _adjust_level is reachable from resize()
rg -n '_adjust_level\|_gather_persistent_pages\|def resize' --type=py -g '*_kv_cache_manager*'
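
A hedged sketch of one way to address this (skip non-SUSPENDED caches rather than assert), using the names from the snippet above; whether skipping or suspending ACTIVE caches first is the right call is for the author to decide.

# Sketch only: tolerate ACTIVE caches on the resize() path instead of asserting.
for r in self._living_kv_caches:
    kv_cache = unwrap_rawref(r)
    if kv_cache.status != _KVCache.Status.SUSPENDED:
        continue  # skip ACTIVE caches rather than raising
    ...  # walk blocks/pages exactly as in the snippet above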

Comment on lines +457 to +470
def ratio_from_length(
    history_length: int, capacity: int
) -> TypedIndexList[PoolGroupIndex, float]:
    num_blocks = div_up(capacity, tokens_per_blocks)
    num_bytes = filled_list(0.0, num_pool_groups)
    for lc_idx, lc in typed_enumerate(life_cycles):
        stale_beg, stale_end = _KVCache._get_stale_range(
            tokens_per_blocks, history_length, lc
        )
        pg_idx = lc2pg[lc_idx]
        slot_size = storage.slot_size(pg_idx)
        num_bytes[pg_idx] += (num_blocks - (stale_end - stale_beg)) * sum(slot_size)
    total = sum(num_bytes)
    return typed_map(num_bytes, lambda x: x / total)

⚠️ Potential issue | 🟡 Minor

Potential ZeroDivisionError in ratio_from_length when total == 0.

On line 470, x / total will raise ZeroDivisionError if all pool groups end up with zero effective bytes (e.g., all blocks are stale). While this is unlikely in practice for avg_capacity > 0, it's worth adding a guard.

Proposed guard
             total = sum(num_bytes)
+            if total == 0:
+                return filled_list(1.0 / num_pool_groups, num_pool_groups)
             return typed_map(num_bytes, lambda x: x / total)

Comment on lines +1 to +6
class MovingAverage:
    __slots__ = ("decay", "avg", "weight", "num_updates")
    decay: float
    avg: float
    weight: float
    num_updates: int

⚠️ Potential issue | 🟡 Minor

Missing NVIDIA copyright header.

This is a new file and must include the NVIDIA Apache 2.0 copyright header. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 class MovingAverage:
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 2-2: MovingAverage.__slots__ is not sorted

Apply a natural sort to MovingAverage.__slots__

(RUF023)


Comment on lines +395 to 402
def expand(self, new_num_slots: int) -> None:
    assert NDEBUG or self._check()
    old_num_slots = self.num_slots
    if new_num_slots < self.num_slots and self._occupied_mask.any_set(
        new_num_slots, self.num_slots
    ):
        raise ResourceBusyError("resize cannot remove occupied slots")
    old_num_slots = self._capacity
    assert new_num_slots > old_num_slots
    self._occupied_mask.resize(new_num_slots)
    self._capacity = new_num_slots
    self._num_active_slots = min(self._num_active_slots, new_num_slots)
    if new_num_slots < old_num_slots:
        new_recycled_slots = deque[Slot]()
        new_num_ready_recycled_slots = 0
        for idx_recycled, slot in enumerate(self._recycled_slots):
            assert type(slot) is Slot and slot.has_valid_slot
            if slot.slot_id >= new_num_slots:
                slot.ready_event.synchronize()
                slot._slot_id = None
                slot.ready_event = CachedCudaEvent.NULL
            else:
                new_recycled_slots.append(slot)
                if idx_recycled < self._num_ready_recycled_slots:
                    new_num_ready_recycled_slots += 1
        self._recycled_slots = new_recycled_slots
        self._num_ready_recycled_slots = new_num_ready_recycled_slots
        self._scrub_events()
    self._target_capacity = self._capacity
    assert NDEBUG or self._check()

⚠️ Potential issue | 🟠 Major

expand resets _target_capacity — verify no in-progress shrink is active.

expand unconditionally sets _target_capacity = _capacity on line 401, but there's no assertion that _target_capacity == _capacity before expanding (unlike prepare_for_shrink which asserts it on line 406). If expand is called while a shrink is in progress, it would silently discard the shrink state and orphan any _overflow_slots.

Proposed fix: add a guard
     def expand(self, new_num_slots: int) -> None:
         assert NDEBUG or self._check()
         old_num_slots = self._capacity
         assert new_num_slots > old_num_slots
+        assert not self.shrink_in_progress, "Cannot expand while a shrink is in progress"
         self._occupied_mask.resize(new_num_slots)
         self._capacity = new_num_slots
         self._target_capacity = self._capacity
         assert NDEBUG or self._check()

Comment on lines 509 to +520
def destroy(self) -> None:
if self._slot_allocator._capacity == 0:
if self._destroyed:
return
allocator = self._slot_allocator
if allocator._capacity == 0:
return
self._slot_allocator._synchronize()
allocator._synchronize()
for pool in self._pools:
pool.destroy()
self._slot_allocator.resize(0)
allocator.prepare_for_shrink(0)
allocator.finish_shrink()
self._destroyed = True

⚠️ Potential issue | 🟡 Minor

destroy() skips pool cleanup when _capacity == 0.

Line 513-514: if allocator._capacity == 0, destroy() returns before destroying pools or marking _destroyed = True. On a subsequent call, _destroyed is still False, so it enters again and hits the same early return — pools are never destroyed and _destroyed is never set. This can leak host/disk resources in edge cases where a PoolGroupBase was constructed with num_slots=0.

Proposed fix
     def destroy(self) -> None:
         if self._destroyed:
             return
         allocator = self._slot_allocator
         if allocator._capacity == 0:
+            for pool in self._pools:
+                pool.destroy()
+            self._destroyed = True
             return
         allocator._synchronize()
         for pool in self._pools:
             pool.destroy()
         allocator.prepare_for_shrink(0)
         allocator.finish_shrink()
         self._destroyed = True

Comment on lines +681 to +691
def ratio_list(self) -> TypedIndexList[PoolGroupIndex, float]:
    num_pool_groups = self.num_pool_groups
    ret = filled_list(0.0, num_pool_groups)
    total = 0
    for i, pg in typed_enumerate(self._pool_groups):
        size = pg.num_bytes
        total += size
        ret[i] = size
    for i in typed_range(num_pool_groups):
        ret[i] /= total
    return ret

⚠️ Potential issue | 🟠 Major

Division by zero when all pools have zero bytes.

If total is 0 (e.g., after a destroy() or when all pool groups are empty), ret[i] /= total on line 690 will raise ZeroDivisionError.

Proposed fix
         for i, pg in typed_enumerate(self._pool_groups):
             size = pg.num_bytes
             total += size
             ret[i] = size
-        for i in typed_range(num_pool_groups):
-            ret[i] /= total
+        if total > 0:
+            for i in typed_range(num_pool_groups):
+                ret[i] /= total
+        else:
+            # Equal ratios when no bytes are allocated
+            uniform = 1.0 / int(num_pool_groups) if int(num_pool_groups) > 0 else 0.0
+            for i in typed_range(num_pool_groups):
+                ret[i] = uniform
         return ret

MREMAP_MAYMOVE: Final[int] = 1
MAP_FAILED: Final[int] = -1

_libc = ctypes.CDLL(find_library("c"))

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file and find the context around line 346
head -400 tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py | tail -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 3048


🏁 Script executed:

# Search for all uses of ctypes.get_errno() in the file
rg "ctypes\.get_errno|get_errno" tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py -n

Repository: NVIDIA/TensorRT-LLM

Length of output: 238


🏁 Script executed:

# Check the entire file structure to understand the functions mentioned
rg "_mmap|_munmap|_mremap|_madvise" tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py -n -A 5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2113


🌐 Web query:

Python ctypes CDLL use_errno parameter get_errno behavior

💡 Result:

ctypes.CDLL(..., use_errno=True) (and similarly CFUNCTYPE(..., use_errno=True)) does not make ctypes.get_errno() read the process’s live C errno directly. Instead, it enables an errno swapping mechanism around that specific ctypes call:

  • ctypes keeps a thread-local “private copy” of errno.
  • For a foreign function invoked via ctypes that was created with use_errno=True:
    1. Before the call: the OS/C errno is swapped with ctypes’ private copy.
    2. After the call: they’re swapped again, so ctypes’ private copy now holds the errno value produced by the C call.
      Then ctypes.get_errno() returns that private copy. [1]

So the practical behavior is:

  • If you call a ctypes function without use_errno=True, ctypes will typically not capture/update its private errno copy for that call, and ctypes.get_errno() may show an old value. [1]
  • ctypes.get_errno() will not reflect errno set by non-ctypes operations (e.g., Python’s os.open()), because it reads ctypes’ private copy, not the live errno. [1][2]
  • Setting use_errno=True on the library (CDLL) tends to apply as the default for functions you access as attributes from that library object (and adds overhead to each call), whereas you can also enable it only for specific callables via function prototypes. [1][2]

Sources: Python ctypes docs on use_errno, get_errno(), and the swap mechanism [1]; explanation/usage notes illustrating that get_errno() reads ctypes’ private copy (and needs use_errno=True on the ctypes call) [2].

[1] (docs.python.org)
[2] (stackoverflow.com)



Critical bug: ctypes.CDLL is missing use_errno=True, so ctypes.get_errno() will always return stale/zero values.

Without use_errno=True, ctypes does not capture errno values from C function calls. Since the code calls ctypes.get_errno() in _mmap, _munmap, _mremap, and _madvise to retrieve error codes, all error diagnostics will be unreliable and display incorrect values.

Proposed fix
-_libc = ctypes.CDLL(find_library("c"))
+_libc = ctypes.CDLL(find_library("c"), use_errno=True)

Comment on lines +94 to +103
/* ReferenceType.__hash__ */
static Py_hash_t ReferenceType_hash(ReferenceTypeObject* self)
{
    if (self->object_id == 0)
    {
        PyErr_SetString(PyExc_RuntimeError, "Reference is invalid");
        return -1;
    }
    return (Py_hash_t) self->object_id;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

CPython tp_hash protocol -1 return value error convention

💡 Result:

In CPython, a type’s tp_hash slot has the same contract as PyObject_Hash():

  • -1 is reserved to mean “error”. A tp_hash implementation must not return -1 as a normal hash value; if hashing fails, it must set an exception and return -1. [1]
  • Correspondingly, PyObject_Hash(o) returns -1 on failure (with an exception set), and PyObject_HashNotImplemented(o) sets TypeError (“unhashable type”) and returns -1 (and is specially recognized when stored in tp_hash). [2]

Because -1 is the error sentinel, CPython ensures a “real” hash is never -1: if the computed hash would be -1, CPython remaps it (commonly to -2) in core hash implementations (e.g., the integer hash code path shown here). [3]

Practical rule when writing tp_hash: if your computed hash equals -1, return -2 (or any other non--1 value) instead; reserve -1 strictly for signaling an exception. [1][3]


🌐 Web query:

CPython long_hash unicode_hash -1 handling implementation

💡 Result:

In CPython, Py_hash_t returning -1 is reserved to mean “error occurred” in the tp_hash / PyObject_Hash() protocol, so CPython must ensure no successful hash computation ever returns -1.

long_hash (hash of int)

Objects/longobject.c:long_hash() computes an integer hash and then rewrites -1 to -2 before returning: if (x == (Py_uhash_t)-1) x = (Py_uhash_t)-2;. This is why hash(-1) == -2 on CPython. [1]

unicode_hash (hash of str)

Objects/unicodeobject.c:unicode_hash() uses the cached per-string hash value; the cache sentinel for “not yet computed” is -1 (if (hash != -1) return hash;). When the hash is computed, it calls the core bytes/string hashing routine (Py_HashBuffer / _Py_HashBytes) and stores the result. [2]

Where the “never return -1” rule is enforced for string/bytes hashing

The core helper _Py_HashBytes() explicitly ensures a computed hash value of -1 is converted to -2 (it returns -2 in that case). [3]

If a Python-level __hash__ returns -1

Even for user-defined classes, the slot wrapper in Objects/typeobject.c applies the same rule: -1 is reserved for errors and is rewritten to -2.

Sources:
[1] Objects/longobject.c (long_hash) showing the -1 to -2 remap. (chromium.googlesource.com)
[2] Objects/unicodeobject.c (unicode_hash) showing cached-hash sentinel -1 and computing via Py_HashBuffer / _Py_HashBytes. (chromium.googlesource.com)
[3] Python/pyhash.c (_Py_HashBytes) showing the -1 to -2 handling. (sources.debian.org)
[4] Objects/typeobject.c (slot_tp_hash) rewriting h == -1 to -2. (chromium.googlesource.com)



🏁 Script executed:

# Search for real examples of how CPython hash functions handle -1
# Let's check if there are any relevant Python documentation or examples
python3 <<'EOF'
import sys
print(f"Python version: {sys.version}")

# Test if -1 is special in hash
try:
    # Create a custom object with hash returning -1
    class BadHash:
        def __hash__(self):
            return -1
    
    obj = BadHash()
    try:
        h = hash(obj)
        print(f"hash() with -1 return: {h}")
    except Exception as e:
        print(f"hash() with -1 return raises: {type(e).__name__}: {e}")
        
except Exception as e:
    print(f"Error in test: {e}")

# Check if -1 is actually reserved
print(f"\nChecking -1 as hash value:")
d = {}
try:
    d[-1] = "test"
    print(f"Can use -1 as dict key: yes")
    print(f"Value: {d[-1]}")
except Exception as e:
    print(f"Cannot use -1 as dict key: {e}")
EOF

Repository: NVIDIA/TensorRT-LLM

Length of output: 251


Hash function doesn't handle the reserved −1 return value.

In CPython, tp_hash returning -1 is reserved to signal an error. If object_id equals -1 (a valid pointer cast in some scenarios), this function will violate the CPython protocol — Python will interpret the return as an error, causing silent failures. CPython's hash implementations (long_hash, unicode_hash, _Py_HashBytes) handle this by remapping -1 to -2.

Proposed fix
 static Py_hash_t ReferenceType_hash(ReferenceTypeObject* self)
 {
     if (self->object_id == 0)
     {
         PyErr_SetString(PyExc_RuntimeError, "Reference is invalid");
         return -1;
     }
-    return (Py_hash_t) self->object_id;
+    Py_hash_t h = (Py_hash_t) self->object_id;
+    if (h == -1)
+        h = -2;
+    return h;
 }

@tensorrt-cicd
Collaborator

PR_Github #35878 [ run ] completed with state SUCCESS. Commit: 50c5697
/LLM/main/L0_MergeRequest_PR pipeline #27708 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again
