[TRTLLM-7836][feat] Implement dynamic quota resize for KVCacheManager v2 #11503
base: main
Conversation
Signed-off-by: Yao Yao <[email protected]>
/bot run --disable-fail-fast
PR_Github #35878 [ run ] triggered by Bot. Commit:
📝 Walkthrough
Comprehensive refactoring of the KVCacheManagerV2 system from a static to a dynamic, eviction-aware controller. Key changes include: removing BufferSlice from the public API in favor of BufferId; introducing moving-average statistics (Average, MovingAverage); expanding KVCacheManager with lifecycle tracking, shutdown, and ratio-based adjustment logic; refactoring the storage layer to use TypedIndexList typed collections; implementing mmap-based memory allocation; adding eviction-policy iteration support; and updating test infrastructure for per-head token handling.
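The review does not reproduce the implementation of these statistics, so here is a minimal sketch of what an exponentially decayed moving average with bias correction could look like. The field names mirror the `__slots__` quoted later in this review; the update rule itself is an assumption, not the PR's code.

```python
class MovingAverageSketch:
    """Hypothetical illustration only -- not the class added by this PR."""

    __slots__ = ("decay", "avg", "weight", "num_updates")

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.avg = 0.0      # decayed, bias-corrected running average
        self.weight = 0.0   # total weight accumulated so far
        self.num_updates = 0

    def update(self, value: float) -> None:
        # Exponentially weighted average with bias correction: newer samples
        # dominate, and early estimates are not biased toward zero.
        self.weight = self.decay * self.weight + 1.0
        self.avg += (value - self.avg) / self.weight
        self.num_updates += 1

    @property
    def value(self) -> float:
        return self.avg


ma = MovingAverageSketch(decay=0.5)
for sample in (10.0, 20.0, 30.0):
    ma.update(sample)
print(ma.value)  # weighted toward the most recent samples
```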
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant App as Application
    participant KVM as KVCacheManager
    participant SM as StorageManager
    participant EC as EvictionController
    participant PG as PoolGroup
    App->>KVM: need_adjustment()
    KVM->>KVM: Compare current vs target<br/>ratios per level
    alt Adjustment needed
        App->>KVM: adjust()
        KVM->>KVM: _try_update_target_ratios()
        KVM->>KVM: Get living cache stats<br/>(history, capacity)
        KVM->>KVM: _adjust_level(GPU_LEVEL)
        KVM->>EC: page_iterator(pool_group_idx)
        EC-->>KVM: Iterator[EvictablePage]
        KVM->>KVM: _gather_persistent_pages()
        KVM->>SM: shrink_pool_group() or<br/>expand_pool_group()
        SM->>PG: Update pool capacity
        SM->>PG: Allocate/deallocate slots
        PG-->>SM: Confirmation
        SM-->>KVM: Complete
        KVM->>KVM: Repeat for HOST_LEVEL,<br/>DISK_LEVEL as needed
    end
    KVM-->>App: Adjustment complete
```
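As a rough caller-side sketch of the flow in the diagram (method names taken from the diagram; the exact signatures and the integration point are assumptions, not part of this PR page):

```python
# Hypothetical driver illustrating the diagrammed interaction; the real caller
# (scheduler step, background thread, etc.) is not shown in this review.
def maybe_rebalance(kv_cache_manager) -> None:
    # Cheap check comparing current vs. target per-level ratios.
    if kv_cache_manager.need_adjustment():
        # Re-derives target ratios from living-cache statistics, then walks the
        # GPU/HOST/DISK levels, shrinking or expanding each pool group.
        kv_cache_manager.adjust()
```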
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py (1)
752-795: ⚠️ Potential issue | 🟡 Minor
`_ratio_to_slot_count_list`: potential division by zero if remaining ratios sum to zero. Line 774: `pct = ratio_list[pg] / sum(ratio_list[j] for j in pg_idx_lst[i:])` will raise `ZeroDivisionError` if all remaining ratios are 0. While unlikely in practice, callers could pass degenerate ratio lists. Consider adding a guard or an upfront assertion that all ratios are positive.
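A self-contained sketch of the kind of guard suggested here; the helper name and the even-split fallback are illustrative choices, not code from the PR:

```python
def split_ratio(ratio_list: list[float], idx: int) -> float:
    """Share of ratio_list[idx] among the remaining entries ratio_list[idx:],
    guarding the degenerate case where the remaining ratios sum to zero."""
    remaining = sum(ratio_list[idx:])
    if remaining <= 0.0:
        # All remaining ratios are zero: fall back to an even split instead of
        # raising ZeroDivisionError.
        return 1.0 / len(ratio_list[idx:])
    return ratio_list[idx] / remaining


assert split_ratio([0.5, 0.3, 0.2], 1) == 0.3 / 0.5
assert split_ratio([0.5, 0.0, 0.0], 1) == 0.5  # even split across the two remaining entries
```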
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py (1)
355-361: ⚠️ Potential issue | 🟡 Minor
Missing f-prefix on error message strings — `{pg_idx}` and `{goal}` will appear literally. These strings use `{pg_idx}` and `{goal}` but are not f-strings, so the variable values won't be interpolated. While these are pre-existing lines, they're in a method modified by this PR.
Proposed fix
```diff
 raise OutOfPagesError(
-    "Too many held pages are being evicted to the last-level cache for group {pg_idx}"
+    f"Too many held pages are being evicted to the last-level cache for group {pg_idx}"
 )
 if old_free_cnt + evictable_cnt - fallen_held_cnt < goal:
     raise OutOfPagesError(
-        "Impossible to meet the goal ({goal} free slots) for group {pg_idx}"
+        f"Impossible to meet the goal ({goal} free slots) for group {pg_idx}"
     )
```
🤖 Fix all issues with AI agents
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py`:
- Around line 457-470: In ratio_from_length, protect against total == 0 before
computing x / total to avoid ZeroDivisionError: after computing total =
sum(num_bytes) add a guard that if total == 0 return a TypedIndexList where each
pool group gets an even share (e.g., 1.0 / num_pool_groups) or explicit zeros as
appropriate for downstream code; implement this using typed_fill/filled_list or
typed_map to produce the fallback list and otherwise proceed with return
typed_map(num_bytes, lambda x: x / total). Ensure you reference
ratio_from_length, num_bytes, total, num_pool_groups, and typed_map when making
the change.
- Around line 407-425: _gather_persistent_pages incorrectly asserts all living
caches are SUSPENDED which fails when _adjust_level is invoked from resize()
while caches may be ACTIVE; change the behavior in _gather_persistent_pages
(refer to function _gather_persistent_pages and symbol _living_kv_caches) to not
assert kv_cache.status == _KVCache.Status.SUSPENDED—either skip entries where
kv_cache.status != _KVCache.Status.SUSPENDED (continue) or explicitly handle
ACTIVE caches (e.g., suspend them safely before iterating) so the function no
longer raises on resize() calls that run this path.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py`:
- Around line 1-6: This file is missing the required NVIDIA Apache-2.0 copyright
header; add the standard NVIDIA copyright and Apache-2.0 license header (with
the year of latest meaningful modification) at the top of the file before the
class MovingAverage definition so the file contains the required license text
and copyright attribution.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py`:
- Around line 509-520: The destroy() method currently returns early when
allocator._capacity == 0 and therefore never destroys the pools or sets
self._destroyed; change the logic in destroy() (method name: destroy, fields:
_slot_allocator, _pools, _destroyed, allocator._capacity) so that when capacity
== 0 you skip allocator-specific calls
(synchronize/prepare_for_shrink/finish_shrink) but still iterate over
self._pools and call pool.destroy() and then set self._destroyed = True before
returning; ensure the normal path still does allocator._synchronize() and
allocator.prepare_for_shrink/finish_shrink and sets _destroyed at the end.
- Around line 681-691: The ratio_list method can divide by zero when total == 0;
update ratio_list (in _core.py) to check if total is zero before the division
loop and handle that case (for example, return the zero-filled list `ret`
immediately or otherwise avoid the division) so no ZeroDivisionError occurs;
keep the rest of the logic the same and reference the existing symbols:
ratio_list, num_pool_groups, self._pool_groups, total, and ret.
- Around line 395-402: In expand (method expand) add a guard to ensure you don't
discard an in-progress shrink: before resizing and setting
self._target_capacity, assert or raise unless self._target_capacity ==
self._capacity (the same check used in prepare_for_shrink), so expand
refuses/alerts if a shrink is active and thus avoids silently orphaning
self._overflow_slots; locate the symbols expand, _target_capacity, _capacity,
_overflow_slots and prepare_for_shrink and implement the check (or explicit
handling) to preserve or explicitly cancel any active shrink instead of
unconditionally resetting _target_capacity.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py`:
- Line 346: The CDLL load is missing use_errno=True so ctypes.get_errno()
returns stale/zero values; update the _libc initialization (the ctypes.CDLL call
that assigns _libc) to pass use_errno=True so errno is captured for subsequent
calls in _mmap, _munmap, _mremap, and _madvise, ensuring those functions read
the correct error codes via ctypes.get_errno().
In `@tensorrt_llm/runtime/kv_cache_manager_v2/rawref/rawrefmodule.c`:
- Around line 94-103: ReferenceType_hash currently returns self->object_id
directly which can be -1, but tp_hash uses -1 to signal errors; update
ReferenceType_hash to detect when (Py_hash_t)self->object_id == -1 and remap
that value to -2 before returning. Locate the ReferenceType_hash function
operating on ReferenceTypeObject and ensure it still raises a RuntimeError when
self->object_id == 0, but after computing the hash cast to Py_hash_t, replace -1
with -2 to comply with the CPython hashing protocol.
🧹 Nitpick comments (13)
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py (3)
264-265: Variable shadowing: loop variable `Index` shadows the module-level `TypeVar`.
The list comprehension `[generator(Index) for Index in typed_range(count)]` uses `Index` as the loop variable, which shadows the `TypeVar("Index")` defined at line 60. This works at runtime but violates the coding guideline to "avoid shadowing variables declared in an outer scope" and is confusing to readers.
Proposed fix
```diff
 def make_typed(generator: Callable[[Index], T], count: Index) -> TypedIndexList[Index, T]:
-    return cast(TypedIndexList[Index, T], [generator(Index) for Index in typed_range(count)])
+    return cast(TypedIndexList[Index, T], [generator(i) for i in typed_range(count)])
```
As per coding guidelines: "Avoid shadowing variables declared in an outer scope".
374-377: Missing `stacklevel` in `warnings.warn`.
Per Ruff B028, `warnings.warn` without `stacklevel` defaults to 1, which points to this helper function rather than the caller. Use `stacklevel=2` so the warning points to the actual call site.
Proposed fix
```diff
 warnings.warn(
     f"madvise failed with errno {error_code}: {errno.errorcode.get(error_code, 'Unknown error')}"
+    , stacklevel=2
 )
```
406-410: Missing `stacklevel=2` in `_munmap` warning.
Same issue as `_madvise` — the warning will attribute to this function rather than the caller.
Proposed fix
```diff
-warnings.warn(f"munmap failed with errno {error_code}")
+warnings.warn(f"munmap failed with errno {error_code}", stacklevel=2)
```
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_moving_average.py (1)
38-40: `Average.value` will raise `ZeroDivisionError` if called before any `update()`.
Current usage always calls `update()` before accessing `value`, so this is safe in practice. However, a guard would make the class more robust.
Proposed defensive fix
```diff
 @property
 def value(self) -> float:
+    if self.count == 0:
+        return 0.0
     return self.sum / self.count
```
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py (1)
944-944: Address the FIXME: unused `_` parameter in `get_num_matched_tokens`.
The `#@fixme: remove the _ parameter` comment suggests this is a known workaround. The function is called with `matched` as an argument (lines 978, 1001, 1026) but ignores it — it captures `matched` from the enclosing scope via closure instead. Would you like me to open an issue to track cleaning up this parameter?
tests/unittest/kv_cache_manager_v2_tests/test_kv_cache_manager_v2.py (1)
1097-1098: Local `GPU_LEVEL` shadows the module-level import.
`GPU_LEVEL` is already imported at module scope (line 52/93). The local redefinition `GPU_LEVEL = CacheLevel(0)` creates the same value but shadows the import, which the coding guidelines recommend avoiding. `HOST_LEVEL` is fine since it's not imported.
Proposed fix
```diff
-GPU_LEVEL = CacheLevel(0)
 HOST_LEVEL = CacheLevel(1)
 # Shrink the gpu quota
```
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (1)
384-395: `zip()` without `strict=True` on ratio comparison.
Line 390 zips two `TypedIndexList[PoolGroupIndex, float]` without `strict=True`. These should always have equal length, so adding `strict=True` would catch bugs early (as Ruff B905 also flags).
Proposed fix
```diff
-return any(not (1 / thres < x / y < thres) for x, y in zip(a, b))
+return any(not (1 / thres < x / y < thres) for x, y in zip(a, b, strict=True))
```
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py (3)
647-650: Unused `__init__` parameters `total_quota` and `ratio_list`.
`CacheLevelStorage.__init__` accepts `total_quota` and `ratio_list` but never uses them. All subclasses pass these values via `super().__init__()`, creating a false impression that the base class stores or validates them. Consider either removing these parameters or adding validation logic here (e.g., asserting `total_quota > 0`, `sum(ratio_list) ≈ 1.0`).
Option A: Remove unused params
```diff
-def __init__(self, total_quota: int, ratio_list: TypedIndexList[PoolGroupIndex, float]) -> None:
+def __init__(self) -> None:
     if not hasattr(self.__class__, "TIER"):
         raise ValueError(f"{self.__class__.__name__} must define 'TIER' as a class variable")
```
This would require updating all `super().__init__(total_quota, init_ratio)` calls in subclasses.
429-451: `finish_shrink` return type is misleading — it returns `True` or raises, never `False`.
The `-> bool` return type suggests the caller can check success/failure, but the method always returns `True` on success and raises `RuntimeError` on failure. This can mislead callers into writing `if not allocator.finish_shrink(): ...`, which is dead code. Consider changing the return type to `-> None` (and removing `return True`), or to `-> Literal[True]`, to accurately signal the contract.
Proposed fix (return None)
```diff
-def finish_shrink(self) -> bool:
+def finish_shrink(self) -> None:
     assert NDEBUG or self._check()
     if (
         self.shrink_in_progress
         and self._target_capacity + len(self._overflow_slots) == self._num_active_slots
     ):
         ...
         self._scrub_events()
         assert NDEBUG or self._check()
-        return True
+        return
     raise RuntimeError("shrink can't be finished")
```
489-493: Potential infinite loop in `_synchronize`.
`_synchronize` spins until all recycled slots are ready, but `_scrub_events` only advances `_num_ready_recycled_slots` by scanning forward from the current ready frontier and stopping at the first non-ready event. If a slot in the middle has a long-running or stuck CUDA event, this will busy-wait indefinitely with no timeout and no backoff. This may be intentional for a synchronization primitive, but consider adding a safety timeout or at minimum a `time.sleep` to avoid CPU burn; a bounded-wait sketch is shown below.
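A minimal illustration of the bounded wait suggested above; the helper name, timeout value, and poll interval are assumptions, not code from this PR:

```python
import time


def wait_until(condition, timeout_s: float = 30.0, poll_interval_s: float = 0.001) -> None:
    """Spin on `condition()` with a sleep between polls and a hard timeout.

    Hypothetical helper; the real _synchronize loop would call its own
    readiness check instead of a generic callable.
    """
    deadline = time.monotonic() + timeout_s
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError("timed out waiting for recycled slots to become ready")
        time.sleep(poll_interval_s)  # yield the CPU instead of busy-waiting


# Trivial usage example:
wait_until(lambda: True)
```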
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py (3)
253-266: Exception handler in `new_slots_for_pool_group` has no cleanup — unlike `new_slots`.
In `new_slots` (line 244), the except block releases all partially allocated slots. Here (line 264), the except block only warns and re-raises. Since `allocate_multiple` is all-or-none and the assertion on line 261 should guarantee success, this is likely safe. But if the intent is a safety net, it should match the pattern in `new_slots`. Also, per Ruff B028, `warnings.warn` should include `stacklevel=2` so the warning points to the caller.
Proposed fix
```diff
 try:
     return storage.allocate_multiple(pg_idx, num_slots)
 except Exception:
-    warnings.warn("Exception not expected here. Please report a bug.")
+    warnings.warn("Exception not expected here. Please report a bug.", stacklevel=2)
     raise
```
573-621: `shrink_pool_group` — complex logic with subtle eviction/defrag interplay; a few observations.
Lines 600-603: The while-loop condition `len(overflow_slots) + num_overflow_persistent > min(new_num_slots, overflow_slots[0][0] + allocator.num_free_slots)` relies on the eviction controller's iterator index (`overflow_slots[0][0]`) corresponding to the eviction order used by `force_evict`. If these orderings diverge, the computed `min_num_evicted` will be wrong, potentially leaving the shrink unable to complete. This coupling is fragile and deserves a comment or assertion.
Lines 616-620: The assertion `len(allocator._overflow_slots) == allocator._num_active_slots - allocator._target_capacity` directly accesses allocator internals. Consider exposing a method on `SlotAllocator` (e.g., `is_ready_to_finish_shrink()`) to encapsulate this invariant check.
Suggestion: Add a clarifying comment for the eviction loop
```diff
 allocator.prepare_for_shrink(new_num_slots)
 min_num_evicted = 0
+# Evict pages from the front of the eviction order until all remaining
+# overflow pages (evictable + persistent) can be relocated into [0, new_num_slots).
 while overflow_slots and len(overflow_slots) + num_overflow_persistent > min(
     new_num_slots, overflow_slots[0][0] + allocator.num_free_slots
 ):
     min_num_evicted = overflow_slots.popleft()[0] + 1
```
631-663: `adjust_cache_level` shrinks before expanding — good memory hygiene.
Shrinking first frees resources that may be needed for expanding other pool groups. The round-up of `new_quota` on line 645 ensures alignment with pool size granularity. One note: if `shrink_pool_group` fails (raises) for one pool group, subsequent groups won't be processed, leaving the cache level in a partially resized state. The caller should be aware of this non-atomic behavior. A brief docstring note would help.
Docstring enhancement
```diff
 def adjust_cache_level(
     self,
     level: CacheLevel,
     new_quota: int | None,
     new_ratio_list: TypedIndexList[PoolGroupIndex, float],
     persistent_pages: TypedIndexList[PoolGroupIndex, list[Page]] | None = None,
 ) -> None:
-    """Adapt the cache level by adjusting the ratio list. Persistent pages are those held and not evictable."""
+    """Adapt the cache level by adjusting the ratio list. Persistent pages are those held and not evictable.
+
+    Note: This operation is not atomic. If a shrink or expand fails for one pool group,
+    the cache level may be left in a partially resized state.
+    """
```
```python
def _gather_persistent_pages(self) -> TypedIndexList[PoolGroupIndex, list[Page]]:
    last_level = self._storage.num_cache_levels - 1
    lc2pg = self._storage._life_cycle_grouping
    ret = make_typed(lambda _: list[Page](), self._storage.num_pool_groups)
    for r in self._living_kv_caches:
        kv_cache = unwrap_rawref(r)
        assert kv_cache.status == _KVCache.Status.SUSPENDED
        for block in kv_cache._blocks:
            for beam in block.pages:
                for lc, holder in typed_enumerate(beam):
                    if holder is None:
                        continue
                    assert type(holder) is _PageHolder
                    page = holder.page
                    assert page.status == PageStatus.HELD
                    assert page.scheduled_for_eviction == (page.cache_level != last_level)
                    if not page.scheduled_for_eviction:
                        ret[lc2pg[lc]].append(holder.page)
    return ret
```
_gather_persistent_pages asserts all living caches are SUSPENDED.
This is correct for the adjust() path (line 438-439 also asserts this), but _adjust_level is also called from resize() (line 242) where caches may be ACTIVE. If level == num_cache_levels - 1 in a resize() call, _gather_persistent_pages would fire and could hit the assertion on line 413. A sketch of the skip-based alternative follows the verification script below.
```bash
#!/bin/bash
# Verify whether resize() can be called when caches are active and whether
# the last-level path in _adjust_level is reachable from resize()
rg -n '_adjust_level\|_gather_persistent_pages\|def resize' --type=py -g '*_kv_cache_manager*'
```
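In the same style as the other proposed fixes, a sketch of the skip-based option mentioned above; whether skipping or suspending ACTIVE caches is the right behavior is a design decision for the authors:

```diff
 for r in self._living_kv_caches:
     kv_cache = unwrap_rawref(r)
-    assert kv_cache.status == _KVCache.Status.SUSPENDED
+    if kv_cache.status != _KVCache.Status.SUSPENDED:
+        # resize() can reach this path while caches are still ACTIVE; skip them
+        # so only suspended caches contribute persistent pages.
+        continue
     for block in kv_cache._blocks:
```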
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py` around
lines 407 - 425, _gather_persistent_pages incorrectly asserts all living caches
are SUSPENDED which fails when _adjust_level is invoked from resize() while
caches may be ACTIVE; change the behavior in _gather_persistent_pages (refer to
function _gather_persistent_pages and symbol _living_kv_caches) to not assert
kv_cache.status == _KVCache.Status.SUSPENDED—either skip entries where
kv_cache.status != _KVCache.Status.SUSPENDED (continue) or explicitly handle
ACTIVE caches (e.g., suspend them safely before iterating) so the function no
longer raises on resize() calls that run this path.
```python
def ratio_from_length(
    history_length: int, capacity: int
) -> TypedIndexList[PoolGroupIndex, float]:
    num_blocks = div_up(capacity, tokens_per_blocks)
    num_bytes = filled_list(0.0, num_pool_groups)
    for lc_idx, lc in typed_enumerate(life_cycles):
        stale_beg, stale_end = _KVCache._get_stale_range(
            tokens_per_blocks, history_length, lc
        )
        pg_idx = lc2pg[lc_idx]
        slot_size = storage.slot_size(pg_idx)
        num_bytes[pg_idx] += (num_blocks - (stale_end - stale_beg)) * sum(slot_size)
    total = sum(num_bytes)
    return typed_map(num_bytes, lambda x: x / total)
```
Potential ZeroDivisionError in ratio_from_length when total == 0.
On line 470, x / total will raise ZeroDivisionError if all pool groups end up with zero effective bytes (e.g., all blocks are stale). While this is unlikely in practice for avg_capacity > 0, it's worth adding a guard.
Proposed guard
```diff
 total = sum(num_bytes)
+if total == 0:
+    return filled_list(1.0 / num_pool_groups, num_pool_groups)
 return typed_map(num_bytes, lambda x: x / total)
```
```python
class MovingAverage:
    __slots__ = ("decay", "avg", "weight", "num_updates")
    decay: float
    avg: float
    weight: float
    num_updates: int
```
Missing NVIDIA copyright header.
This is a new file and must include the NVIDIA Apache 2.0 copyright header. As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."
Proposed fix
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 class MovingAverage:
```
🧰 Tools
🪛 Ruff (0.15.0)
[warning] 2-2: MovingAverage.__slots__ is not sorted
Apply a natural sort to MovingAverage.__slots__
(RUF023)
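For reference, the naturally sorted form Ruff (RUF023) is asking for would be:

```python
__slots__ = ("avg", "decay", "num_updates", "weight")
```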
```diff
 def expand(self, new_num_slots: int) -> None:
     assert NDEBUG or self._check()
-    old_num_slots = self.num_slots
-    if new_num_slots < self.num_slots and self._occupied_mask.any_set(
-        new_num_slots, self.num_slots
-    ):
-        raise ResourceBusyError("resize cannot remove occupied slots")
+    old_num_slots = self._capacity
+    assert new_num_slots > old_num_slots
     self._occupied_mask.resize(new_num_slots)
     self._capacity = new_num_slots
-    self._num_active_slots = min(self._num_active_slots, new_num_slots)
-    if new_num_slots < old_num_slots:
-        new_recycled_slots = deque[Slot]()
-        new_num_ready_recycled_slots = 0
-        for idx_recycled, slot in enumerate(self._recycled_slots):
-            assert type(slot) is Slot and slot.has_valid_slot
-            if slot.slot_id >= new_num_slots:
-                slot.ready_event.synchronize()
-                slot._slot_id = None
-                slot.ready_event = CachedCudaEvent.NULL
-            else:
-                new_recycled_slots.append(slot)
-                if idx_recycled < self._num_ready_recycled_slots:
-                    new_num_ready_recycled_slots += 1
-        self._recycled_slots = new_recycled_slots
-        self._num_ready_recycled_slots = new_num_ready_recycled_slots
-        self._scrub_events()
+    self._target_capacity = self._capacity
     assert NDEBUG or self._check()
```
expand resets _target_capacity — verify no in-progress shrink is active.
expand unconditionally sets _target_capacity = _capacity on line 401, but there's no assertion that _target_capacity == _capacity before expanding (unlike prepare_for_shrink which asserts it on line 406). If expand is called while a shrink is in progress, it would silently discard the shrink state and orphan any _overflow_slots.
Proposed fix: add a guard
```diff
 def expand(self, new_num_slots: int) -> None:
     assert NDEBUG or self._check()
     old_num_slots = self._capacity
     assert new_num_slots > old_num_slots
+    assert not self.shrink_in_progress, "Cannot expand while a shrink is in progress"
     self._occupied_mask.resize(new_num_slots)
     self._capacity = new_num_slots
     self._target_capacity = self._capacity
     assert NDEBUG or self._check()
```
```diff
 def destroy(self) -> None:
-    if self._slot_allocator._capacity == 0:
+    if self._destroyed:
         return
+    allocator = self._slot_allocator
+    if allocator._capacity == 0:
+        return
-    self._slot_allocator._synchronize()
+    allocator._synchronize()
     for pool in self._pools:
         pool.destroy()
-    self._slot_allocator.resize(0)
+    allocator.prepare_for_shrink(0)
+    allocator.finish_shrink()
+    self._destroyed = True
```
destroy() skips pool cleanup when _capacity == 0.
Line 513-514: if allocator._capacity == 0, destroy() returns before destroying pools or marking _destroyed = True. On a subsequent call, _destroyed is still False, so it enters again and hits the same early return — pools are never destroyed and _destroyed is never set. This can leak host/disk resources in edge cases where a PoolGroupBase was constructed with num_slots=0.
Proposed fix
```diff
 def destroy(self) -> None:
     if self._destroyed:
         return
     allocator = self._slot_allocator
     if allocator._capacity == 0:
+        for pool in self._pools:
+            pool.destroy()
+        self._destroyed = True
         return
     allocator._synchronize()
     for pool in self._pools:
         pool.destroy()
     allocator.prepare_for_shrink(0)
     allocator.finish_shrink()
     self._destroyed = True
```
```python
def ratio_list(self) -> TypedIndexList[PoolGroupIndex, float]:
    num_pool_groups = self.num_pool_groups
    ret = filled_list(0.0, num_pool_groups)
    total = 0
    for i, pg in typed_enumerate(self._pool_groups):
        size = pg.num_bytes
        total += size
        ret[i] = size
    for i in typed_range(num_pool_groups):
        ret[i] /= total
    return ret
```
Division by zero when all pools have zero bytes.
If total is 0 (e.g., after a destroy() or when all pool groups are empty), ret[i] /= total on line 690 will raise ZeroDivisionError.
Proposed fix
```diff
 for i, pg in typed_enumerate(self._pool_groups):
     size = pg.num_bytes
     total += size
     ret[i] = size
-for i in typed_range(num_pool_groups):
-    ret[i] /= total
+if total > 0:
+    for i in typed_range(num_pool_groups):
+        ret[i] /= total
+else:
+    # Equal ratios when no bytes are allocated
+    uniform = 1.0 / int(num_pool_groups) if int(num_pool_groups) > 0 else 0.0
+    for i in typed_range(num_pool_groups):
+        ret[i] = uniform
 return ret
```
```python
MREMAP_MAYMOVE: Final[int] = 1
MAP_FAILED: Final[int] = -1


_libc = ctypes.CDLL(find_library("c"))
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's examine the file and find the context around line 346
head -400 tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py | tail -100
```
🏁 Script executed:
```bash
# Search for all uses of ctypes.get_errno() in the file
rg "ctypes\.get_errno|get_errno" tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py -n
```
🏁 Script executed:
```bash
# Check the entire file structure to understand the functions mentioned
rg "_mmap|_munmap|_mremap|_madvise" tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py -n -A 5
```
🌐 Web query:
Python ctypes CDLL use_errno parameter get_errno behavior
💡 Result:
`ctypes.CDLL(..., use_errno=True)` (and similarly `CFUNCTYPE(..., use_errno=True)`) does not make `ctypes.get_errno()` read the process's live C `errno` directly. Instead, it enables an errno-swapping mechanism around that specific ctypes call:
- ctypes keeps a thread-local "private copy" of `errno`.
- For a foreign function invoked via ctypes that was created with `use_errno=True`:
  - Before the call: the OS/C `errno` is swapped with ctypes' private copy.
  - After the call: they're swapped again, so ctypes' private copy now holds the `errno` value produced by the C call. `ctypes.get_errno()` then returns that private copy. [1]
So the practical behavior is:
- If you call a ctypes function without `use_errno=True`, ctypes will typically not capture/update its private errno copy for that call, and `ctypes.get_errno()` may show an old value. [1]
- `ctypes.get_errno()` will not reflect `errno` set by non-ctypes operations (e.g., Python's `os.open()`), because it reads ctypes' private copy, not the live `errno`. [1][2]
- Setting `use_errno=True` on the library (CDLL) tends to apply as the default for functions you access as attributes from that library object (and adds overhead to each call), whereas you can also enable it only for specific callables via function prototypes. [1][2]
Sources: Python ctypes docs on use_errno, get_errno(), and the swap mechanism [1]; explanation/usage notes illustrating that get_errno() reads ctypes' private copy (and needs use_errno=True on the ctypes call) [2].
Citations:
- 1: https://docs.python.org/3/library/ctypes.html?utm_source=openai
- 2: https://stackoverflow.com/questions/78966115/how-to-correctly-use-ctypes-get-errno?utm_source=openai
Critical bug: ctypes.CDLL missing use_errno=True — ctypes.get_errno() will always return stale/zero values.
Without use_errno=True, ctypes does not capture errno values from C function calls. Since the code calls ctypes.get_errno() in _mmap, _munmap, _mremap, and _madvise to retrieve error codes, all error diagnostics will be unreliable and display incorrect values.
Proposed fix
```diff
-_libc = ctypes.CDLL(find_library("c"))
+_libc = ctypes.CDLL(find_library("c"), use_errno=True)
```
```c
/* ReferenceType.__hash__ */
static Py_hash_t ReferenceType_hash(ReferenceTypeObject* self)
{
    if (self->object_id == 0)
    {
        PyErr_SetString(PyExc_RuntimeError, "Reference is invalid");
        return -1;
    }
    return (Py_hash_t) self->object_id;
}
```
🧩 Analysis chain
🌐 Web query:
CPython tp_hash protocol -1 return value error convention
💡 Result:
In CPython, a type’s tp_hash slot has the same contract as PyObject_Hash():
- `-1` is reserved to mean "error". A `tp_hash` implementation must not return `-1` as a normal hash value; if hashing fails, it must set an exception and return `-1`. [1]
- Correspondingly, `PyObject_Hash(o)` returns `-1` on failure (with an exception set), and `PyObject_HashNotImplemented(o)` sets `TypeError` ("unhashable type") and returns `-1` (and is specially recognized when stored in `tp_hash`). [2]
Because -1 is the error sentinel, CPython ensures a “real” hash is never -1: if the computed hash would be -1, CPython remaps it (commonly to -2) in core hash implementations (e.g., the integer hash code path shown here). [3]
Practical rule when writing tp_hash: if your computed hash equals -1, return -2 (or any other non--1 value) instead; reserve -1 strictly for signaling an exception. [1][3]
🌐 Web query:
CPython long_hash unicode_hash -1 handling implementation
💡 Result:
In CPython, Py_hash_t returning -1 is reserved to mean “error occurred” in the tp_hash / PyObject_Hash() protocol, so CPython must ensure no successful hash computation ever returns -1.
long_hash (hash of int)
Objects/longobject.c:long_hash() computes an integer hash and then rewrites -1 to -2 before returning: if (x == (Py_uhash_t)-1) x = (Py_uhash_t)-2;. This is why hash(-1) == -2 on CPython. [1]
unicode_hash (hash of str)
Objects/unicodeobject.c:unicode_hash() uses the cached per-string hash value; the cache sentinel for “not yet computed” is -1 (if (hash != -1) return hash;). When the hash is computed, it calls the core bytes/string hashing routine (Py_HashBuffer / _Py_HashBytes) and stores the result. [2]
Where the “never return -1” rule is enforced for string/bytes hashing
The core helper _Py_HashBytes() explicitly ensures a computed hash value of -1 is converted to -2 (it returns -2 in that case). [3]
If a Python-level __hash__ returns -1
Even for user-defined classes, the slot wrapper in Objects/typeobject.c applies the same rule: “-1 is reserved for errors” and is rewritten to -2. [4]
Sources:
[1] Objects/longobject.c (long_hash) showing -1 → -2 remap. (chromium.googlesource.com)
[2] Objects/unicodeobject.c (unicode_hash) showing cached-hash sentinel -1 and computing via Py_HashBuffer / _Py_HashBytes. (chromium.googlesource.com)
[3] Python/pyhash.c (_Py_HashBytes) showing -1 → -2 handling. (sources.debian.org)
[4] Objects/typeobject.c (slot_tp_hash) rewriting h == -1 to -2. (chromium.googlesource.com)
Citations:
- 1: https://chromium.googlesource.com/external/github.com/python/cpython/%2B/refs/tags/v3.7.17/Objects/longobject.c
- 2: https://chromium.googlesource.com/external/github.com/python/cpython/%2B/refs/heads/main/Objects/unicodeobject.c?utm_source=openai
- 3: https://sources.debian.org/src/python3.7/3.7.3-2%2Bdeb10u3/Python/pyhash.c/?utm_source=openai
- 4: https://chromium.googlesource.com/external/github.com/python/cpython/%2B/refs/tags/v3.10.0a2/Objects/typeobject.c?utm_source=openai
🏁 Script executed:
```bash
# Search for real examples of how CPython hash functions handle -1
# Let's check if there are any relevant Python documentation or examples
python3 <<'EOF'
import sys
print(f"Python version: {sys.version}")

# Test if -1 is special in hash
try:
    # Create a custom object with hash returning -1
    class BadHash:
        def __hash__(self):
            return -1

    obj = BadHash()
    try:
        h = hash(obj)
        print(f"hash() with -1 return: {h}")
    except Exception as e:
        print(f"hash() with -1 return raises: {type(e).__name__}: {e}")
except Exception as e:
    print(f"Error in test: {e}")

# Check if -1 is actually reserved
print(f"\nChecking -1 as hash value:")
d = {}
try:
    d[-1] = "test"
    print(f"Can use -1 as dict key: yes")
    print(f"Value: {d[-1]}")
except Exception as e:
    print(f"Cannot use -1 as dict key: {e}")
EOF
```
Hash function doesn't handle the reserved −1 return value.
In CPython, tp_hash returning -1 is reserved to signal an error. If object_id equals -1 (a valid pointer cast in some scenarios), this function will violate the CPython protocol — Python will interpret the return as an error, causing silent failures. CPython's hash implementations (long_hash, unicode_hash, _Py_HashBytes) handle this by remapping -1 to -2.
Proposed fix
```diff
 static Py_hash_t ReferenceType_hash(ReferenceTypeObject* self)
 {
     if (self->object_id == 0)
     {
         PyErr_SetString(PyExc_RuntimeError, "Reference is invalid");
         return -1;
     }
-    return (Py_hash_t) self->object_id;
+    Py_hash_t h = (Py_hash_t) self->object_id;
+    if (h == -1)
+        h = -2;
+    return h;
 }
```
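The same convention is easy to confirm from Python itself; this is known CPython behavior, shown here only to motivate the remap:

```python
# CPython remaps a computed hash of -1 to -2, because -1 is the error sentinel
# in the tp_hash protocol.
assert hash(-1) == -2
assert hash(-2) == -2  # so -1 and -2 collide by design, which is harmless


class FromDunder:
    def __hash__(self):
        return -1  # the slot wrapper rewrites this to -2 as well


assert hash(FromDunder()) == -2
```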
PR_Github #35878 [ run ] completed with state
With this change, users can resize the storage quota for every cache level of the Python KVCacheManager. The change also fixes some pre-existing bugs uncovered during development.
Summary by CodeRabbit
Release Notes
New Features
Refactor
Tests