[TRTLLM-11513][fix] fix iterator undefined behavior in WindowBlockManager::getFreeBlock offload path by thorjohnsen · Pull Request #12297 · NVIDIA/TensorRT-LLM

thorjohnsen · 2026-03-17T21:47:16Z

Summary

In WindowBlockManager::getFreeBlock, the KV-cache offload path called swapMemoryPoolBlockOffset() before calling claimBlock() on either block. After the swap, getCacheLevel() returns the post-swap cache level, but mFreeBlockIterators[id] still points into the pre-swap queue. claimBlock() then calls .erase() on the wrong std::list — undefined behaviour per C++17 §26.3.10.4.
In practice this silently corrupts mNumFreeBlocksPerLevel and the free-list structure in LRUEvictionPolicy, causing block counts to diverge from reality and eventually producing "No free block found" aborts.
Fix: claim both block (primary) and offloadBlock (secondary) before swapMemoryPoolBlockOffset(), so each claimBlock() call sees the cache level that matches the iterator. This is identical to the ordering already used correctly by WindowBlockManager::offloadBlock() (line 1121).

Root cause (detailed)

LRUEvictionPolicy::claimBlock() (evictionPolicy.cpp):

SizeType32 const cacheLevel = getCacheLevel(block);  // reads isPrimary() — reflects post-swap level
mFreeQueues[cacheLevel][getPriorityIdx(...)].erase(*mFreeBlockIterators[id]);  // iterator is from pre-swap level

Both block and offloadBlock are obtained from the eviction policy's free queues and therefore have valid mFreeBlockIterators. After the swap the iterator/level mismatch affects both of them.
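The ordering issue can be illustrated with a toy model. The sketch below is not the TensorRT-LLM code: `ToyBlock`, `ToyPolicy`, and `claimThenSwap` are hypothetical names, and `std::swap` on a boolean flag stands in for swapMemoryPoolBlockOffset(). It only shows why claiming before the swap keeps each stored iterator paired with the list it was created in.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Toy stand-ins for illustration only; not the real TensorRT-LLM types.
struct ToyBlock
{
    int id;
    bool primary; // true = cache level 0, false = cache level 1
};

class ToyPolicy
{
public:
    // Append the block to the free queue of its current level and remember the iterator.
    void release(ToyBlock const& b)
    {
        auto& q = mQueues[level(b)];
        mIters[b.id] = q.insert(q.end(), b.id);
    }

    // Erase via the stored iterator. This is only valid while level(b) still
    // names the queue in which the iterator was created.
    void claim(ToyBlock const& b)
    {
        mQueues[level(b)].erase(mIters.at(b.id));
        mIters.erase(b.id);
    }

    std::size_t numFree(int lvl) const
    {
        return mQueues[lvl].size();
    }

private:
    static int level(ToyBlock const& b)
    {
        return b.primary ? 0 : 1;
    }

    std::array<std::list<int>, 2> mQueues;
    std::unordered_map<int, std::list<int>::iterator> mIters;
};

// Safe ordering: claim both blocks first, then flip their levels.
inline bool claimThenSwap()
{
    ToyPolicy policy;
    ToyBlock gpu{0, true};
    ToyBlock host{1, false};
    policy.release(gpu);
    policy.release(host);

    policy.claim(gpu);  // erased from queue 0, where its iterator lives
    policy.claim(host); // erased from queue 1, where its iterator lives
    std::swap(gpu.primary, host.primary); // stand-in for swapMemoryPoolBlockOffset()

    return policy.numFree(0) == 0 && policy.numFree(1) == 0;
}
```

Moving the `std::swap` above the two `claim` calls in this toy would make `level(b)` select the other list, which is the mismatch described above.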

API cleanup

I have also updated the API so that it no longer matters whether claimBlock is called before or after swapMemoryPoolBlockOffset. With the cleaned-up API, the reordering fix above is no longer strictly necessary, but I still think it is the correct ordering: we should claim the blocks before changing their state with swapMemoryPoolBlockOffset.
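One way to make the claim order-independent is to record the cache level next to the iterator at release time, so that claiming never re-derives the level from the block's current state. The sketch below is a toy under that assumption and not the actual patch: `SketchBlock`, `LevelTrackingPolicy`, and `claimAfterSwap` are hypothetical names, and returning `false` for a block that is not in a free queue is an illustrative choice.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Toy stand-in for illustration only; not the real TensorRT-LLM block type.
struct SketchBlock
{
    int id;
    bool primary; // true = cache level 0, false = cache level 1
};

class LevelTrackingPolicy
{
public:
    // Record both the level and the iterator at the moment the block is released.
    void release(SketchBlock const& b)
    {
        int const lvl = b.primary ? 0 : 1;
        auto& q = mQueues[lvl];
        mEntries[b.id] = Entry{lvl, q.insert(q.end(), b.id)};
    }

    // Erase from the recorded level, ignoring the block's current state.
    // Returns false if the block is not in any free queue.
    bool claim(SketchBlock const& b)
    {
        auto it = mEntries.find(b.id);
        if (it == mEntries.end())
        {
            return false;
        }
        mQueues[it->second.level].erase(it->second.pos);
        mEntries.erase(it);
        return true;
    }

    std::size_t numFree(int lvl) const
    {
        return mQueues[lvl].size();
    }

private:
    struct Entry
    {
        int level;
        std::list<int>::iterator pos;
    };

    std::array<std::list<int>, 2> mQueues;
    std::unordered_map<int, Entry> mEntries;
};

// Swap first, claim afterwards: still erases from the correct lists.
inline bool claimAfterSwap()
{
    LevelTrackingPolicy policy;
    SketchBlock gpu{0, true};
    SketchBlock host{1, false};
    policy.release(gpu);
    policy.release(host);

    std::swap(gpu.primary, host.primary); // stand-in for swapMemoryPoolBlockOffset()
    bool const ok = policy.claim(gpu) && policy.claim(host);

    return ok && policy.numFree(0) == 0 && policy.numFree(1) == 0;
}
```

The trade-off in this toy is one extra integer per free block in exchange for a claim that does not depend on call order.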

Test plan

Enable KV-cache offloading (--kv_cache_offload_gpu_to_cpu_frac or equivalent) and run multi-request inference — previously this path would eventually corrupt the free-list counters leading to a crash or incorrect "no free blocks" error.
Run existing tests/unittest/batch_manager/test_kv_cache_manager.* suite.
Verify with ASan/UBSan that no std::list iterator misuse is reported in LRUEvictionPolicy::claimBlock.
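When checking for counter drift in a test, a small consistency check between the per-level counters and the actual queue lengths can help. The helper below is a hypothetical sketch, not part of the PR: `countersMatchQueues` is an invented name, and it assumes one flat queue per cache level, whereas the real policy also indexes queues by priority.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical helper: returns true when every per-level counter equals the
// length of the corresponding free queue.
inline bool countersMatchQueues(
    std::vector<std::list<int>> const& queues, std::vector<std::size_t> const& counters)
{
    if (queues.size() != counters.size())
    {
        return false;
    }
    for (std::size_t lvl = 0; lvl < queues.size(); ++lvl)
    {
        if (queues[lvl].size() != counters[lvl])
        {
            return false;
        }
    }
    return true;
}
```

Calling such a check after each claim or release in a unit test would flag the first operation at which a counter and its queue disagree.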

🤖 Generated with Claude Code

Summary by CodeRabbit

Refactor
- Improved KV cache memory block management efficiency during offloading and swapping operations, enhancing resource allocation and ownership tracking.

coderabbitai · 2026-03-17T21:51:14Z

📝 Walkthrough

The getFreeBlock function in the KV cache manager is modified to claim both blocks from the eviction policy before performing the swap operation, then release the secondary block directly into its queue post-swap. This ensures cache level ownership accurately reflects block states throughout the operation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| KV Cache Manager Block Claiming Logic `cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` | Modified `getFreeBlock` to claim blocks before swapping and release the secondary block directly after swap without intermediate claim/release cycles, ensuring correct eviction policy state during offload and swap operations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title directly describes the main fix: addressing undefined behavior in the WindowBlockManager::getFreeBlock offload path, which is the core change in the kvCacheManager.cpp file. |
| Description check | ✅ Passed | PR description comprehensively covers the issue, root cause, fix, and test plan, aligning well with the template requirements. |


nvpohanh · 2026-03-19T02:47:22Z

@eopXD could you review this? thanks

…fload path claimBlock() reads getCacheLevel(block) which is based on isPrimary(). In the offload path, swapMemoryPoolBlockOffset() was called before claimBlock(), so getCacheLevel() returned the post-swap level while mFreeBlockIterators[id] still pointed into the pre-swap queue. Erasing via a std::list::iterator into a different std::list instance is undefined behaviour (C++17 §26.3.10.4); in practice it silently corrupts mNumFreeBlocksPerLevel and the free-list structure. Fix: claim both blocks before the swap, matching the correct ordering already used by WindowBlockManager::offloadBlock() (line 1121). Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

thorjohnsen · 2026-03-23T15:09:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-23T15:16:01Z

PR_Github #39952 [ run ] triggered by Bot. Commit: 8d1d0fd Link to invocation

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen · 2026-03-23T16:02:05Z

/bot kill

tensorrt-cicd · 2026-03-23T16:08:05Z

PR_Github #39957 [ kill ] triggered by Bot. Commit: 98abb50 Link to invocation

tensorrt-cicd · 2026-03-23T16:08:49Z

PR_Github #39957 [ kill ] completed with state SUCCESS. Commit: 98abb50
Successfully killed previous jobs for commit 98abb50

Link to invocation

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

…BlockOffset Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen · 2026-03-23T21:56:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-23T22:02:22Z

PR_Github #39983 [ run ] triggered by Bot. Commit: dcf2408 Link to invocation

tensorrt-cicd · 2026-03-23T23:03:46Z

PR_Github #39983 [ run ] completed with state FAILURE. Commit: dcf2408
/LLM/main/L0_MergeRequest_PR pipeline #31143 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen · 2026-03-24T02:44:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-24T02:49:54Z

PR_Github #40026 [ run ] triggered by Bot. Commit: a5f5326 Link to invocation

tensorrt-cicd · 2026-03-24T09:59:02Z

PR_Github #40026 [ run ] completed with state SUCCESS. Commit: a5f5326
/LLM/main/L0_MergeRequest_PR pipeline #31183 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

thorjohnsen · 2026-03-24T14:34:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-24T14:40:26Z

PR_Github #40134 [ run ] triggered by Bot. Commit: c1425aa Link to invocation

tensorrt-cicd · 2026-03-24T14:45:17Z

PR_Github #40134 [ run ] completed with state FAILURE. Commit: c1425aa

Link to invocation

thorjohnsen · 2026-03-24T14:49:59Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-24T14:56:29Z

PR_Github #40137 [ run ] triggered by Bot. Commit: 06813ab Link to invocation

eopXD

Looks good, but some test coverage would help illustrate the change and provide extra safety against regressions.

…n LRUEvictionPolicy Add ClaimAfterSwapDoesNotCorruptQueues to LRUPolicyTest to guard against reintroduction of the iterator UB fixed in PR NVIDIA#12297. The test releases a primary and a secondary block into their respective free queues, then calls swapMemoryPoolBlockOffset() on both (flipping isPrimary()) before calling claimBlock(). With the old bare-iterator approach this would erase from the wrong std::list; with the fix (stored cacheLevel tuple) the counts and queue integrity remain correct. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

tensorrt-cicd · 2026-03-24T23:37:07Z

PR_Github #40137 [ run ] completed with state SUCCESS. Commit: 06813ab
/LLM/main/L0_MergeRequest_PR pipeline #31283 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

eopXD · 2026-03-25T05:55:29Z

/bot run --disable-fail-fast

thorjohnsen · 2026-03-25T20:47:20Z

/bot run

tensorrt-cicd · 2026-03-25T20:53:10Z

PR_Github #40372 [ run ] triggered by Bot. Commit: e32f060 Link to invocation

tensorrt-cicd · 2026-03-26T05:32:37Z

PR_Github #40372 [ run ] completed with state FAILURE. Commit: e32f060
/LLM/main/L0_MergeRequest_PR pipeline #31471 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

thorjohnsen · 2026-03-28T17:55:44Z

/bot run

tensorrt-cicd · 2026-03-28T18:12:47Z

PR_Github #40551 [ run ] triggered by Bot. Commit: e32f060 Link to invocation

tensorrt-cicd · 2026-03-28T18:12:57Z

PR_Github #40551 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/28.

Link to invocation

thorjohnsen · 2026-03-30T13:24:46Z

/bot run

tensorrt-cicd · 2026-03-30T13:31:03Z

PR_Github #40732 [ run ] triggered by Bot. Commit: e32f060 Link to invocation

thorjohnsen requested a review from a team as a code owner March 17, 2026 21:47

github-actions bot assigned thorjohnsen Mar 17, 2026

thorjohnsen changed the title ~~batch_manager: fix iterator UB in WindowBlockManager::getFreeBlock offload path~~ [TRTLLM-11513][fix] fix iterator undefined behavior in WindowBlockManager::getFreeBlock offload path Mar 18, 2026

thorjohnsen force-pushed the thorjohnsen/fix_block_leak_bugs branch from fa19cda to 8d1d0fd Compare March 19, 2026 15:44

thorjohnsen marked this pull request as draft March 23, 2026 15:03

thorjohnsen marked this pull request as ready for review March 23, 2026 15:10

thorjohnsen marked this pull request as draft March 23, 2026 15:16

thorjohnsen added 2 commits March 23, 2026 10:22

Merge branch 'main' into thorjohnsen/fix_block_leak_bugs

177fb9e

Bug fix

98abb50

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

thorjohnsen added 10 commits March 23, 2026 16:10

Remove dead code

63dc76c

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Record complete info so claimBlock can be called after swapMemoryPool…

937d8a9

…BlockOffset Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

precommit run

226d31a

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Cleaner API

f0c7a69

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

precommit run

4fa25d6

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Fix compile issue

d6304b5

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Disallow calling claimBlock on block that has not been released

acc01af

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

precommit run

975f3e7

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Fix compile issue

b372b94

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Merge branch 'main' into thorjohnsen/fix_block_leak_bugs

dcf2408

thorjohnsen marked this pull request as ready for review March 23, 2026 21:56

thorjohnsen added 2 commits March 23, 2026 23:15

Allow for claimBlock to be called for blocks that are not in free queue

5edf72f

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Remove unused variable

a5f5326

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>

Merge branch 'main' into thorjohnsen/fix_block_leak_bugs

c1425aa

Merge branch 'main' into thorjohnsen/fix_block_leak_bugs

06813ab

thorjohnsen added the KV-Cache Management kv-cache management for efficient LLM inference label Mar 24, 2026

eopXD requested changes Mar 24, 2026


thorjohnsen requested a review from eopXD March 24, 2026 19:09

eopXD approved these changes Mar 25, 2026


thorjohnsen enabled auto-merge (squash) March 25, 2026 20:47

nvpohanh approved these changes Mar 30, 2026

