
[TRTLLM-10061][feat] Add support of linear attention state for C++ KV cache manager#12531

Open
VALLIS-NERIA wants to merge 3 commits into NVIDIA:main from VALLIS-NERIA:user/xiweny/kv_cache_linear_reuse

Conversation

@VALLIS-NERIA (Collaborator) commented Mar 25, 2026

This is part of TRTLLM-10061. This PR only adds the feature to the KV cache manager; it does not integrate it with the runtime or any model. Runtime and model integration will be done in a separate PR.

Summary

  • Extend the C++ KVCacheManager and BlockManager to handle linear attention (Mamba2, GDN, etc.) state blocks
    • Add eviction policy support for placeholder blocks
    • Add LinearAttentionMetadata
    • Adjust some signatures
    • Add copyLinearAttentionBlock, which must be called every iteration
  • Add getTokenCount to address LlmRequest and GenerationRequest losing sync on the number of tokens when the overlap scheduler is used
  • StoreContextBlocks now only stores finished tokens when the context is chunked
  • Add comprehensive unit tests for linear block reuse scenarios
  • Add comprehensive unit tests for linear block reuse scenarios
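Since copyLinearAttentionBlock must be called every iteration, the intended call pattern presumably looks like the following sketch. KVCacheManagerStub, LlmRequestStub, and decodeIteration are hypothetical stand-ins; the real runtime integration lands in a later PR.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real LlmRequest and KVCacheManager types.
struct LlmRequestStub
{
    int requestId;
};

struct KVCacheManagerStub
{
    int copiesPerformed = 0;

    // The real manager snapshots the linear-attention (recurrent) state block
    // for this request; here we only count the calls.
    void copyLinearAttentionBlock(LlmRequestStub const& /*req*/)
    {
        ++copiesPerformed;
    }
};

// One decode iteration: after the forward step, the state copy is mandatory
// for every active request, or the next step would overwrite the state.
inline void decodeIteration(KVCacheManagerStub& manager, std::vector<LlmRequestStub> const& batch)
{
    for (auto const& req : batch)
    {
        manager.copyLinearAttentionBlock(req);
    }
}
```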

Test plan

  • Existing KV cache manager unit tests pass
  • New linear block reuse unit tests pass

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added support for linear attention with recurrent state caching, including placeholder block allocation and management.
    • Introduced cache pool layer-first layout optimization for improved memory organization.
    • Added token count tracking API for individual requests.
  • Bug Fixes

    • Improved cache block eviction policy to properly handle placeholder blocks with negative indices.
    • Enhanced cache validation by introducing null/invalid sentinel handling for cache indices.
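The null/invalid sentinel handling called out above follows a common pattern; below is a simplified stand-in for the modified KVCacheIndex (field names and widths are assumptions, not the actual kvCacheIndex.h).

```cpp
#include <cassert>
#include <cstdint>

// Hedged sketch: a wrapped pool index with a reserved "invalid" value and an
// isNull() query, mirroring the kInvalidPoolIndex/nullIndex additions.
class KVCacheIndex
{
public:
    using UnderlyingType = std::int32_t;
    static constexpr UnderlyingType kInvalidPoolIndex = -1;

    constexpr explicit KVCacheIndex(UnderlyingType value = kInvalidPoolIndex)
        : mValue{value}
    {
    }

    [[nodiscard]] constexpr bool isNull() const
    {
        return mValue == kInvalidPoolIndex;
    }

    [[nodiscard]] constexpr UnderlyingType get() const
    {
        return mValue;
    }

    // Declared in-class, defined out of class: a static member of the class's
    // own type cannot be fully defined while the type is still incomplete.
    static KVCacheIndex const nullIndex;

private:
    UnderlyingType mValue;
};

inline KVCacheIndex const KVCacheIndex::nullIndex{};
```

A consumer can then guard lookups with `if (idx.isNull()) { ... }` instead of comparing against a bare magic number.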

… manager

Add block reuse support for hybrid linear (Mamba) models in the C++ KV
cache manager. This enables prefix caching for recurrent state blocks
alongside standard KV cache blocks, allowing shared prefixes to be
reused across requests.

Key changes:
- Extend KVCacheManager and BlockManager to handle linear state blocks
  with configurable tokens-per-block for recurrent layers
- Add eviction policy support for linear cache blocks
- Expose linear cache configuration through nanobind bindings
- Add comprehensive unit tests for linear block reuse scenarios

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
@VALLIS-NERIA VALLIS-NERIA requested a review from a team as a code owner March 25, 2026 07:55
coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

This pull request introduces linear attention (recurrent states) support for KV cache management. Key changes include a new MaybePlaceholderLRUEvictionPolicy for managing placeholder blocks, updated manager constructors accepting LinearAttentionMetadata, modified KVCacheIndex with null sentinel support, layer-first tensor layout options, and new APIs for copying linear attention blocks and querying token counts.

Changes

Cohort / File(s) / Summary

  • Eviction Policy Extensions
    Files: cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h, cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp
    Added MaybePlaceholderLRUEvictionPolicy class to manage placeholder blocks with negative IDs separately from primary blocks. Modified BaseEvictionPolicy::initialize and ::getFreeBlock signatures to use a blocksPerCacheLevel parameter and accept a wantPlaceholder flag. Introduced a blockIdx virtual method for ID-to-index mapping.
  • KV Cache Index Updates
    Files: cpp/include/tensorrt_llm/kernels/kvCacheIndex.h
    Added a kInvalidPoolIndex sentinel and a nullIndex static constant to represent invalid pool indices. Introduced an isNull() query method for index validation.
  • KV Cache Manager Headers
    Files: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
    Added a LinearAttentionMetadata struct and a WindowSizeType alias. Extended constructors for WindowBlockManager, BlockManager, and KVCacheManager to accept optional linearAttentionMetadata. Updated the KVCacheBlock constructor and placeholder creation to include a windowSize parameter. Changed the getBlockById return type from const& to value. Added new methods: copyLinearAttentionBlock, getTokenCount, isPoolLayerFirst, and placeholder accessors. Modified the calculateMaxNumBlocks and calculateCacheSizePerTokenForSingleWindowSize signatures to accept numKvHeadsPerLayer and sizePerHead directly instead of ModelConfig.
  • KV Cache Manager Implementation
    Files: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
    Implemented linear attention support with placeholder block pre-allocation using MaybePlaceholderLRUEvictionPolicy. Added layer-first vs block-first tensor layout handling with conditional pool indexing. Implemented detachPreviousPlaceholdersFromLookupTree() for placeholder cleanup. Extended block allocation and management to support negative (placeholder) block IDs. Updated offset computation, context/chunk storage logic, and onboard/offload flows to exclude placeholders.
  • KV Cache Transfer Manager
    Files: cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
    Added branching logic for layer-first pool layouts in DRAM mode with per-layer tensor slicing. Rejected layerFirstLayout pools for file-based offload/onboard operations via a runtime check.
  • Configuration and Validation
    Files: cpp/tensorrt_llm/executor/kvCacheConfig.cpp, cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
    Updated setMaxAttentionWindowVec validation to allow the recurrent-states sentinel value alongside positive integers. Refactored KV cache block capacity calculation to use the new calculateMaxNumBlocks signature with numKvHeadsPerLayer, sizePerHead, and tokensPerBlock parameters.
  • Python Bindings
    Files: cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
    Exposed LinearAttentionMetadata and the LinearCacheType enum to Python. Added get_token_count() and get_recurrent_states_pool() methods. Updated the calculate_max_num_blocks() signature and get_primary_pool_data() indexing behavior for layer-first layouts. Extended the KVCacheManager constructor with a linear_attention_metadata parameter.
  • Unit Tests
    Files: cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp, cpp/tests/unit_tests/batch_manager/radixBlockTreeTest.cpp
    Added comprehensive linear attention test suites (LinearAttentionContextNoReuseTest, LinearAttentionContextReuseTest, LinearAttentionDecodingBlockGrowthTest, LinearAttentionBlockCopyingTest) covering placeholder allocation, block growth, and copying behavior. Updated existing tests to set context position and pass required parameters. Updated the placeholder creation call with a windowSize argument.
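The layer-first vs block-first layouts mentioned above differ only in how a (block, layer) pair maps to a byte offset within a pool: block-first keeps all layers of one block contiguous, while layer-first keeps all blocks of one layer contiguous, which is what allows a linear-attention layer's states to be sliced out as a single region. The function names and formulas below are illustrative, not the actual pool code.

```cpp
#include <cstddef>

// Block-first: offset grows fastest along the layer axis within a block.
inline std::size_t blockFirstOffset(
    std::size_t blockIdx, std::size_t layerIdx, std::size_t numLayers, std::size_t bytesPerLayerSlice)
{
    return (blockIdx * numLayers + layerIdx) * bytesPerLayerSlice;
}

// Layer-first: offset grows fastest along the block axis within a layer,
// so layer L occupies one contiguous range of numBlocks slices.
inline std::size_t layerFirstOffset(
    std::size_t blockIdx, std::size_t layerIdx, std::size_t numBlocks, std::size_t bytesPerLayerSlice)
{
    return (layerIdx * numBlocks + blockIdx) * bytesPerLayerSlice;
}
```

This also explains the transfer-manager change: per-layer slicing is cheap in a layer-first pool (one contiguous range per layer) but scattered in a block-first pool.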

Sequence Diagram

sequenceDiagram
    participant App as Application/Request
    participant KVM as KVCacheManager
    participant BM as BlockManager
    participant LAM as LinearAttentionMetadata
    participant EV as MaybePlaceholderLRUEvictionPolicy
    participant PEV as PlaceholderInnerLRUEvictionPolicy
    participant Pool as RecurrentStatesPool

    App->>KVM: copyLinearAttentionBlock(llmRequest)
    KVM->>BM: copyLinearAttentionBlock(sequence, llmRequest)
    
    alt Need Placeholder Blocks
        BM->>EV: getFreeBlock(cacheLevel, wantPlaceholder=true)
        alt Is Placeholder Request
            EV->>PEV: getFreeBlock (delegate to placeholder policy)
            PEV->>PEV: blockIdx (map negative ID)
            PEV-->>EV: return placeholder block
        end
        EV-->>BM: BlockPtr (placeholder)
    end
    
    BM->>Pool: store/manage recurrent state blocks
    BM->>App: update generation state with copied blocks
    
    alt Block Released
        App->>EV: releaseBlock(placeholder)
        EV->>EV: getCacheLevel (returns 0 for placeholder)
        alt Is Placeholder
            EV->>PEV: releaseBlock (delegate)
            PEV->>PEV: blockIdx (map negative ID to queue index)
            PEV->>PEV: update placeholder queue
        end
    end
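The negative-ID dispatch shown in the diagram can be sketched minimally; isPlaceholderId and placeholderQueueIndex are hypothetical helper names standing in for the policy's internal blockIdx mapping.

```cpp
#include <cassert>
#include <cstdlib>

// Placeholder blocks carry negative IDs, so the "maybe placeholder" policy
// can route them to the inner placeholder policy with a simple sign check.
inline bool isPlaceholderId(int blockId)
{
    return blockId < 0;
}

// Map a negative placeholder ID onto a non-negative queue slot (e.g. -2 -> 2),
// letting the inner policy index flat arrays instead of hashing IDs.
inline int placeholderQueueIndex(int blockId)
{
    return std::abs(blockId);
}
```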

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 12.90%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check — ❓ Inconclusive. The PR description lacks the structured sections required by the template; it is missing explicit details on changes, testing rationale, and PR checklist verification. Resolution: expand the description with concrete examples of API changes, detailed test coverage rationale, and explicit verification of PR checklist items (dependencies, CODEOWNERS, documentation, design diagrams).
✅ Passed checks (1 passed)
  • Title check — ✅ Passed. The title clearly describes the main feature: adding linear attention state support for the KV cache manager, which aligns with the primary changes across multiple files.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.

The copyright year range ends at 2024, but significant modifications are being made in 2026. As per coding guidelines, update the year to reflect the latest meaningful modification.

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h` at line 2, Update
the copyright header in tensorrt_llm/batch_manager/evictionPolicy.h by changing
the year range from "2022-2024" to "2022-2026" so the header reflects the latest
2026 modifications; locate the top-of-file copyright comment and replace the end
year accordingly.
cpp/include/tensorrt_llm/kernels/kvCacheIndex.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

The copyright year is 2024, but modifications are being made in 2026.

- * Copyright (c) 2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2024-2026, NVIDIA CORPORATION.  All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/kernels/kvCacheIndex.h` at line 2, Update the
copyright header in cpp/include/tensorrt_llm/kernels/kvCacheIndex.h from 2024 to
2026 by editing the top-of-file copyright comment (the copyright line at the
file header) so it reads 2026 instead of 2024; ensure the rest of the header
text and formatting remain unchanged.
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (1)

1-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

The copyright year is 2025, but modifications are being made in 2026.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp` around lines 1 -
2, Update the SPDX header year in the file comment at the top of
kvCacheTransferManager.cpp from 2025 to 2026; locate the existing block comment
starting with "SPDX-FileCopyrightText" and change the year token so the header
reflects 2026.
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)

1-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

The SPDX copyright year is 2025, but modifications are being made in 2026.

- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/executor/kvCacheConfig.cpp` around lines 1 - 2, Update the
SPDX copyright year from 2025 to 2026 in the file header comment at the top of
kvCacheConfig.cpp (the SPDX-FileCopyrightText line) so the copyright year
reflects the current modifications.
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)

1-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year to 2026.

The copyright year is 2025, but modifications are being made in 2026.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp` around lines
1 - 2, Update the copyright year in the file header from 2025 to 2026; locate
the SPDX/ copyright block at the top of trtGptModelInflightBatching.cpp (the /*
... SPDX-FileCopyrightText ... */ comment) and change the year value to 2026 so
the file header reflects the current modification year.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Update the SPDX year range to include 2026.

This file now has 2026 changes, but the copyright header still ends at 2025.

As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 1 -
3, Update the SPDX copyright year range in the file header by changing the
SPDX-FileCopyrightText and/or SPDX-License-Identifier year range that currently
ends with "2023-2025" to include 2026 (e.g., "2023-2026"), modifying the SPDX
header comment block at the top of the file so the copyright range reflects the
new 2026 change.
🧹 Nitpick comments (7)
cpp/include/tensorrt_llm/kernels/kvCacheIndex.h (1)

39-42: Consider using static constexpr in the declaration for consistency.

The declaration uses static const but the definition on line 74 uses constexpr. While this is valid C++17+, using static constexpr in both places would be clearer and more consistent.

-    static const KVCacheIndex nullIndex;
+    static constexpr KVCacheIndex nullIndex{};

With this change, the out-of-class definition on line 74 can be removed entirely (C++17 inline constexpr).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/kernels/kvCacheIndex.h` around lines 39 - 42, The
declaration of the sentinel should be made constexpr to match its definition and
allow inline definition removal: change the declaration of KVCacheIndex
nullIndex to use "static constexpr" (i.e., static constexpr KVCacheIndex
nullIndex) in the class/struct so it becomes an inline constexpr in-header
constant, then remove the out-of-class definition of nullIndex (the existing
constexpr definition currently at line 74). This keeps consistency with
kInvalidPoolIndex and allows C++17 inline constexpr semantics for
KVCacheIndex::nullIndex.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

6944-6951: Drive the recurrent-state buffer size from the configured linear-layer metadata.

The helper comments out linearLayerIndices, but later zeroes the pool with a separate numLayers / 2 assumption. That makes the test topology implicit instead of being derived from the metadata it is supposed to validate.

Suggested direction
     LinearAttentionMetadata linearAttentionMetadata{
-        // .linearLayerIndices = {2, 5, 8, 11},
+        .linearLayerIndices = {/* linear layers under test */},
         .cacheType = linearWindowSizeCode,
         .allRecurrentStatesBytes = 440 * 1024, // dummy value
         .statesSnapshotInterval = tokensPerBlock * 2,
         .saveLastSnapshot = true,
         .numPlaceholderBlocks = blocksInPrimaryPool * 100,
     };
@@
-    auto ret = cudaMemset(poolBaseAddr, 0,
-        strideBlockId * numLayers / 2 * blocksInPrimaryPool); // half of the layers are linear attention
+    auto const numLinearLayers = /* derive from the configured linear-layer set */;
+    auto ret = cudaMemset(poolBaseAddr, 0, strideBlockId * numLinearLayers * blocksInPrimaryPool);
As per coding guidelines, "Do not use comments to disable C++ code. Use comments to explain code, not hide it."

Also applies to: 7030-7031

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 6944
- 6951, Uncomment and populate LinearAttentionMetadata::linearLayerIndices
instead of relying on an implicit numLayers/2 assumption, and change the code
that zeroes or sizes the recurrent-state buffer to derive its size from
linearLayerIndices.size() (or the metadata instance) rather than a hardcoded
numLayers/2; specifically update the LinearAttentionMetadata initializer
(linearLayerIndices) and the pool-zeroing logic that currently uses numLayers/2
so it reads metadata.linearLayerIndices.size() (or equivalent) when computing
numPlaceholderBlocks/recurrent-state buffer size.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1985-1997: Use named constant instead of magic number -1.

The literal -1 is used as a sentinel value, but this coincides with KVCacheBlock::kCachedBlocksRootId. Consider using the named constant for clarity and to avoid confusion.

Proposed fix
-            KVCacheBlock::IdType nextBlockId = -1;
+            KVCacheBlock::IdType nextBlockId = KVCacheBlock::kCachedBlocksRootId;
             BlockPtr nextBlock = nullptr;
             while (nextBlockIndex < static_cast<int>(beamBlockIds.size()))
             {
                 nextBlockId = beamBlockIds.at(nextBlockIndex);
                 nextBlock = getBlockById(nextBlockId);
                 if (nextBlock != nullptr && !nextBlock->isPlaceholder())
                 {
                     break;
                 }
                 nextBlockIndex++;
             }
-            TLLM_CHECK(nextBlockId != -1);
+            TLLM_CHECK(nextBlockId != KVCacheBlock::kCachedBlocksRootId);

As per coding guidelines, "all other literals should only be used for variable initialization."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` around lines 1985 - 1997,
Replace the magic sentinel -1 with the named constant
KVCacheBlock::kCachedBlocksRootId: initialize nextBlockId to
KVCacheBlock::kCachedBlocksRootId (instead of -1), use that constant in the
loop/condition checks (e.g., in the TLLM_CHECK) and keep the rest of the logic
(nextBlockIndex, getBlockById, isPlaceholder) unchanged so the sentinel meaning
is explicit.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)

282-282: Clarify the relationship between window size sentinel values.

The constructor default windowSize = -1 introduces a third sentinel value distinct from:

  • std::numeric_limits<int>::max() (unattached sentinel per lines 435-437)
  • kRecurrentStates (0x80000001 ≈ -2147483647)

Consider adding a named constant for this default value to clarify intent and prevent magic number usage.

♻️ Suggested improvement
+    // Sentinel value for unspecified window size in constructor
+    static constexpr SizeType32 kUnspecifiedWindowSize = -1;
+
-    explicit KVCacheBlock(IdType blockId, kernels::KVCacheIndex blockIdx, SizeType32 windowSize = -1);
+    explicit KVCacheBlock(IdType blockId, kernels::KVCacheIndex blockIdx, SizeType32 windowSize = kUnspecifiedWindowSize);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` at line 282, The
constructor KVCacheBlock(IdType blockId, kernels::KVCacheIndex blockIdx,
SizeType32 windowSize = -1) uses a magic sentinel (-1) that is distinct from
std::numeric_limits<int>::max() and kRecurrentStates; replace the literal with a
named constexpr (e.g., kUnspecifiedWindowSize or kDefaultWindowSize) declared
near the other sentinels and update the constructor signature to use that
constant and add a short comment explaining its meaning to clarify intent and
avoid magic numbers.

209-209: Remove redundant type alias.

This line redefines SizeType32 which was already defined at line 74 as tensorrt_llm::runtime::SizeType32. Since WindowSizeType (line 86) is itself an alias to SizeType32, this creates a circular chain that resolves to the same type. The redefinition adds no value and harms readability.

♻️ Suggested removal
-using SizeType32 = WindowSizeType;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` at line 209, Remove
the redundant alias "using SizeType32 = WindowSizeType;" — WindowSizeType is
already an alias to the original tensorrt_llm::runtime::SizeType32, so delete
this redefinition to avoid the circular/duplicate alias; keep the original
tensorrt_llm::runtime::SizeType32 definition and the WindowSizeType alias as-is,
then rebuild to ensure no other duplication-dependent code breaks.

1546-1549: Add precondition check or document caller responsibility.

getRecurrentStatesPool() calls .at(kRecurrentStates) which will throw std::out_of_range if no linear attention manager is configured. Consider either:

  1. Adding a precondition check with a meaningful error message, or
  2. Documenting that callers must verify linear attention is enabled first
♻️ Option 1: Add precondition check
 [[nodiscard]] KVCacheBlockPool const& getRecurrentStatesPool() const
 {
+    TLLM_CHECK_WITH_INFO(mLinearAttentionMetadata.has_value() && 
+        mLinearAttentionMetadata->hasRecurrentStatesCache(),
+        "getRecurrentStatesPool() called but no recurrent states manager is configured");
     return mWindowBlockManagers.at(LinearAttentionMetadata::LinearCacheType::kRecurrentStates).getPool(0);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` around lines 1546 -
1549, getRecurrentStatesPool currently calls
mWindowBlockManagers.at(LinearAttentionMetadata::LinearCacheType::kRecurrentStates)
which can throw std::out_of_range if the linear-attention manager isn't
configured; add a clear precondition check before the .at() access (e.g., verify
mWindowBlockManagers contains the kRecurrentStates key or a boolean flag like
isLinearAttentionEnabled) and throw or log a descriptive error (or use assert
with a message) describing that linear attention must be enabled, or
alternatively update the function's documentation/comment to state that callers
must ensure linear attention is enabled before calling getRecurrentStatesPool.
Ensure references to getRecurrentStatesPool, mWindowBlockManagers and
LinearAttentionMetadata::LinearCacheType::kRecurrentStates are used so reviewers
can locate the change.

1120-1123: Consider encapsulating placeholder block indexing.

The indexing scheme (using abs(blockId) with indices 0-1 unused) is well-documented but could be error-prone if accessed directly. Consider adding a private accessor method like getPlaceholderById(IdType blockId) to encapsulate this logic and prevent off-by-one errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` around lines 1120 -
1123, Add a private accessor to encapsulate the placeholder indexing logic:
implement a method (e.g. getPlaceholderById(IdType blockId) or BlockPtr
getPlaceholderById(int blockId)) that takes the possibly-negative blockId,
computes the correct index via std::abs(blockId), validates that index >= 2 and
< mAllPlaceholderBlocksById.size(), and returns the BlockPtr (or nullptr/throws
on invalid); update all direct uses of mAllPlaceholderBlocksById[...] in the
kvCacheManager class to call getPlaceholderById(...) instead to centralize
bounds checks and eliminate off-by-one errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Line 479: Rename the misspelled local variable "slibings" to "siblings" where
it's declared from current->getNextBlocks() and update all subsequent usages in
the surrounding function in kvCacheManager.cpp (e.g., any references immediately
after the declaration that iterate or index into slibings) to use "siblings"
instead so the identifier matches spelling and avoids compiler errors.
- Around line 643-649: The subtraction result for numPlaceholderBlocks is being
overwritten by an incorrect std::max call; update the clamp so
numPlaceholderBlocks keeps the difference between fullPrimaryBlocks and
allottedPrimaryBlocks but is floored at zero (i.e., replace the std::max call
that currently uses fullPrimaryBlocks with one that uses 0) to ensure
numPlaceholderBlocks = max(fullPrimaryBlocks - allottedPrimaryBlocks, 0) when
retrieving values from blocksPerWindow in kvCacheManager (references:
numPlaceholderBlocks, fullPrimaryBlocks, allottedPrimaryBlocks,
blocksPerWindow).

In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Around line 6842-6861: The test constructs a KVCacheManager but never
initializes its pools before calling addSequence(), causing the decoding-growth
path to start from an uninitialized state; update the test around the
KVCacheManager instantiation (the constructor call in kvCacheManagerTest.cpp
that creates kvCacheManager) to call kvCacheManager.allocatePools(false)
immediately after construction (and likewise for the similar spot at the later
test around lines 6872-6873) so the pools are allocated before any
addSequence()/decoding operations.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 82f07e50-d349-47cc-a006-2fd1307fd438

📥 Commits

Reviewing files that changed from the base of the PR and between 2b5c434 and 80b9f3d.

📒 Files selected for processing (11)
  • cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/include/tensorrt_llm/kernels/kvCacheIndex.h
  • cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
  • cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
  • cpp/tensorrt_llm/executor/kvCacheConfig.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/batch_manager/radixBlockTreeTest.cpp

…linear_reuse

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
@VALLIS-NERIA VALLIS-NERIA changed the title [TRTLLM-10061][feat] Add linear state block reuse support in KV cache manager [TRTLLM-10061][feat] Add support of mamba state for C++ KV cache manager Mar 25, 2026
@VALLIS-NERIA VALLIS-NERIA changed the title [TRTLLM-10061][feat] Add support of mamba state for C++ KV cache manager [TRTLLM-10061][feat] Add support of linear attention state for C++ KV cache manager Mar 25, 2026
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
@VALLIS-NERIA
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #40297 [ run ] triggered by Bot. Commit: 67b19eb Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40297 [ run ] completed with state SUCCESS. Commit: 67b19eb
/LLM/main/L0_MergeRequest_PR pipeline #31409 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
