[TRTLLM-10061][feat] Add support of linear attention state for C++ KV cache manager#12531
VALLIS-NERIA wants to merge 3 commits into NVIDIA:main from
… manager

Add block reuse support for hybrid linear (Mamba) models in the C++ KV cache manager. This enables prefix caching for recurrent state blocks alongside standard KV cache blocks, allowing shared prefixes to be reused across requests.

Key changes:
- Extend KVCacheManager and BlockManager to handle linear state blocks with configurable tokens-per-block for recurrent layers
- Add eviction policy support for linear cache blocks
- Expose linear cache configuration through nanobind bindings
- Add comprehensive unit tests for linear block reuse scenarios

Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
📝 Walkthrough
This pull request introduces linear attention (recurrent states) support for KV cache management. Key changes include a new …
Changes
Sequence Diagram
sequenceDiagram
participant App as Application/Request
participant KVM as KVCacheManager
participant BM as BlockManager
participant LAM as LinearAttentionMetadata
participant EV as MaybePlaceholderLRUEvictionPolicy
participant PEV as PlaceholderInnerLRUEvictionPolicy
participant Pool as RecurrentStatesPool
App->>KVM: copyLinearAttentionBlock(llmRequest)
KVM->>BM: copyLinearAttentionBlock(sequence, llmRequest)
alt Need Placeholder Blocks
BM->>EV: getFreeBlock(cacheLevel, wantPlaceholder=true)
alt Is Placeholder Request
EV->>PEV: getFreeBlock (delegate to placeholder policy)
PEV->>PEV: blockIdx (map negative ID)
PEV-->>EV: return placeholder block
end
EV-->>BM: BlockPtr (placeholder)
end
BM->>Pool: store/manage recurrent state blocks
BM->>App: update generation state with copied blocks
alt Block Released
App->>EV: releaseBlock(placeholder)
EV->>EV: getCacheLevel (returns 0 for placeholder)
alt Is Placeholder
EV->>PEV: releaseBlock (delegate)
PEV->>PEV: blockIdx (map negative ID to queue index)
PEV->>PEV: update placeholder queue
end
end
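The delegation shown in the diagram can be sketched as follows. This is a minimal illustration, not the PR's implementation: `LruQueue` and `DelegatingPolicy` are hypothetical stand-ins for `PlaceholderInnerLRUEvictionPolicy` and `MaybePlaceholderLRUEvictionPolicy`, and the sketch assumes that placeholder blocks are exactly those with negative IDs.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <optional>

using BlockId = std::int32_t;

// Hypothetical plain LRU free queue: the least recently released block is handed out first.
class LruQueue
{
public:
    void release(BlockId id)
    {
        mFree.push_back(id);
    }

    std::optional<BlockId> acquire()
    {
        if (mFree.empty())
        {
            return std::nullopt;
        }
        BlockId const id = mFree.front();
        mFree.pop_front();
        return id;
    }

private:
    std::list<BlockId> mFree;
};

// Hypothetical outer policy: routes placeholder blocks (assumed to have negative IDs)
// to a dedicated inner queue and every other block to the regular queue.
class DelegatingPolicy
{
public:
    static bool isPlaceholder(BlockId id)
    {
        return id < 0;
    }

    void releaseBlock(BlockId id)
    {
        (isPlaceholder(id) ? mPlaceholders : mRegular).release(id);
    }

    std::optional<BlockId> getFreeBlock(bool wantPlaceholder)
    {
        return (wantPlaceholder ? mPlaceholders : mRegular).acquire();
    }

private:
    LruQueue mRegular;
    LruQueue mPlaceholders;
};
```

Keeping the two queues separate means a request for a placeholder can never evict a block that holds real state, and vice versa.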
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year to include 2026.
The copyright year range ends at 2024, but significant modifications are being made in 2026. As per coding guidelines, update the year to reflect the latest meaningful modification.
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h` at line 2, Update the copyright header in tensorrt_llm/batch_manager/evictionPolicy.h by changing the year range from "2022-2024" to "2022-2026" so the header reflects the latest 2026 modifications; locate the top-of-file copyright comment and replace the end year accordingly.
cpp/include/tensorrt_llm/kernels/kvCacheIndex.h (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
The copyright year is 2024, but modifications are being made in 2026.
- * Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2024-2026, NVIDIA CORPORATION. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/kernels/kvCacheIndex.h` at line 2, Update the copyright header in cpp/include/tensorrt_llm/kernels/kvCacheIndex.h from 2024 to 2026 by editing the top-of-file copyright comment (the copyright line at the file header) so it reads 2026 instead of 2024; ensure the rest of the header text and formatting remain unchanged.
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp (1)
1-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
The copyright year is 2025, but modifications are being made in 2026.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp` around lines 1 - 2, Update the SPDX header year in the file comment at the top of kvCacheTransferManager.cpp from 2025 to 2026; locate the existing block comment starting with "SPDX-FileCopyrightText" and change the year token so the header reflects 2026.
cpp/tensorrt_llm/executor/kvCacheConfig.cpp (1)
1-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
The SPDX copyright year is 2025, but modifications are being made in 2026.
- * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/executor/kvCacheConfig.cpp` around lines 1 - 2, Update the SPDX copyright year from 2025 to 2026 in the file header comment at the top of kvCacheConfig.cpp (the SPDX-FileCopyrightText line) so the copyright year reflects the current modifications.
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp (1)
1-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year to 2026.
The copyright year is 2025, but modifications are being made in 2026.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp` around lines 1 - 2, Update the copyright year in the file header from 2025 to 2026; locate the SPDX/ copyright block at the top of trtGptModelInflightBatching.cpp (the /* ... SPDX-FileCopyrightText ... */ comment) and change the year value to 2026 so the file header reflects the current modification year.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
1-3: ⚠️ Potential issue | 🟡 Minor
Update the SPDX year range to include 2026.
This file now has 2026 changes, but the copyright header still ends at 2025.
As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 1 - 3, Update the SPDX copyright year range in the file header by changing the SPDX-FileCopyrightText and/or SPDX-License-Identifier year range that currently ends with "2023-2025" to include 2026 (e.g., "2023-2026"), modifying the SPDX header comment block at the top of the file so the copyright range reflects the new 2026 change.
🧹 Nitpick comments (7)
cpp/include/tensorrt_llm/kernels/kvCacheIndex.h (1)
39-42: Consider using `static constexpr` in the declaration for consistency.
The declaration uses `static const` but the definition on line 74 uses `constexpr`. While this is valid C++17+, using `static constexpr` in both places would be clearer and more consistent.
- static const KVCacheIndex nullIndex;
+ static constexpr KVCacheIndex nullIndex{};
With this change, the out-of-class definition on line 74 can be removed entirely (C++17 inline constexpr).
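One thing to verify before applying this suggestion: a class cannot contain a `static constexpr` member of its own type that is initialized in-class, because the class is still incomplete at that point. The usual workaround is the pattern the code already seems to follow: declare the member `static const` and define it `constexpr` after the class. A minimal sketch, where `Index` is a hypothetical stand-in for `KVCacheIndex`:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for kernels::KVCacheIndex; only the sentinel pattern matters here.
class Index
{
public:
    constexpr Index() = default;

    constexpr explicit Index(std::int32_t value)
        : mValue{value}
    {
    }

    constexpr std::int32_t get() const
    {
        return mValue;
    }

    // Declared const only: Index is incomplete inside its own definition, so
    // `static constexpr Index nullIndex{};` would be rejected here.
    static const Index nullIndex;

private:
    std::int32_t mValue{0};
};

// Defined constexpr once the class is complete. `inline` (C++17) allows the
// definition to live in a header included from several translation units.
inline constexpr Index Index::nullIndex{};

static_assert(Index::nullIndex.get() == 0, "sentinel is usable in constant expressions");
```

The sentinel value `0` is an assumption of the sketch; the real `nullIndex` value is whatever the header defines.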
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/kernels/kvCacheIndex.h` around lines 39 - 42, The declaration of the sentinel should be made constexpr to match its definition and allow inline definition removal: change the declaration of KVCacheIndex nullIndex to use "static constexpr" (i.e., static constexpr KVCacheIndex nullIndex) in the class/struct so it becomes an inline constexpr in-header constant, then remove the out-of-class definition of nullIndex (the existing constexpr definition currently at line 74). This keeps consistency with kInvalidPoolIndex and allows C++17 inline constexpr semantics for KVCacheIndex::nullIndex.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
6944-6951: Drive the recurrent-state buffer size from the configured linear-layer metadata.
The helper comments out `linearLayerIndices`, but later zeroes the pool with a separate `numLayers / 2` assumption. That makes the test topology implicit instead of being derived from the metadata it is supposed to validate.
As per coding guidelines, "Do not use comments to disable C++ code. Use comments to explain code, not hide it."
Suggested direction
     LinearAttentionMetadata linearAttentionMetadata{
-        // .linearLayerIndices = {2, 5, 8, 11},
+        .linearLayerIndices = {/* linear layers under test */},
         .cacheType = linearWindowSizeCode,
         .allRecurrentStatesBytes = 440 * 1024, // dummy value
         .statesSnapshotInterval = tokensPerBlock * 2,
         .saveLastSnapshot = true,
         .numPlaceholderBlocks = blocksInPrimaryPool * 100,
     };
@@
-    auto ret = cudaMemset(poolBaseAddr, 0,
-        strideBlockId * numLayers / 2 * blocksInPrimaryPool); // half of the layers are linear attention
+    auto const numLinearLayers = /* derive from the configured linear-layer set */;
+    auto ret = cudaMemset(poolBaseAddr, 0, strideBlockId * numLinearLayers * blocksInPrimaryPool);
Also applies to: 7030-7031
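A sketch of how the byte count could be derived from the metadata rather than from `numLayers / 2`. `LinearMetadataSketch` and `recurrentPoolBytes` are hypothetical names, and the formula simply mirrors the reviewed `cudaMemset` call (one stride per linear layer per block):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, trimmed-down mirror of the test metadata; only the field used here.
struct LinearMetadataSketch
{
    std::vector<int> linearLayerIndices;
};

// Number of bytes to zero in the recurrent-state pool, derived from the configured
// linear-layer set instead of a hard-coded "half of the layers" assumption.
inline std::size_t recurrentPoolBytes(
    LinearMetadataSketch const& metadata, std::size_t strideBlockId, std::size_t blocksInPrimaryPool)
{
    return strideBlockId * metadata.linearLayerIndices.size() * blocksInPrimaryPool;
}
```

With this shape, changing the set of linear layers in the test changes the zeroed region automatically.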
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 6944 - 6951, Uncomment and populate LinearAttentionMetadata::linearLayerIndices instead of relying on an implicit numLayers/2 assumption, and change the code that zeroes or sizes the recurrent-state buffer to derive its size from linearLayerIndices.size() (or the metadata instance) rather than a hardcoded numLayers/2; specifically update the LinearAttentionMetadata initializer (linearLayerIndices) and the pool-zeroing logic that currently uses numLayers/2 so it reads metadata.linearLayerIndices.size() (or equivalent) when computing numPlaceholderBlocks/recurrent-state buffer size.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1985-1997: Use named constant instead of magic number `-1`.
The literal `-1` is used as a sentinel value, but this coincides with `KVCacheBlock::kCachedBlocksRootId`. Consider using the named constant for clarity and to avoid confusion.
Proposed fix
-    KVCacheBlock::IdType nextBlockId = -1;
+    KVCacheBlock::IdType nextBlockId = KVCacheBlock::kCachedBlocksRootId;
     BlockPtr nextBlock = nullptr;
     while (nextBlockIndex < static_cast<int>(beamBlockIds.size()))
     {
         nextBlockId = beamBlockIds.at(nextBlockIndex);
         nextBlock = getBlockById(nextBlockId);
         if (nextBlock != nullptr && !nextBlock->isPlaceholder())
         {
             break;
         }
         nextBlockIndex++;
     }
-    TLLM_CHECK(nextBlockId != -1);
+    TLLM_CHECK(nextBlockId != KVCacheBlock::kCachedBlocksRootId);
As per coding guidelines, "all other literals should only be used for variable initialization."
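For illustration, the search loop can be written so that the sentinel is returned only when no real block exists. This is a sketch under assumptions, not the PR's code: `kNoBlockId` is a hypothetical dedicated constant (the review proposes reusing `KVCacheBlock::kCachedBlocksRootId` instead), and placeholder blocks are assumed to carry IDs below the sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using IdType = std::int32_t;

// Hypothetical named sentinel meaning "no block found".
constexpr IdType kNoBlockId = -1;

// Assumption for this sketch only: placeholder blocks carry IDs below the sentinel.
inline bool isPlaceholder(IdType id)
{
    return id < kNoBlockId;
}

// Returns the first non-placeholder block ID at or after `start`,
// or kNoBlockId when every remaining block is a placeholder.
inline IdType nextRealBlock(std::vector<IdType> const& beamBlockIds, std::size_t start)
{
    for (std::size_t i = start; i < beamBlockIds.size(); ++i)
    {
        if (!isPlaceholder(beamBlockIds[i]))
        {
            return beamBlockIds[i];
        }
    }
    return kNoBlockId;
}
```

Returning the sentinel from a single place also makes the "all placeholders" case checkable, which a loop that overwrites the ID on every iteration may not guarantee.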
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` around lines 1985 - 1997, Replace the magic sentinel -1 with the named constant KVCacheBlock::kCachedBlocksRootId: initialize nextBlockId to KVCacheBlock::kCachedBlocksRootId (instead of -1), use that constant in the loop/condition checks (e.g., in the TLLM_CHECK) and keep the rest of the logic (nextBlockIndex, getBlockById, isPlaceholder) unchanged so the sentinel meaning is explicit.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
282-282: Clarify the relationship between window size sentinel values.
The constructor default `windowSize = -1` introduces a third sentinel value distinct from:
- `std::numeric_limits<int>::max()` (unattached sentinel per lines 435-437)
- `kRecurrentStates` (0x80000001 ≈ -2147483647)
Consider adding a named constant for this default value to clarify intent and prevent magic number usage.
♻️ Suggested improvement
+    // Sentinel value for unspecified window size in constructor
+    static constexpr SizeType32 kUnspecifiedWindowSize = -1;
+
-    explicit KVCacheBlock(IdType blockId, kernels::KVCacheIndex blockIdx, SizeType32 windowSize = -1);
+    explicit KVCacheBlock(IdType blockId, kernels::KVCacheIndex blockIdx, SizeType32 windowSize = kUnspecifiedWindowSize);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` at line 282, The constructor KVCacheBlock(IdType blockId, kernels::KVCacheIndex blockIdx, SizeType32 windowSize = -1) uses a magic sentinel (-1) that is distinct from std::numeric_limits<int>::max() and kRecurrentStates; replace the literal with a named constexpr (e.g., kUnspecifiedWindowSize or kDefaultWindowSize) declared near the other sentinels and update the constructor signature to use that constant and add a short comment explaining its meaning to clarify intent and avoid magic numbers.
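The three sentinels mentioned in this comment can be laid out side by side to check that they never collide. In the sketch below only `kRecurrentStates` is a name taken from the review; `kUnspecifiedWindowSize` is the reviewer's proposed name, `kUnattachedWindowSize`, `WindowKind`, and `classify` are hypothetical, and `SizeType32` is assumed to be a 32-bit signed integer.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using SizeType32 = std::int32_t;

// Constructor default proposed in the review.
constexpr SizeType32 kUnspecifiedWindowSize = -1;
// Hypothetical name for the "unattached" sentinel, std::numeric_limits<int>::max().
constexpr SizeType32 kUnattachedWindowSize = std::numeric_limits<SizeType32>::max();
// Bit pattern 0x80000001 interpreted as a signed 32-bit value.
constexpr SizeType32 kRecurrentStates = std::numeric_limits<SizeType32>::min() + 1;

static_assert(kRecurrentStates == -2147483647, "0x80000001 as a signed 32-bit value");
static_assert(static_cast<std::uint32_t>(kRecurrentStates) == 0x80000001u, "same bit pattern");
static_assert(kUnspecifiedWindowSize != kRecurrentStates && kUnspecifiedWindowSize != kUnattachedWindowSize
        && kRecurrentStates != kUnattachedWindowSize,
    "sentinels must be pairwise distinct");

enum class WindowKind
{
    kUnspecified,
    kUnattached,
    kRecurrent,
    kRegular
};

// Classifies a window size; anything that is not a sentinel is treated as a regular window.
constexpr WindowKind classify(SizeType32 windowSize)
{
    if (windowSize == kUnspecifiedWindowSize)
    {
        return WindowKind::kUnspecified;
    }
    if (windowSize == kUnattachedWindowSize)
    {
        return WindowKind::kUnattached;
    }
    if (windowSize == kRecurrentStates)
    {
        return WindowKind::kRecurrent;
    }
    return WindowKind::kRegular;
}
```

Naming all three in one place lets a `static_assert` catch an accidental overlap at compile time.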
209-209: Remove redundant type alias.
This line redefines `SizeType32` which was already defined at line 74 as `tensorrt_llm::runtime::SizeType32`. Since `WindowSizeType` (line 86) is itself an alias to `SizeType32`, this creates a circular chain that resolves to the same type. The redefinition adds no value and harms readability.
♻️ Suggested removal
-using SizeType32 = WindowSizeType;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` at line 209, Remove the redundant alias "using SizeType32 = WindowSizeType;" — WindowSizeType is already an alias to the original tensorrt_llm::runtime::SizeType32, so delete this redefinition to avoid the circular/duplicate alias; keep the original tensorrt_llm::runtime::SizeType32 definition and the WindowSizeType alias as-is, then rebuild to ensure no other duplication-dependent code breaks.
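A small sketch of why the re-declaration is a no-op, assuming all three aliases live at the same namespace scope (re-declaring an alias to the type it already names is permitted there, but not inside a class). The namespaces `runtime_sketch` and `manager_sketch` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

namespace runtime_sketch
{
using SizeType32 = std::int32_t; // stand-in for tensorrt_llm::runtime::SizeType32
} // namespace runtime_sketch

namespace manager_sketch
{
using SizeType32 = runtime_sketch::SizeType32; // first alias (line 74 in the review)
using WindowSizeType = SizeType32;             // window-size alias (line 86)
using SizeType32 = WindowSizeType;             // flagged re-declaration: legal, changes nothing

// True because every alias in the chain names the same underlying type.
constexpr bool kSameType = std::is_same_v<SizeType32, runtime_sketch::SizeType32>
    && std::is_same_v<WindowSizeType, runtime_sketch::SizeType32>;

static_assert(kSameType, "all aliases resolve to one type");
} // namespace manager_sketch
```

Since the chain resolves to a single type either way, deleting the third alias should not affect any code that uses `SizeType32`.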
1546-1549: Add precondition check or document caller responsibility.
`getRecurrentStatesPool()` calls `.at(kRecurrentStates)` which will throw `std::out_of_range` if no linear attention manager is configured. Consider either:
- Adding a precondition check with a meaningful error message, or
- Documenting that callers must verify linear attention is enabled first
♻️ Option 1: Add precondition check
     [[nodiscard]] KVCacheBlockPool const& getRecurrentStatesPool() const
     {
+        TLLM_CHECK_WITH_INFO(mLinearAttentionMetadata.has_value() &&
+                             mLinearAttentionMetadata->hasRecurrentStatesCache(),
+            "getRecurrentStatesPool() called but no recurrent states manager is configured");
         return mWindowBlockManagers.at(LinearAttentionMetadata::LinearCacheType::kRecurrentStates).getPool(0);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` around lines 1546 - 1549, getRecurrentStatesPool currently calls mWindowBlockManagers.at(LinearAttentionMetadata::LinearCacheType::kRecurrentStates) which can throw std::out_of_range if the linear-attention manager isn't configured; add a clear precondition check before the .at() access (e.g., verify mWindowBlockManagers contains the kRecurrentStates key or a boolean flag like isLinearAttentionEnabled) and throw or log a descriptive error (or use assert with a message) describing that linear attention must be enabled, or alternatively update the function's documentation/comment to state that callers must ensure linear attention is enabled before calling getRecurrentStatesPool. Ensure references to getRecurrentStatesPool, mWindowBlockManagers and LinearAttentionMetadata::LinearCacheType::kRecurrentStates are used so reviewers can locate the change.
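The difference between a bare `.at()` and an explicit precondition can be illustrated with standard-library types only. `PoolSketch` and `BlockManagerSketch` are hypothetical, and `std::logic_error` stands in for whatever `TLLM_CHECK_WITH_INFO` raises in the real code base.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Key under which the recurrent-state manager is assumed to be registered.
constexpr std::int32_t kRecurrentStates = std::numeric_limits<std::int32_t>::min() + 1;

struct PoolSketch
{
    std::string name;
};

class BlockManagerSketch
{
public:
    void addPool(std::int32_t windowSize, PoolSketch pool)
    {
        mPools.emplace(windowSize, std::move(pool));
    }

    // Looks the pool up explicitly so a missing entry produces a message that names
    // the violated precondition instead of a generic std::out_of_range from at().
    PoolSketch const& getRecurrentStatesPool() const
    {
        auto const it = mPools.find(kRecurrentStates);
        if (it == mPools.end())
        {
            throw std::logic_error(
                "getRecurrentStatesPool() called but no recurrent states manager is configured");
        }
        return it->second;
    }

private:
    std::map<std::int32_t, PoolSketch> mPools;
};
```

Either way the call fails when linear attention is not configured; the explicit check only changes how quickly the cause can be identified from the error.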
1120-1123: Consider encapsulating placeholder block indexing.
The indexing scheme (using `abs(blockId)` with indices 0-1 unused) is well-documented but could be error-prone if accessed directly. Consider adding a private accessor method like `getPlaceholderById(IdType blockId)` to encapsulate this logic and prevent off-by-one errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h` around lines 1120 - 1123, Add a private accessor to encapsulate the placeholder indexing logic: implement a method (e.g. getPlaceholderById(IdType blockId) or BlockPtr getPlaceholderById(int blockId)) that takes the possibly-negative blockId, computes the correct index via std::abs(blockId), validates that index >= 2 and < mAllPlaceholderBlocksById.size(), and returns the BlockPtr (or nullptr/throws on invalid); update all direct uses of mAllPlaceholderBlocksById[...] in the kvCacheManager class to call getPlaceholderById(...) instead to centralize bounds checks and eliminate off-by-one errors.
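A possible shape for such an accessor, written as a standalone sketch. `BlockSketch` and `PlaceholderTable` are hypothetical; the sketch negates the ID instead of calling `std::abs`, which is a deliberate deviation so that positive (non-placeholder) IDs are rejected rather than silently mapped onto a placeholder slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

struct BlockSketch
{
    std::int32_t id;
};

using BlockPtr = std::shared_ptr<BlockSketch>;

class PlaceholderTable
{
public:
    explicit PlaceholderTable(std::int32_t numPlaceholders)
    {
        // Slots 0 and 1 stay empty so that the negated block ID indexes the table directly.
        mBlocks.resize(static_cast<std::size_t>(numPlaceholders) + kFirstIndex);
        for (std::size_t i = kFirstIndex; i < mBlocks.size(); ++i)
        {
            mBlocks[i] = std::make_shared<BlockSketch>(BlockSketch{-static_cast<std::int32_t>(i)});
        }
    }

    // Returns the placeholder with the given (negative) ID, or nullptr if the ID is out of range.
    BlockPtr getPlaceholderById(std::int32_t blockId) const
    {
        // Widen before negating so the most negative ID cannot overflow.
        auto const index = -static_cast<std::int64_t>(blockId);
        if (index < static_cast<std::int64_t>(kFirstIndex) || index >= static_cast<std::int64_t>(mBlocks.size()))
        {
            return nullptr;
        }
        return mBlocks[static_cast<std::size_t>(index)];
    }

private:
    static constexpr std::size_t kFirstIndex = 2;
    std::vector<BlockPtr> mBlocks;
};
```

Whether an invalid ID should return `nullptr` or trip a check is a design choice left open by the review; the sketch picks `nullptr` only to keep the example testable.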
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Line 479: Rename the misspelled local variable "slibings" to "siblings" where
it's declared from current->getNextBlocks() and update all subsequent usages in
the surrounding function in kvCacheManager.cpp (e.g., any references immediately
after the declaration that iterate or index into slibings) to use "siblings"
instead so the identifier matches spelling and avoids compiler errors.
- Around line 643-649: The subtraction result for numPlaceholderBlocks is being
overwritten by an incorrect std::max call; update the clamp so
numPlaceholderBlocks keeps the difference between fullPrimaryBlocks and
allottedPrimaryBlocks but is floored at zero (i.e., replace the std::max call
that currently uses fullPrimaryBlocks with one that uses 0) to ensure
numPlaceholderBlocks = max(fullPrimaryBlocks - allottedPrimaryBlocks, 0) when
retrieving values from blocksPerWindow in kvCacheManager (references:
numPlaceholderBlocks, fullPrimaryBlocks, allottedPrimaryBlocks,
blocksPerWindow).
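The clamp described for `numPlaceholderBlocks` can be sketched as a small helper. The function name `placeholderBlockCount` is hypothetical; the variable names follow the comment above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using SizeType32 = std::int32_t;

// Number of placeholder blocks needed to cover the gap between the full block count
// and what was actually allotted, floored at zero: max(full - allotted, 0).
inline SizeType32 placeholderBlockCount(SizeType32 fullPrimaryBlocks, SizeType32 allottedPrimaryBlocks)
{
    return std::max<SizeType32>(fullPrimaryBlocks - allottedPrimaryBlocks, 0);
}
```

Passing `fullPrimaryBlocks` as the second argument of `std::max` instead of `0` would always return at least `fullPrimaryBlocks`, which discards the subtraction.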
In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Around line 6842-6861: The test constructs a KVCacheManager but never
initializes its pools before calling addSequence(), causing the decoding-growth
path to start from an uninitialized state; update the test around the
KVCacheManager instantiation (the constructor call in kvCacheManagerTest.cpp
that creates kvCacheManager) to call kvCacheManager.allocatePools(false)
immediately after construction (and likewise for the similar spot at the later
test around lines 6872-6873) so the pools are allocated before any
addSequence()/decoding operations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 82f07e50-d349-47cc-a006-2fd1307fd438
📒 Files selected for processing (11)
cpp/include/tensorrt_llm/batch_manager/evictionPolicy.h
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/include/tensorrt_llm/kernels/kvCacheIndex.h
cpp/tensorrt_llm/batch_manager/evictionPolicy.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/kvCacheTransferManager.cpp
cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp
cpp/tensorrt_llm/executor/kvCacheConfig.cpp
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
cpp/tests/unit_tests/batch_manager/radixBlockTreeTest.cpp
…linear_reuse Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
Signed-off-by: Xiwen Yu <13230610+VALLIS-NERIA@users.noreply.github.com>
/bot run
PR_Github #40297 [ run ] triggered by Bot. Commit:
PR_Github #40297 [ run ] completed with state
This is part of TRTLLM-10061. This PR only adds the feature to the KV cache manager; it does not integrate it with the runtime or any model. Runtime and model integration will be done in a separate PR.
Summary
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes