
Conversation


@axxx03 axxx03 commented Dec 4, 2025

Description

FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching to enable inference engines to achieve higher throughput and lower latency.

In our case, when integrated with FlexKV, we can achieve the following improvement:

  • ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.
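For reference, a minimal sketch of how FlexKV is enabled and wired up (the environment flag comes from the PyExecutor changes in this PR, and the connector settings mirror the kv_connector_config from the benchmark setup shown below; the exact value expected for the flag is an assumption here):

import os

# Opt-in flag read by PyExecutor in this PR (assumed to be enabled with "1").
os.environ["TENSORRT_LLM_USE_FLEXKV"] = "1"

# Connector settings, matching the kv_connector_config passed via
# --extra_llm_api_options in the benchmark setup below.
flexkv_connector_config = {
    "connector_module": "flexkv.integration.tensorrt_llm.trtllm_adapter",
    "connector_scheduler_class": "FlexKVSchedulerConnector",
    "connector_worker_class": "FlexKVWorkerConnector",
}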

Co-authored-by: peaceforeverCN

@axxx03 axxx03 requested review from a team as code owners December 4, 2025 07:28
@axxx03 axxx03 requested a review from shaharmor98 December 4, 2025 07:28

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

Adds KV-cache connector matched token tracking across C++ and Python APIs via new accessors and methods on LlmRequest. Introduces FlexKV feature support in PyExecutor with resource management integration. Extends Python bindings to expose new properties. Adds helper function for test configuration and test coverage for connector token matching.

Changes

  • C++ Core API: LlmRequest (cpp/include/tensorrt_llm/batch_manager/llmRequest.h): Added getNumConnectorMatchedTokens() and setNumConnectorMatchedTokens(SizeType32) accessors and an isFinishedNormal() method; introduced the private member mNumConnectorMatchedTokens for block reuse tracking.
  • C++ Implementation: KV Cache Manager (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Propagates numConnectorMatchedTokens into LlmRequest state via setNumConnectorMatchedTokens() when a KV cache connector is present.
  • Python Bindings (cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp, cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp): Exposed the read-only properties is_finished_normal and num_connector_matched_tokens to Python for GenLlmReq.
  • Python Executor: FlexKV Support (tensorrt_llm/_torch/pyexecutor/py_executor.py): Added a FlexKV feature flag read from the environment variable TENSORRT_LLM_USE_FLEXKV; implemented the _wait_for_flexkv_manager() and _kv_connector_refresh_unfinished_tasks() helpers; integrated the FlexKV manager readiness check and task refresh logic into the main executor loops; added FlexKV slot cleanup on request completion; tightened numeric formatting to 3 decimal places.
  • Python Executor Setup (tensorrt_llm/_torch/pyexecutor/py_executor_creator.py): Modified _get_mapping() to assign the created Mapping object to executor_config.mapping when _mapping is None.
  • Resource Management (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Added a free_slot_only() method to free only the slot associated with a request.
  • Test Configuration & Coverage (tests/integration/defs/conftest.py, tests/integration/defs/llmapi/test_llm_api_connector.py): Added an llm_model_name() helper to resolve the model from the environment or default to "Qwen2-0.5B"; updated test setup to use a dynamic model path; added the parametrized test test_connector_num_matched_tokens (with threadleak disabled) to verify connector matched token tracking across [0, 32] values.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • py_executor.py: Review FlexKV initialization flow, manager readiness logic, task refresh invocation patterns, and resource cleanup paths for correctness and ordering.
  • kvCacheManager.cpp: Verify that setNumConnectorMatchedTokens() is called at the correct point in the cache population flow.
  • bindings.cpp (both nanobind and pybind): Ensure property bindings correctly expose the new C++ accessors without side effects.
  • test_llm_api_connector.py: Check parametrized test logic and assertion correctness, noting the reported duplicate test definition.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.88%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (⚠️ Warning): The PR description lacks required sections: no clear title format, missing test coverage details, and an incomplete PR checklist. Resolution: add a PR title following the template format [ticket][type] summary, document test coverage comprehensively, and complete the PR checklist items.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title mentions FlexKV as a KV cache offloading option, which aligns with the PR's objective to integrate FlexKV. However, it contains a typo ('anothor' instead of 'another') and is somewhat broad, not capturing key implementation details like connector token matching or the specific enhancements made.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

174-184: Fix undefined executor_config reference in _get_mapping.

The new side-effect executor_config.mapping = mapping will currently fail Ruff (F821) and can raise NameError at runtime, since executor_config is neither defined nor imported in this module.

If the intent is to keep a global executor configuration in sync (presumably the one defined in ._util), explicitly import it and use that symbol:

-from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
-                    create_py_executor_instance, instantiate_sampler, is_mla,
-                    validate_feature_combination)
+from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
+                    create_py_executor_instance, executor_config,
+                    instantiate_sampler, is_mla,
+                    validate_feature_combination)
@@
 def _get_mapping(_mapping: Mapping) -> Mapping:
     if _mapping is None:
         mapping = Mapping(world_size=tensorrt_llm.mpi_world_size(),
                           tp_size=tensorrt_llm.mpi_world_size(),
                           gpus_per_node=tensorrt_llm.default_gpus_per_node(),
                           rank=tensorrt_llm.mpi_rank())
-        executor_config.mapping = mapping
+        executor_config.mapping = mapping

If executor_config actually lives elsewhere, adjust the import target accordingly or remove this assignment and wire the mapping through the correct configuration object.

🧹 Nitpick comments (4)
tests/integration/defs/conftest.py (1)

96-103: Align llm_model_name docstring with its actual behavior.

The function correctly returns LLM_MODELS_NAME or defaults to "Qwen2-0.5B", but the docstring still refers to asserting on invalid paths, which never happens here. Consider updating it for clarity:

-def llm_model_name() -> str:
-    '''return LLM_MODELS_NAME if it is set in env, assert when it's set but not a valid path
-    '''
-    DEFAULT_LLM_MODEL_NAME = "Qwen2-0.5B"
-    LLM_MODELS_NAME = os.environ.get("LLM_MODELS_NAME", DEFAULT_LLM_MODEL_NAME)
-
-    return LLM_MODELS_NAME
+def llm_model_name() -> str:
+    """Return the model name from LLM_MODELS_NAME env var, or a default."""
+    default_llm_model_name = "Qwen2-0.5B"
+    return os.environ.get("LLM_MODELS_NAME", default_llm_model_name)

This keeps behavior identical while making the intent clearer.
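As a usage illustration (a hypothetical call site, following the pattern the connector test uses with the llm_models_root() and llm_model_name() conftest helpers; the exact join logic in the test may differ):

import os

# Resolve the model directory from the models root and the configured name.
model_path = os.path.join(llm_models_root(), llm_model_name())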

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

1067-1075: New connector token tracking API and isFinishedNormal look correct; consider a small readability tweak.

  • mNumConnectorMatchedTokens is safely initialized to 0 and the simple getter/setter are consistent with existing style.
  • isFinishedNormal() correctly classifies “normal” finishes as {kEND_ID, kSTOP_WORDS, kLENGTH} and will return false if any beam is still kNOT_FINISHED or finished abnormally, which matches the intent suggested by the enum documentation.
  • Minor readability nit: the lambda in isFinishedNormal uses \ line continuations; you can drop the backslashes and format the chained || expressions across lines normally to avoid preprocessor-style artifacts in regular C++ code.

Also applies to: 1680-1687, 1942-1944

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1420-1432: Connector-matched token propagation is wired correctly; double-check whether reuse metrics should include them.

  • Integrating getNumNewMatchedTokens and feeding prepopulatedPromptLen + numConnectorMatchedTokens into setPrepopulatedPromptLen(..., getTokensPerBlock()) is consistent with the existing chunking / KV-block invariants and with the guidance to operate in units of tokensPerBlock rather than current block fill. Based on learnings, this is the right place to adjust the effective prepopulated length.
  • Storing numConnectorMatchedTokens onto the LlmRequest via setNumConnectorMatchedTokens gives clean observability for downstream layers (bindings/tests) without perturbing the rest of the prefill logic.

One open design point to confirm: mReusedTokens and related hit-rate metrics are still updated only with prepopulatedPromptLen, so connector-driven reused tokens are not reflected in those counters. If the goal is for cache statistics to represent all KV reuse (including FlexKV / connectors), you may want to add numConnectorMatchedTokens into mReusedTokens as well; if not, consider a brief comment clarifying that connector reuse is intentionally excluded from the core KV-cache metrics.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

339-343: Consider adding a timeout to prevent indefinite blocking.

The busy-wait loop will block indefinitely if the FlexKV manager never becomes ready, which could cause the initialization to hang.

Consider adding a timeout:

 def _wait_for_flexkv_manager(self):
     if self.kv_connector_manager is not None and self.dist.rank == 0:
+        max_wait_time = 300  # 5 minutes
+        start_time = time.time()
         while not self.kv_connector_manager.scheduler.is_ready():
+            if time.time() - start_time > max_wait_time:
+                raise RuntimeError("FlexKV manager failed to become ready within timeout")
             time.sleep(0.1)
         logger.info("FlexKV manager is ready")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05058f5 and c7ff35e.

📒 Files selected for processing (9)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tests/integration/defs/conftest.py (1 hunks)
  • tests/integration/defs/llmapi/test_llm_api_connector.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • tests/integration/defs/conftest.py
  • tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tests/integration/defs/conftest.py
  • tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.h

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.h: Use a preprocessor guard in C++ header files with the guard name format TRTLLM_ followed by the filename in all caps (e.g., TRTLLM_FOO_BAR_HELLO_H for file FooBarHello.h); do not include directory names in the symbol
Do not use underscore prefix or suffix in C++ preprocessor guard symbols; they are reserved in C++ standard for compilers or implementation

Files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🧠 Learnings (15)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (437-663)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • free_resources (81-86)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • free_resources (92-96)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)
  • free_resources (139-143)
  • free_resources (242-244)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (503-598)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1)
  • handle_metadata (475-481)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • LlmRequestState (47-210)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • free_slot_only (1409-1416)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
tests/integration/defs/conftest.py (2)
  • llm_models_root (80-94)
  • llm_model_name (96-102)
🪛 Ruff (0.14.7)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1147-1147: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py

180-180: Undefined name executor_config

(F821)

tests/integration/defs/llmapi/test_llm_api_connector.py

436-436: Unused function argument: enforce_single_worker

(ARG001)

🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1409-1416: Slot-only free path looks correct; ensure SEQ_SLOT_MANAGER.free_resources is idempotent.

The helper cleanly targets only ResourceManagerType.SEQ_SLOT_MANAGER, matching the intent to release sequence slots early. This assumes the underlying SEQ_SLOT_MANAGER.free_resources(request) safely handles multiple invocations for the same request (once via free_slot_only, later via the general free_resources pipeline) without raising or leaking state. Please confirm that its implementation is idempotent or guarded accordingly.
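As a rough sketch of the shape being reviewed (names follow the walkthrough and this comment; self.resource_managers, the call-site attributes, and the finished-request check are assumptions, not the actual implementation):

# Inside the resource manager class (sketch only; attribute names are assumed).
def free_slot_only(self, request: LlmRequest) -> None:
    """Free only the sequence slot owned by `request`, leaving other resources untouched."""
    slot_manager = self.resource_managers[ResourceManagerType.SEQ_SLOT_MANAGER]
    slot_manager.free_resources(request)

# Hypothetical executor call site: release the slot early once a request has
# finished while a FlexKV put task may still be in flight.
if self.use_flexkv and request.is_finished_normal:
    self.resource_manager.free_slot_only(request)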

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

168-173: Python bindings cleanly expose the new request state without mutability.

The is_finished_normal and num_connector_matched_tokens read-only properties correctly mirror the new C++ accessors and follow existing naming conventions; no issues spotted.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

164-169: Nanobind surface matches pybind and C++ API for the new properties.

is_finished_normal and num_connector_matched_tokens are exposed as read-only fields, keeping parity with the pybind bindings and the underlying C++ interface.
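For reference, a hypothetical Python-side consumer (for example, a connector adapter) would read these as plain attributes on the bound request object:

# `req` is a GenLlmReq-like object exposed through the bindings above.
if req.is_finished_normal:
    matched_tokens = req.num_connector_matched_tokens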

tests/integration/defs/llmapi/test_llm_api_connector.py (1)

434-467: LGTM! Well-structured test for connector matched tokens.

The test effectively verifies both the getter (num_connector_matched_tokens) and setter (setNumConnectorMatchedTokens) functionality with good parametrization covering edge cases.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

299-299: LGTM! FlexKV feature flag correctly initialized.

The environment variable pattern for feature enablement is appropriate.


1187-1189: LGTM! Consistent FlexKV task refresh integration.

The refresh mechanism is correctly integrated in both executor loops to handle pending FlexKV tasks when no batch is scheduled.

Also applies to: 1415-1417


2421-2422: LGTM! Early slot release for FlexKV resource management.

The early slot release when requests finish under FlexKV enables better resource utilization by freeing slots before the put task completes.


589-589: LGTM! Improved profiling time formatting.

The consistent 3-decimal-place formatting improves readability of profiling logs.

Also applies to: 599-599

Comment on lines 1140 to 1149
def _kv_connector_refresh_unfinished_tasks(self):
    if not self.use_flexkv:
        return
    if len(self.active_requests) == 0:
        return
    if not self.kv_connector_manager:
        return
    logger.warning(f"No scheduled requests, but flexkv have pending put requests")
    self.kv_connector_manager.handle_metadata()
    time.sleep(0.01)

⚠️ Potential issue | 🟡 Minor

Fix f-string and grammar in warning message.

Line 1147 uses an f-string without placeholders and contains a grammatical error.

Apply this diff:

-        logger.warning(f"No scheduled requests, but flexkv have pending put requests")
+        logger.warning("No scheduled requests, but flexkv has pending put requests")
🧰 Tools
🪛 Ruff (0.14.7)

1147-1147: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1140 to 1149, the
logger.warning call on line 1147 uses an unnecessary f-string and has a grammar
issue; replace the f-string with a normal string and correct the message to
something like "No scheduled requests, but FlexKV has pending put requests"
(preserve punctuation and capitalization as preferred) so the log reads clearly
and without unused interpolation.

# limitations under the License.

import math
import os

⚠️ Potential issue | 🟡 Minor

Remove unused import.

The os module is imported but not used anywhere in this file.

Apply this diff:

-import os
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import os
🤖 Prompt for AI Agents
In tests/integration/defs/llmapi/test_llm_api_connector.py around line 17, the
file imports the os module but it is not used anywhere; remove the unused import
line (delete or omit the "import os") so the file no longer contains an unused
import and the test module passes linting.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute (PRs initiated from Community) label Dec 4, 2025
@Shixiaowei02
Collaborator

@thorjohnsen Could you please help review this? Thanks!


nvpohanh commented Dec 9, 2025

@eopXD Please also review this. thanks


@jthomson04 jthomson04 left a comment


I have similar thoughts here as @Shixiaowei02. We should not have flexkv-specific code in the pyexecutor. The entire point of the KV connector is to prevent this sort of thing. If the current KV connector interface isn't sufficient for flexKV, then we can discuss augmenting/improving it. But we should not be adding flexKV-specific "hacks" to make it work.

if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
{
    numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
    llmRequest.setNumConnectorMatchedTokens(numConnectorMatchedTokens);
Collaborator

Why is this needed? The kv connector should already have this knowledge from when get_num_new_matched_tokens was called.


@axxx03 axxx03 Dec 16, 2025


The flexKV adapter wants to access this data on the Python side, as shown here.

However, TensorRT-LLM seems to only retrieve it in the C++ code, and the data is not passed back to Python. Therefore, I added this interface.

[](auto reason) { return reason == executor::FinishReason::kLENGTH; });
}

[[nodiscard]] bool isFinishedNormal() const noexcept
Collaborator

Where is this used? Is this field (via the bindings) only accessed inside the flexKV implementation?

Author

Yes, currently it’s only accessed inside the flexKV implementation, which can be found here. Is this acceptable?

Collaborator

Is there any way to infer that value from the other fields? We ideally don't include new fields in TRTLLM that are exclusively accessed by a specific kv connector implementation.

self.kv_connector_manager.worker.start_load_kv(
    torch.cuda.current_stream())

def _kv_connector_refresh_unfinished_tasks(self):

@jthomson04 jthomson04 Dec 12, 2025


Why is this needed? The py executor continuously polls get_finished even when there are no scheduled requests. That should serve as the point where you refresh unfinished tasks.


axxx03 commented Dec 16, 2025

I have similar thoughts here as @Shixiaowei02. We should not have flexkv-specific code in the pyexecutor. The entire point of the KV connector is to prevent this sort of thing. If the current KV connector interface isn't sufficient for flexKV, then we can discuss augmenting/improving it. But we should not be adding flexKV-specific "hacks" to make it work.

Sorry for the late reply, and thanks for the review and feedback. @jthomson04 @Shixiaowei02
I’ll work on removing these hacks.

@Shixiaowei02 Shixiaowei02 changed the title [Feature] Support using FlexKV as anothor KV Cache Offloading option. [None][feat] Support using FlexKV as anothor KV Cache Offloading option. Dec 25, 2025
@Shixiaowei02
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29945 [ run ] triggered by Bot. Commit: d0da3d7

@tensorrt-cicd
Collaborator

PR_Github #29945 [ run ] completed with state SUCCESS. Commit: d0da3d7
/LLM/main/L0_MergeRequest_PR pipeline #23033 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@axxx03 axxx03 force-pushed the feature/support_flexkv branch 3 times, most recently from cefe957 to 4c2247d Compare December 26, 2025 06:17
@axxx03 axxx03 requested a review from jthomson04 December 26, 2025 06:43

axxx03 commented Dec 26, 2025

/bot run


@eopXD eopXD left a comment


Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

[](auto reason) { return reason == executor::FinishReason::kLENGTH; });
}

[[nodiscard]] bool isFinishedNormal() const noexcept
Collaborator

What do you mean by "normal" here? Could you modify the function name to make it more expressive?

if self.kv_connector_manager:
    self.kv_connector_manager.take_scheduled_requests_pending_load(
        scheduled_batch)
    self.kv_connector_manager.handle_metadata()
Collaborator

Could you explain why this was removed?


axxx03 commented Dec 30, 2025

Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

Hi, I found that all tests in TensorRT-LLM/tests/integration/defs/llmapi/test_llm_api_connector.py are only added in l0_a10.yml, should I add my test function in that file too?


eopXD commented Dec 30, 2025

Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

Hi, I found that all tests in TensorRT-LLM/tests/integration/defs/llmapi/test_llm_api_connector.py are only added in l0_a10.yml, should I add my test function in that file too?

Ideally we should have a test on a specific model (possibly gemma3 or llama3?) and a specific platform (h100, a10) where KV cache offload to FlexKV is helpful.

@axxx03 axxx03 force-pushed the feature/support_flexkv branch from 4c2247d to 38ac65c Compare December 31, 2025 07:03

axxx03 commented Dec 31, 2025

Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

Hi, I found that all tests in TensorRT-LLM/tests/integration/defs/llmapi/test_llm_api_connector.py are only added in l0_a10.yml, should I add my test function in that file too?

Ideally we should have a test on a specific model (possibly gemma3 or llama3?) and a specific platform (h100, a10) where KV cache offload to FlexKV is helpful.

Hi, I have removed the interface in the test function, so there's no need to test.


@jthomson04 jthomson04 left a comment


This introduces significant breaking changes into the kv connector interface.


can_queue = self._can_queue(scheduled_batch)

if self.kv_connector_manager:
Collaborator

Why are we moving handle_metadata around? This will break all other implementations of the kv connector.

Collaborator

Why are we moving handle_metadata around? This will break all other implementations of the kv connector.

Can the unit tests already added to TRT-LLM detect and ensure that functionality is not broken?

@Shixiaowei02 Shixiaowei02 force-pushed the feature/support_flexkv branch from 38ac65c to 8f2fee0 Compare January 8, 2026 02:41
@Shixiaowei02
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30974 [ run ] triggered by Bot. Commit: 8f2fee0

@tensorrt-cicd
Collaborator

PR_Github #30974 [ run ] completed with state SUCCESS. Commit: 8f2fee0
/LLM/main/L0_MergeRequest_PR pipeline #23932 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


eopXD commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31039 [ run ] triggered by Bot. Commit: 8f2fee0


@eopXD eopXD left a comment


In our case, when integrated with FlexKV, we can achieve the following improvement: ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.

Could you provide the setup behind this claimed improvement? What is the baseline here?

block_ids: The KV cacheblock IDs that were allocated.
"""

def wait_for_initialization(self):
Collaborator

As we have layer_pre_hook and layer_post_hook, can we have a plugin manager initialization hook?

prev_device_step_time = "N/A" # Handle first iteration
else:
prev_device_step_time = f"{prev_device_step_time}ms"
prev_device_step_time = f"{prev_device_step_time:.3f} ms"
Collaborator

I argue that this change is not related to the purpose of this merge request.

tp_size=tensorrt_llm.mpi_world_size(),
gpus_per_node=tensorrt_llm.default_gpus_per_node(),
rank=tensorrt_llm.mpi_rank())
executor_config.mapping = mapping
Collaborator

I think the code change should keep in mind that other types of connectors exist too. Overwriting executor_config.mapping here may not be the intent of other connectors. Maybe you can guard this exclusively for the FlexKV connector?

@tensorrt-cicd
Collaborator

PR_Github #31039 [ run ] completed with state SUCCESS. Commit: 8f2fee0
/LLM/main/L0_MergeRequest_PR pipeline #23982 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


axxx03 commented Jan 9, 2026

In our case, when integrated with FlexKV, we can achieve the following improvement: ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.

Could you provide the setup behind this claimed improvement? What is the baseline here?

Hardware

(hardware details were provided as a screenshot)

Software

  • TensorRT-LLM: 1.1.0.rc3
  • FlexKV: 1.2.0

Scripts

The launch scripts:

MODEL_PATH=DeepSeek-V3.1

BATCH_SIZE=4
TP_SIZE=8
EP_SIZE=$TP_SIZE
MAX_SEQ_LEN=16384
MAX_NUM_TOKENS=16384

export FLEXKV_CONFIG_PATH="./flexkv_config.json"

trtllm-serve serve $MODEL_PATH \
    --host 0.0.0.0 \
    --port 6000 \
    --backend pytorch \
    --tp_size $TP_SIZE \
    --ep_size $EP_SIZE \
    --max_seq_len $MAX_SEQ_LEN \
    --max_num_tokens $MAX_NUM_TOKENS \
    --max_batch_size $BATCH_SIZE \
    --extra_llm_api_options extra-llm-api-config.yml

The config yaml:

cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
enable_chunked_prefill: true
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.75
kv_connector_config:
  connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
  connector_scheduler_class: "FlexKVSchedulerConnector"
  connector_worker_class: "FlexKVWorkerConnector"
speculative_config: 
  decoding_type: MTP
  num_nextn_predict_layers: 1

Benchmark

                        TTFT (s)    TPOT (s)    QPM
TensorRT-LLM            0.42        25.82       125.36
TensorRT-LLM + FlexKV   0.35        21.24       161.66
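These figures are consistent with the numbers quoted in the PR description: (0.42 - 0.35) / 0.42 ≈ 16.7% lower TTFT, and 161.66 / 125.36 - 1 ≈ 29% higher QPM.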
