
Conversation


@axxx03 axxx03 commented Dec 4, 2025

Description

FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching to enable inference engines to achieve higher throughput and lower latency.

In our case, when integrated with FlexKV, we can achieve the following improvement:

  • ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.
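For reference, a minimal sketch of how FlexKV is enabled and wired up (the environment flag comes from the PyExecutor changes in this PR, and the connector settings mirror the kv_connector_config from the benchmark setup shown below; the exact value expected for the flag is an assumption here):

import os

# Opt-in flag read by PyExecutor in this PR (assumed to be enabled with "1").
os.environ["TENSORRT_LLM_USE_FLEXKV"] = "1"

# Connector settings, matching the kv_connector_config passed via
# --extra_llm_api_options in the benchmark setup below.
flexkv_connector_config = {
    "connector_module": "flexkv.integration.tensorrt_llm.trtllm_adapter",
    "connector_scheduler_class": "FlexKVSchedulerConnector",
    "connector_worker_class": "FlexKVWorkerConnector",
}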

Co-authored-by: peaceforeverCN

@axxx03 axxx03 requested review from a team as code owners December 4, 2025 07:28
@axxx03 axxx03 requested a review from shaharmor98 December 4, 2025 07:28

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

Adds KV-cache connector matched token tracking across C++ and Python APIs via new accessors and methods on LlmRequest. Introduces FlexKV feature support in PyExecutor with resource management integration. Extends Python bindings to expose new properties. Adds helper function for test configuration and test coverage for connector token matching.

Changes

  • C++ Core API: LlmRequest (cpp/include/tensorrt_llm/batch_manager/llmRequest.h): Added getNumConnectorMatchedTokens() and setNumConnectorMatchedTokens(SizeType32) accessors and an isFinishedNormal() method; introduced the private member mNumConnectorMatchedTokens for block reuse tracking.
  • C++ Implementation: KV Cache Manager (cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp): Propagates numConnectorMatchedTokens into LlmRequest state via setNumConnectorMatchedTokens() when a KV cache connector is present.
  • Python Bindings (cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp, cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp): Exposed the read-only properties is_finished_normal and num_connector_matched_tokens to Python for GenLlmReq.
  • Python Executor: FlexKV Support (tensorrt_llm/_torch/pyexecutor/py_executor.py): Added a FlexKV feature flag read from the environment variable TENSORRT_LLM_USE_FLEXKV; implemented the _wait_for_flexkv_manager() and _kv_connector_refresh_unfinished_tasks() helpers; integrated the FlexKV manager readiness check and task refresh logic into the main executor loops; added FlexKV slot cleanup on request completion; tightened numeric formatting to 3 decimal places.
  • Python Executor Setup (tensorrt_llm/_torch/pyexecutor/py_executor_creator.py): Modified _get_mapping() to assign the created Mapping object to executor_config.mapping when _mapping is None.
  • Resource Management (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Added a free_slot_only() method to free only the slot associated with a request.
  • Test Configuration & Coverage (tests/integration/defs/conftest.py, tests/integration/defs/llmapi/test_llm_api_connector.py): Added an llm_model_name() helper to resolve the model from the environment or default to "Qwen2-0.5B"; updated test setup to use a dynamic model path; added the parametrized test test_connector_num_matched_tokens (with threadleak disabled) to verify connector matched token tracking across [0, 32] values.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • py_executor.py: Review FlexKV initialization flow, manager readiness logic, task refresh invocation patterns, and resource cleanup paths for correctness and ordering.
  • kvCacheManager.cpp: Verify that setNumConnectorMatchedTokens() is called at the correct point in the cache population flow.
  • bindings.cpp (both nanobind and pybind): Ensure property bindings correctly expose the new C++ accessors without side effects.
  • test_llm_api_connector.py: Check parametrized test logic and assertion correctness, noting the reported duplicate test definition.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.88%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check (⚠️ Warning): The PR description lacks required sections: no clear title format, missing test coverage details, and an incomplete PR checklist. Resolution: add a PR title following the template format [ticket][type] summary, document test coverage comprehensively, and complete the PR checklist items.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title mentions FlexKV as a KV cache offloading option, which aligns with the PR's objective to integrate FlexKV. However, it contains a typo ('anothor' instead of 'another') and is somewhat broad, not capturing key implementation details like connector token matching or the specific enhancements made.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

174-184: Fix undefined executor_config reference in _get_mapping.

The new side-effect executor_config.mapping = mapping will currently fail Ruff (F821) and can raise NameError at runtime, since executor_config is neither defined nor imported in this module.

If the intent is to keep a global executor configuration in sync (presumably the one defined in ._util), explicitly import it and use that symbol:

-from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
-                    create_py_executor_instance, instantiate_sampler, is_mla,
-                    validate_feature_combination)
+from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
+                    create_py_executor_instance, executor_config,
+                    instantiate_sampler, is_mla,
+                    validate_feature_combination)
@@
 def _get_mapping(_mapping: Mapping) -> Mapping:
     if _mapping is None:
         mapping = Mapping(world_size=tensorrt_llm.mpi_world_size(),
                           tp_size=tensorrt_llm.mpi_world_size(),
                           gpus_per_node=tensorrt_llm.default_gpus_per_node(),
                           rank=tensorrt_llm.mpi_rank())
-        executor_config.mapping = mapping
+        executor_config.mapping = mapping

If executor_config actually lives elsewhere, adjust the import target accordingly or remove this assignment and wire the mapping through the correct configuration object.

🧹 Nitpick comments (4)
tests/integration/defs/conftest.py (1)

96-103: Align llm_model_name docstring with its actual behavior.

The function correctly returns LLM_MODELS_NAME or defaults to "Qwen2-0.5B", but the docstring still refers to asserting on invalid paths, which never happens here. Consider updating it for clarity:

-def llm_model_name() -> str:
-    '''return LLM_MODELS_NAME if it is set in env, assert when it's set but not a valid path
-    '''
-    DEFAULT_LLM_MODEL_NAME = "Qwen2-0.5B"
-    LLM_MODELS_NAME = os.environ.get("LLM_MODELS_NAME", DEFAULT_LLM_MODEL_NAME)
-
-    return LLM_MODELS_NAME
+def llm_model_name() -> str:
+    """Return the model name from LLM_MODELS_NAME env var, or a default."""
+    default_llm_model_name = "Qwen2-0.5B"
+    return os.environ.get("LLM_MODELS_NAME", default_llm_model_name)

This keeps behavior identical while making the intent clearer.
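As a usage illustration (a hypothetical call site, following the pattern the connector test uses with the llm_models_root() and llm_model_name() conftest helpers; the exact join logic in the test may differ):

import os

# Resolve the model directory from the models root and the configured name.
model_path = os.path.join(llm_models_root(), llm_model_name())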

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

1067-1075: New connector token tracking API and isFinishedNormal look correct; consider a small readability tweak.

  • mNumConnectorMatchedTokens is safely initialized to 0 and the simple getter/setter are consistent with existing style.
  • isFinishedNormal() correctly classifies “normal” finishes as {kEND_ID, kSTOP_WORDS, kLENGTH} and will return false if any beam is still kNOT_FINISHED or finished abnormally, which matches the intent suggested by the enum documentation.
  • Minor readability nit: the lambda in isFinishedNormal uses \ line continuations; you can drop the backslashes and format the chained || expressions across lines normally to avoid preprocessor-style artifacts in regular C++ code.

Also applies to: 1680-1687, 1942-1944

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

1420-1432: Connector-matched token propagation is wired correctly; double-check whether reuse metrics should include them.

  • Integrating getNumNewMatchedTokens and feeding prepopulatedPromptLen + numConnectorMatchedTokens into setPrepopulatedPromptLen(..., getTokensPerBlock()) is consistent with the existing chunking / KV-block invariants and with the guidance to operate in units of tokensPerBlock rather than current block fill. Based on learnings, this is the right place to adjust the effective prepopulated length.
  • Storing numConnectorMatchedTokens onto the LlmRequest via setNumConnectorMatchedTokens gives clean observability for downstream layers (bindings/tests) without perturbing the rest of the prefill logic.

One open design point to confirm: mReusedTokens and related hit-rate metrics are still updated only with prepopulatedPromptLen, so connector-driven reused tokens are not reflected in those counters. If the goal is for cache statistics to represent all KV reuse (including FlexKV / connectors), you may want to add numConnectorMatchedTokens into mReusedTokens as well; if not, consider a brief comment clarifying that connector reuse is intentionally excluded from the core KV-cache metrics.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

339-343: Consider adding a timeout to prevent indefinite blocking.

The busy-wait loop will block indefinitely if the FlexKV manager never becomes ready, which could cause the initialization to hang.

Consider adding a timeout:

 def _wait_for_flexkv_manager(self):
     if self.kv_connector_manager is not None and self.dist.rank == 0:
+        max_wait_time = 300  # 5 minutes
+        start_time = time.time()
         while not self.kv_connector_manager.scheduler.is_ready():
+            if time.time() - start_time > max_wait_time:
+                raise RuntimeError("FlexKV manager failed to become ready within timeout")
             time.sleep(0.1)
         logger.info("FlexKV manager is ready")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 05058f5 and c7ff35e.

📒 Files selected for processing (9)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tests/integration/defs/conftest.py (1 hunks)
  • tests/integration/defs/llmapi/test_llm_api_connector.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,cu}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #define whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as const
Except 0 (only used in comparison for checking signness/existence/emptiness) and nullptr, true, false, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty for or while loop in a new line
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements)
If and else should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., thisIsASubDir and thisIsAFilename.cpp)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., localFooBar)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues), though the 'm' pre...

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
  • tests/integration/defs/conftest.py
  • tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tests/integration/defs/conftest.py
  • tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.h

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.h: Use a preprocessor guard in C++ header files with the guard name format TRTLLM_ followed by the filename in all caps (e.g., TRTLLM_FOO_BAR_HELLO_H for file FooBarHello.h); do not include directory names in the symbol
Do not use underscore prefix or suffix in C++ preprocessor guard symbols; they are reserved in C++ standard for compilers or implementation

Files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🧠 Learnings (15)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.

Applied to files:

  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tests/integration/defs/llmapi/test_llm_api_connector.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (437-663)
tensorrt_llm/_torch/speculative/mtp.py (1)
  • free_resources (81-86)
tensorrt_llm/_torch/speculative/eagle3.py (1)
  • free_resources (92-96)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)
  • free_resources (139-143)
  • free_resources (242-244)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
cpp/include/tensorrt_llm/executor/types.h (1)
  • FinishReason (503-598)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1)
  • handle_metadata (475-481)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • LlmRequestState (47-210)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • free_slot_only (1409-1416)
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
tests/integration/defs/conftest.py (2)
  • llm_models_root (80-94)
  • llm_model_name (96-102)
🪛 Ruff (0.14.7)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1147-1147: f-string without any placeholders

Remove extraneous f prefix

(F541)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py

180-180: Undefined name executor_config

(F821)

tests/integration/defs/llmapi/test_llm_api_connector.py

436-436: Unused function argument: enforce_single_worker

(ARG001)

🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1409-1416: Slot-only free path looks correct; ensure SEQ_SLOT_MANAGER.free_resources is idempotent.

The helper cleanly targets only ResourceManagerType.SEQ_SLOT_MANAGER, matching the intent to release sequence slots early. This assumes the underlying SEQ_SLOT_MANAGER.free_resources(request) safely handles multiple invocations for the same request (once via free_slot_only, later via the general free_resources pipeline) without raising or leaking state. Please confirm that its implementation is idempotent or guarded accordingly.
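As a rough sketch of the shape being reviewed (names follow the walkthrough and this comment; self.resource_managers, the call-site attributes, and the finished-request check are assumptions, not the actual implementation):

# Inside the resource manager class (sketch only; attribute names are assumed).
def free_slot_only(self, request: LlmRequest) -> None:
    """Free only the sequence slot owned by `request`, leaving other resources untouched."""
    slot_manager = self.resource_managers[ResourceManagerType.SEQ_SLOT_MANAGER]
    slot_manager.free_resources(request)

# Hypothetical executor call site: release the slot early once a request has
# finished while a FlexKV put task may still be in flight.
if self.use_flexkv and request.is_finished_normal:
    self.resource_manager.free_slot_only(request)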

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)

168-173: Python bindings cleanly expose the new request state without mutability.

The is_finished_normal and num_connector_matched_tokens read-only properties correctly mirror the new C++ accessors and follow existing naming conventions; no issues spotted.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

164-169: Nanobind surface matches pybind and C++ API for the new properties.

is_finished_normal and num_connector_matched_tokens are exposed as read-only fields, keeping parity with the pybind bindings and the underlying C++ interface.
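For reference, a hypothetical Python-side consumer (for example, a connector adapter) would read these as plain attributes on the bound request object:

# `req` is a GenLlmReq-like object exposed through the bindings above.
if req.is_finished_normal:
    matched_tokens = req.num_connector_matched_tokens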

tests/integration/defs/llmapi/test_llm_api_connector.py (1)

434-467: LGTM! Well-structured test for connector matched tokens.

The test effectively verifies both the getter (num_connector_matched_tokens) and setter (setNumConnectorMatchedTokens) functionality with good parametrization covering edge cases.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)

299-299: LGTM! FlexKV feature flag correctly initialized.

The environment variable pattern for feature enablement is appropriate.


1187-1189: LGTM! Consistent FlexKV task refresh integration.

The refresh mechanism is correctly integrated in both executor loops to handle pending FlexKV tasks when no batch is scheduled.

Also applies to: 1415-1417


2421-2422: LGTM! Early slot release for FlexKV resource management.

The early slot release when requests finish under FlexKV enables better resource utilization by freeing slots before the put task completes.


589-589: LGTM! Improved profiling time formatting.

The consistent 3-decimal-place formatting improves readability of profiling logs.

Also applies to: 599-599

Comment on lines 1140 to 1149
def _kv_connector_refresh_unfinished_tasks(self):
    if not self.use_flexkv:
        return
    if len(self.active_requests) == 0:
        return
    if not self.kv_connector_manager:
        return
    logger.warning(f"No scheduled requests, but flexkv have pending put requests")
    self.kv_connector_manager.handle_metadata()
    time.sleep(0.01)

⚠️ Potential issue | 🟡 Minor

Fix f-string and grammar in warning message.

Line 1147 uses an f-string without placeholders and contains a grammatical error.

Apply this diff:

-        logger.warning(f"No scheduled requests, but flexkv have pending put requests")
+        logger.warning("No scheduled requests, but flexkv has pending put requests")
🧰 Tools
🪛 Ruff (0.14.7)

1147-1147: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1140 to 1149, the
logger.warning call on line 1147 uses an unnecessary f-string and has a grammar
issue; replace the f-string with a normal string and correct the message to
something like "No scheduled requests, but FlexKV has pending put requests"
(preserve punctuation and capitalization as preferred) so the log reads clearly
and without unused interpolation.

# limitations under the License.

import math
import os

⚠️ Potential issue | 🟡 Minor

Remove unused import.

The os module is imported but not used anywhere in this file.

Apply this diff:

-import os
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import os
🤖 Prompt for AI Agents
In tests/integration/defs/llmapi/test_llm_api_connector.py around line 17, the
file imports the os module but it is not used anywhere; remove the unused import
line (delete or omit the "import os") so the file no longer contains an unused
import and the test module passes linting.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute (PRs initiated from Community) label Dec 4, 2025
@Shixiaowei02
Collaborator

@thorjohnsen Could you please help review this? Thanks!


nvpohanh commented Dec 9, 2025

@eopXD Please also review this. thanks


@jthomson04 jthomson04 left a comment


I have similar thoughts here as @Shixiaowei02. We should not have flexkv-specific code in the pyexecutor. The entire point of the KV connector is to prevent this sort of thing. If the current KV connector interface isn't sufficient for flexKV, then we can discuss augmenting/improving it. But we should not be adding flexKV-specific "hacks" to make it work.

if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
{
    numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
    llmRequest.setNumConnectorMatchedTokens(numConnectorMatchedTokens);
Collaborator

Why is this needed? The kv connector should already have this knowledge from when get_num_new_matched_tokens was called.


@axxx03 axxx03 Dec 16, 2025


The flexKV adapter wants to access this data on the Python side, as shown here.

However, TensorRT-LLM seems to only retrieve it in the C++ code, and the data is not passed back to Python. Therefore, I added this interface.

[](auto reason) { return reason == executor::FinishReason::kLENGTH; });
}

[[nodiscard]] bool isFinishedNormal() const noexcept
Collaborator

Where is this used? Is this field (via the bindings) only accessed inside the flexKV implementation?

Author

Yes, currently it’s only accessed inside the flexKV implementation, which can be found here. Is this acceptable?

Collaborator

Is there any way to infer that value from the other fields? We ideally don't include new fields in TRTLLM that are exclusively accessed by a specific kv connector implementation.

self.kv_connector_manager.worker.start_load_kv(
    torch.cuda.current_stream())

def _kv_connector_refresh_unfinished_tasks(self):

@jthomson04 jthomson04 Dec 12, 2025


Why is this needed? The py executor continuously polls get_finished even when there are no scheduled requests. That should serve as the point where you refresh unfinished tasks.


axxx03 commented Dec 16, 2025

I have similar thoughts here as @Shixiaowei02. We should not have flexkv-specific code in the pyexecutor. The entire point of the KV connector is to prevent this sort of thing. If the current KV connector interface isn't sufficient for flexKV, then we can discuss augmenting/improving it. But we should not be adding flexKV-specific "hacks" to make it work.

Sorry for the late reply, and thanks for the review and feedback. @jthomson04 @Shixiaowei02
I’ll work on removing these hacks.

@Shixiaowei02 Shixiaowei02 changed the title [Feature] Support using FlexKV as anothor KV Cache Offloading option. [None][feat] Support using FlexKV as anothor KV Cache Offloading option. Dec 25, 2025
@Shixiaowei02
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #29945 [ run ] triggered by Bot. Commit: d0da3d7

@tensorrt-cicd
Collaborator

PR_Github #29945 [ run ] completed with state SUCCESS. Commit: d0da3d7
/LLM/main/L0_MergeRequest_PR pipeline #23033 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@axxx03 axxx03 force-pushed the feature/support_flexkv branch 3 times, most recently from cefe957 to 4c2247d Compare December 26, 2025 06:17
@axxx03 axxx03 requested a review from jthomson04 December 26, 2025 06:43

axxx03 commented Dec 26, 2025

/bot run


@eopXD eopXD left a comment


Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

[](auto reason) { return reason == executor::FinishReason::kLENGTH; });
}

[[nodiscard]] bool isFinishedNormal() const noexcept
Collaborator

What do you mean by "normal" here? Could you modify the function name to make it more expressive?

if self.kv_connector_manager:
    self.kv_connector_manager.take_scheduled_requests_pending_load(
        scheduled_batch)
    self.kv_connector_manager.handle_metadata()
Collaborator

Could you explain why this was removed?


axxx03 commented Dec 30, 2025

Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

Hi, I found that all tests in TensorRT-LLM/tests/integration/defs/llmapi/test_llm_api_connector.py are only added in l0_a10.yml, should I add my test function in that file too?


eopXD commented Dec 30, 2025

Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

Hi, I found that all tests in TensorRT-LLM/tests/integration/defs/llmapi/test_llm_api_connector.py are only added in l0_a10.yml, should I add my test function in that file too?

Ideally we should have a test on a specific model (possibly gemma3 or llama3?) and a specific platform (h100, a10) where KV cache offload to FlexKV is helpful.

@axxx03 axxx03 force-pushed the feature/support_flexkv branch from 4c2247d to 38ac65c Compare December 31, 2025 07:03

axxx03 commented Dec 31, 2025

Thank you! e2e test case will be appreciated. Please try to add them in files like l0_h100.yml

Hi, I found that all tests in TensorRT-LLM/tests/integration/defs/llmapi/test_llm_api_connector.py are only added in l0_a10.yml, should I add my test function in that file too?

Ideally we should have a test on a specific model (possibly gemma3 or llama3?) and a specific platform (h100, a10) where KV cache offload to FlexKV is helpful.

Hi, I have removed the interface in the test function, so there's no need to test.


@jthomson04 jthomson04 left a comment


This introduces significant breaking changes into the kv connector interface.


can_queue = self._can_queue(scheduled_batch)

if self.kv_connector_manager:
Collaborator

Why are we moving handle_metadata around? This will break all other implementations of the kv connector.

Collaborator

Why are we moving handle_metadata around? This will break all other implementations of the kv connector.

Can the unit tests already added to TRT-LLM detect and ensure that functionality is not broken?

@Shixiaowei02 Shixiaowei02 force-pushed the feature/support_flexkv branch from 38ac65c to 8f2fee0 Compare January 8, 2026 02:41
@Shixiaowei02
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #30974 [ run ] triggered by Bot. Commit: 8f2fee0

@tensorrt-cicd
Collaborator

PR_Github #30974 [ run ] completed with state SUCCESS. Commit: 8f2fee0
/LLM/main/L0_MergeRequest_PR pipeline #23932 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


eopXD commented Jan 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #31039 [ run ] triggered by Bot. Commit: 8f2fee0


@eopXD eopXD left a comment


In our case, when integrated with FlexKV, we can achieve the following improvement: ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.

Could you provide the setup behind this claimed improvement? What is the baseline here?

block_ids: The KV cacheblock IDs that were allocated.
"""

def wait_for_initialization(self):
Collaborator

As we have layer_pre_hook and layer_post_hook, can we have a plugin manager initialization hook?

prev_device_step_time = "N/A" # Handle first iteration
else:
prev_device_step_time = f"{prev_device_step_time}ms"
prev_device_step_time = f"{prev_device_step_time:.3f} ms"
Collaborator

I argue that this change is not related to the purpose of this merge request.

tp_size=tensorrt_llm.mpi_world_size(),
gpus_per_node=tensorrt_llm.default_gpus_per_node(),
rank=tensorrt_llm.mpi_rank())
executor_config.mapping = mapping
Collaborator

I think the code change should keep in mind that other types of connectors exist too. Overwriting executor_config.mapping here may not be the intent of other connectors. Maybe you can guard this exclusively for the FlexKV connector?

@tensorrt-cicd
Collaborator

PR_Github #31039 [ run ] completed with state SUCCESS. Commit: 8f2fee0
/LLM/main/L0_MergeRequest_PR pipeline #23982 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


axxx03 commented Jan 9, 2026

In our case, when integrated with FlexKV, we can achieve the following improvement: ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.

Could you provide the setup behind this claimed improvement? What is the baseline here?

Hardware

(hardware details were provided as a screenshot)

Software

  • TensorRT-LLM: 1.1.0.rc3
  • FlexKV: 1.2.0

Scripts

The launch scripts:

MODEL_PATH=DeepSeek-V3.1

BATCH_SIZE=4
TP_SIZE=8
EP_SIZE=$TP_SIZE
MAX_SEQ_LEN=16384
MAX_NUM_TOKENS=16384

export FLEXKV_CONFIG_PATH="./flexkv_config.json"

trtllm-serve serve $MODEL_PATH \
    --host 0.0.0.0 \
    --port 6000 \
    --backend pytorch \
    --tp_size $TP_SIZE \
    --ep_size $EP_SIZE \
    --max_seq_len $MAX_SEQ_LEN \
    --max_num_tokens $MAX_NUM_TOKENS \
    --max_batch_size $BATCH_SIZE \
    --extra_llm_api_options extra-llm-api-config.yml

The config yaml:

cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
enable_chunked_prefill: true
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.75
kv_connector_config:
  connector_module: "flexkv.integration.tensorrt_llm.trtllm_adapter"
  connector_scheduler_class: "FlexKVSchedulerConnector"
  connector_worker_class: "FlexKVWorkerConnector"
speculative_config: 
  decoding_type: MTP
  num_nextn_predict_layers: 1

Benchmark

                        TTFT (s)    TPOT (s)    QPM
TensorRT-LLM            0.42        25.82       125.36
TensorRT-LLM + FlexKV   0.35        21.24       161.66
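These figures are consistent with the numbers quoted in the PR description: (0.42 - 0.35) / 0.42 ≈ 16.7% lower TTFT, and 161.66 / 125.36 - 1 ≈ 29% higher QPM.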
