[None][feat] Support using FlexKV as anothor KV Cache Offloading option. #9698
base: main
Conversation
📝 Walkthrough

Adds KV-cache connector matched-token tracking across the C++ and Python APIs via new accessors and methods on `LlmRequest`. Introduces FlexKV feature support in PyExecutor with resource management integration. Extends the Python bindings to expose the new properties. Adds a helper function for test configuration and test coverage for connector token matching.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
174-184: Fix undefined `executor_config` reference in `_get_mapping`.

The new side effect `executor_config.mapping = mapping` will currently fail Ruff (F821) and can raise `NameError` at runtime, since `executor_config` is neither defined nor imported in this module. If the intent is to keep a global executor configuration in sync (presumably the one defined in `._util`), explicitly import it and use that symbol:

```diff
-from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
-                    create_py_executor_instance, instantiate_sampler, is_mla,
-                    validate_feature_combination)
+from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
+                    create_py_executor_instance, executor_config,
+                    instantiate_sampler, is_mla,
+                    validate_feature_combination)
@@
 def _get_mapping(_mapping: Mapping) -> Mapping:
     if _mapping is None:
         mapping = Mapping(world_size=tensorrt_llm.mpi_world_size(),
                           tp_size=tensorrt_llm.mpi_world_size(),
                           gpus_per_node=tensorrt_llm.default_gpus_per_node(),
                           rank=tensorrt_llm.mpi_rank())
-        executor_config.mapping = mapping
+        executor_config.mapping = mapping
```

If `executor_config` actually lives elsewhere, adjust the import target accordingly, or remove this assignment and wire the mapping through the correct configuration object.
🧹 Nitpick comments (4)
tests/integration/defs/conftest.py (1)
96-103: Align the `llm_model_name` docstring with its actual behavior.

The function correctly returns `LLM_MODELS_NAME` or defaults to `"Qwen2-0.5B"`, but the docstring still refers to asserting on invalid paths, which never happens here. Consider updating it for clarity:

```diff
-def llm_model_name() -> str:
-    '''return LLM_MODELS_NAME if it is set in env, assert when it's set but not a valid path
-    '''
-    DEFAULT_LLM_MODEL_NAME = "Qwen2-0.5B"
-    LLM_MODELS_NAME = os.environ.get("LLM_MODELS_NAME", DEFAULT_LLM_MODEL_NAME)
-
-    return LLM_MODELS_NAME
+def llm_model_name() -> str:
+    """Return the model name from LLM_MODELS_NAME env var, or a default."""
+    default_llm_model_name = "Qwen2-0.5B"
+    return os.environ.get("LLM_MODELS_NAME", default_llm_model_name)
```

This keeps behavior identical while making the intent clearer.
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
1067-1075: New connector token tracking API and `isFinishedNormal` look correct; consider a small readability tweak.

- `mNumConnectorMatchedTokens` is safely initialized to 0, and the simple getter/setter are consistent with existing style. `isFinishedNormal()` correctly classifies "normal" finishes as `{kEND_ID, kSTOP_WORDS, kLENGTH}` and will return false if any beam is still `kNOT_FINISHED` or finished abnormally, which matches the intent suggested by the enum documentation.
- Minor readability nit: the lambda in `isFinishedNormal` uses `\` line continuations; you can drop the backslashes and format the chained `||` expressions across lines normally to avoid preprocessor-style artifacts in regular C++ code.

Also applies to: 1680-1687, 1942-1944
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
1420-1432: Connector-matched token propagation is wired correctly; double-check whether reuse metrics should include them.

- Integrating `getNumNewMatchedTokens` and feeding `prepopulatedPromptLen + numConnectorMatchedTokens` into `setPrepopulatedPromptLen(..., getTokensPerBlock())` is consistent with the existing chunking / KV-block invariants and with the guidance to operate in units of `tokensPerBlock` rather than the current block fill. Based on learnings, this is the right place to adjust the effective prepopulated length.
- Storing `numConnectorMatchedTokens` onto the `LlmRequest` via `setNumConnectorMatchedTokens` gives clean observability for downstream layers (bindings/tests) without perturbing the rest of the prefill logic.

One open design point to confirm: `mReusedTokens` and related hit-rate metrics are still updated only with `prepopulatedPromptLen`, so connector-driven reused tokens are not reflected in those counters. If the goal is for cache statistics to represent all KV reuse (including FlexKV / connectors), you may want to add `numConnectorMatchedTokens` into `mReusedTokens` as well; if not, consider a brief comment clarifying that connector reuse is intentionally excluded from the core KV-cache metrics.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
339-343: Consider adding a timeout to prevent indefinite blocking.

The busy-wait loop will block indefinitely if the FlexKV manager never becomes ready, which could cause the initialization to hang. Consider adding a timeout:

```diff
 def _wait_for_flexkv_manager(self):
     if self.kv_connector_manager is not None and self.dist.rank == 0:
+        max_wait_time = 300  # 5 minutes
+        start_time = time.time()
         while not self.kv_connector_manager.scheduler.is_ready():
+            if time.time() - start_time > max_wait_time:
+                raise RuntimeError("FlexKV manager failed to become ready within timeout")
             time.sleep(0.1)
         logger.info("FlexKV manager is ready")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (8 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tests/integration/defs/conftest.py (1 hunks)
- tests/integration/defs/llmapi/test_llm_api_connector.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`If` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
- tests/integration/defs/conftest.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tests/integration/defs/conftest.py
- tests/integration/defs/llmapi/test_llm_api_connector.py
**/*.h
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.h: Use a preprocessor guard in C++ header files with the guard name format `TRTLLM_` followed by the filename in all caps (e.g., `TRTLLM_FOO_BAR_HELLO_H` for file `FooBarHello.h`); do not include directory names in the symbol
Do not use underscore prefix or suffix in C++ preprocessor guard symbols; they are reserved in C++ standard for compilers or implementation
Files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🧠 Learnings (15)
📓 Common learnings
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-21T09:41:49.347Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:2010-2045
Timestamp: 2025-08-21T09:41:49.347Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is specifically for updating bookkeeping when blocks are added during the context phase, not for refreshing offsets after detach operations. During detach operations, GenerationRequest::removeFrontBlock handles the necessary cache block bookkeeping internally.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
📚 Learning: 2025-08-20T06:48:45.368Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h:0-0
Timestamp: 2025-08-20T06:48:45.368Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, updateSequenceCacheBlockOffsets is only called when adding a sequence, not during detach operations. During detach, the cache block bookkeeping is handled by GenerationRequest::removeFrontBlock.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.
Applied to files:
cpp/include/tensorrt_llm/batch_manager/llmRequest.h
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tests/integration/defs/llmapi/test_llm_api_connector.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
- LlmRequest (437-663)

tensorrt_llm/_torch/speculative/mtp.py (1)
- free_resources (81-86)

tensorrt_llm/_torch/speculative/eagle3.py (1)
- free_resources (92-96)

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)
- free_resources (139-143)
- free_resources (242-244)

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)

cpp/include/tensorrt_llm/executor/types.h (1)
- FinishReason (503-598)

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1)
- handle_metadata (475-481)

cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
- LlmRequestState (47-210)

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
- free_slot_only (1409-1416)

tests/integration/defs/llmapi/test_llm_api_connector.py (1)

tests/integration/defs/conftest.py (2)
- llm_models_root (80-94)
- llm_model_name (96-102)
🪛 Ruff (0.14.7)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1147-1147: f-string without any placeholders
Remove extraneous f prefix
(F541)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
180-180: Undefined name executor_config
(F821)
tests/integration/defs/llmapi/test_llm_api_connector.py
436-436: Unused function argument: enforce_single_worker
(ARG001)
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1409-1416: Slot-only free path looks correct; ensure `SEQ_SLOT_MANAGER.free_resources` is idempotent.

The helper cleanly targets only `ResourceManagerType.SEQ_SLOT_MANAGER`, matching the intent to release sequence slots early. This assumes the underlying `SEQ_SLOT_MANAGER.free_resources(request)` safely handles multiple invocations for the same request (once via `free_slot_only`, later via the general `free_resources` pipeline) without raising or leaking state. Please confirm that its implementation is idempotent or guarded accordingly.
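If it is not already idempotent, a guard of roughly the following shape would make double-free calls harmless. This is a minimal sketch only; the attribute names (`slot_mapping`, `request_id`) are assumptions, not the actual TensorRT-LLM slot manager internals:

```python
from typing import Dict, Optional


class SeqSlotManagerSketch:
    """Sketch of an idempotency guard for repeated free_resources calls."""

    def __init__(self, max_slots: int) -> None:
        self.free_slots = list(range(max_slots))
        self.slot_mapping: Dict[int, int] = {}  # request_id -> slot (hypothetical)

    def free_resources(self, request) -> None:
        slot: Optional[int] = self.slot_mapping.pop(request.request_id, None)
        if slot is None:
            # Slot already released (e.g., earlier via free_slot_only); no-op.
            return
        self.free_slots.append(slot)
```

cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp (1)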
168-173: Python bindings cleanly expose the new request state without mutability.

The `is_finished_normal` and `num_connector_matched_tokens` read-only properties correctly mirror the new C++ accessors and follow existing naming conventions; no issues spotted.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
164-169: Nanobind surface matches pybind and C++ API for the new properties.

`is_finished_normal` and `num_connector_matched_tokens` are exposed as read-only fields, keeping parity with the pybind bindings and the underlying C++ interface.

tests/integration/defs/llmapi/test_llm_api_connector.py (1)
434-467: LGTM! Well-structured test for connector matched tokens.

The test effectively verifies both the getter (`num_connector_matched_tokens`) and setter (`setNumConnectorMatchedTokens`) functionality with good parametrization covering edge cases.

tensorrt_llm/_torch/pyexecutor/py_executor.py (4)
299-299: LGTM! FlexKV feature flag correctly initialized.

The environment variable pattern for feature enablement is appropriate.
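For reference, such a flag typically reduces to a single environment lookup at construction time; the sketch below assumes a hypothetical variable name rather than the one actually used in this PR:

```python
import os

# Sketch: opt-in FlexKV feature flag read once when the executor is built.
# The environment variable name is an assumption for illustration.
use_flexkv = os.environ.get("TRTLLM_USE_FLEXKV", "0") == "1"
```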
1187-1189: LGTM! Consistent FlexKV task refresh integration.

The refresh mechanism is correctly integrated in both executor loops to handle pending FlexKV tasks when no batch is scheduled.
Also applies to: 1415-1417
2421-2422: LGTM! Early slot release for FlexKV resource management.

The early slot release when requests finish under FlexKV enables better resource utilization by freeing slots before the put task completes.
589-589: LGTM! Improved profiling time formatting.

The consistent 3-decimal-place formatting improves readability of profiling logs.
Also applies to: 599-599
```python
    def _kv_connector_refresh_unfinished_tasks(self):
        if not self.use_flexkv:
            return
        if len(self.active_requests) == 0:
            return
        if not self.kv_connector_manager:
            return
        logger.warning(f"No scheduled requests, but flexkv have pending put requests")
        self.kv_connector_manager.handle_metadata()
        time.sleep(0.01)
```
Fix f-string and grammar in warning message.
Line 1147 uses an f-string without placeholders and contains a grammatical error.
Apply this diff:
```diff
-        logger.warning(f"No scheduled requests, but flexkv have pending put requests")
+        logger.warning("No scheduled requests, but flexkv has pending put requests")
```

🧰 Tools
🪛 Ruff (0.14.7)
1147-1147: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1140 to 1149, the
logger.warning call on line 1147 uses an unnecessary f-string and has a grammar
issue; replace the f-string with a normal string and correct the message to
something like "No scheduled requests, but FlexKV has pending put requests"
(preserve punctuation and capitalization as preferred) so the log reads clearly
and without unused interpolation.
```python
# limitations under the License.

import math
import os
```
Remove unused import.
The os module is imported but not used anywhere in this file.
Apply this diff:
```diff
-import os
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import os
```
🤖 Prompt for AI Agents
In tests/integration/defs/llmapi/test_llm_api_connector.py around line 17, the
file imports the os module but it is not used anywhere; remove the unused import
line (delete or omit the "import os") so the file no longer contains an unused
import and the test module passes linting.
@thorjohnsen Could you please help review this? Thanks!

@eopXD Please also review this. Thanks!
jthomson04
left a comment
I have similar thoughts here as @Shixiaowei02. We should not have flexkv-specific code in the pyexecutor. The entire point of the KV connector is to prevent this sort of thing. If the current KV connector interface isn't sufficient for flexKV, then we can discuss augmenting/improving it. But we should not be adding flexKV-specific "hacks" to make it work.
```cpp
    if (mKvCacheConnectorManager && !llmRequest.isDummyRequest())
    {
        numConnectorMatchedTokens = mKvCacheConnectorManager->getNumNewMatchedTokens(llmRequest, prepopulatedPromptLen);
        llmRequest.setNumConnectorMatchedTokens(numConnectorMatchedTokens);
```
Why is this needed? The kv connector should already have this knowledge from when get_num_new_matched_tokens was called.
The flexKV adapter wants to access this data on the Python side, as shown here.
However, TensorRT-LLM seems to retrieve it only in the C++ code, and the data is not passed back to Python. Therefore, I added this interface.
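For context, once the property is exposed through the bindings, the Python-side adapter can read it directly. A minimal sketch (the surrounding adapter code and the `stats` bookkeeping are illustrative assumptions, not part of this PR):

```python
# Sketch: consuming the newly exposed read-only property from Python.
# `request` is assumed to be a bound LlmRequest instance; when and where
# this is called inside the FlexKV adapter is illustrative only.
def record_connector_hit(request, stats):
    matched = request.num_connector_matched_tokens
    stats["connector_matched_tokens"] = stats.get("connector_matched_tokens", 0) + matched
    return matched
```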
```cpp
            [](auto reason) { return reason == executor::FinishReason::kLENGTH; });
    }

    [[nodiscard]] bool isFinishedNormal() const noexcept
```
Where is this used? Is this field (via the bindings) only accessed inside the flexKV implementation?
Yes, currently it’s only accessed inside the flexKV implementation, which can be found here. Is this acceptable?
Is there any way to infer that value from the other fields? We ideally don't include new fields in TRTLLM that are exclusively accessed by a specific kv connector implementation.
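One option, if per-beam finish reasons are already reachable from Python, would be to derive the flag inside the FlexKV adapter rather than adding a dedicated C++ field. The sketch below assumes the reasons arrive as a list of names; how they are obtained from the request is an assumption, not a confirmed API:

```python
# Sketch: deriving "finished normally" from per-beam finish reasons on the
# Python side. The reason names mirror the C++ enum values discussed above;
# the way they are fetched from the request object is an assumption.
NORMAL_FINISH_REASONS = {"END_ID", "STOP_WORDS", "LENGTH"}


def is_finished_normal(finish_reasons):
    """Return True only if every beam finished with a normal reason."""
    return bool(finish_reasons) and all(
        reason in NORMAL_FINISH_REASONS for reason in finish_reasons)
```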
```python
        self.kv_connector_manager.worker.start_load_kv(
            torch.cuda.current_stream())

    def _kv_connector_refresh_unfinished_tasks(self):
```
Why is this needed? The py executor continuously polls get_finished even when there are no scheduled requests. That should serve as the point where you refresh unfinished tasks.
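Concretely, the pending FlexKV put work could be drained from the connector worker's own polling hook instead of a new executor-side helper. The sketch below assumes FlexKV-adapter internals (`poll_put_tasks`, `take_finished_saving`, `take_finished_loading`) and an approximate `get_finished` signature; none of these are confirmed APIs:

```python
# Sketch: refreshing unfinished FlexKV put tasks from the worker's regular
# polling hook rather than from py_executor. All names on the FlexKV side
# are hypothetical, and the get_finished signature is approximate.
class FlexKvWorkerSketch:

    def __init__(self, flexkv_adapter):
        self._flexkv = flexkv_adapter

    def get_finished(self, finished_request_ids):
        # Drain pending offload (put) work on every poll, even when no
        # batch is scheduled by the executor.
        self._flexkv.poll_put_tasks()
        finished_saving = self._flexkv.take_finished_saving()
        finished_loading = self._flexkv.take_finished_loading()
        return finished_saving, finished_loading
```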
Sorry for the late reply, and thanks for the review and feedback. @jthomson04 @Shixiaowei02

/bot run

PR_Github #29945 [ run ] triggered by Bot. Commit:

PR_Github #29945 [ run ] completed with state

cefe957 to 4c2247d

/bot run
eopXD
left a comment
Thank you! An e2e test case would be appreciated. Please try to add it in files like l0_h100.yml.
```cpp
            [](auto reason) { return reason == executor::FinishReason::kLENGTH; });
    }

    [[nodiscard]] bool isFinishedNormal() const noexcept
```
What do you mean by "normal" here? Could you modify the function name to make it more expressive?
```python
        if self.kv_connector_manager:
            self.kv_connector_manager.take_scheduled_requests_pending_load(
                scheduled_batch)
            self.kv_connector_manager.handle_metadata()
```
Could you explain why this was removed?
Hi, I found that all tests in

Ideally we should have a test on a specific model (possibly gemma3, llama3?) and a specific platform (h100, a10), where KV cache offload to FlexKV is helpful.
4c2247d to 38ac65c
Hi, I have removed the interface in the test function, so there's no need to test.
This introduces significant breaking changes into the kv connector interface.
```python
        can_queue = self._can_queue(scheduled_batch)

        if self.kv_connector_manager:
```
Why are we moving handle_metadata around? This will break all other implementations of the kv connector.
> Why are we moving handle_metadata around? This will break all other implementations of the kv connector.
Can the unit tests already added to TRT-LLM detect and ensure that functionality is not broken?
2. support flexkv + cuda graph
Signed-off-by: scutizhang <[email protected]>
38ac65c to 8f2fee0
/bot run

PR_Github #30974 [ run ] triggered by Bot. Commit:

PR_Github #30974 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #31039 [ run ] triggered by Bot. Commit:
> In our case, when integrated with FlexKV, we can achieve the following improvement: ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.

Could you provide the setup for your claim of improvement? What is the baseline here?
```python
            block_ids: The KV cache block IDs that were allocated.
        """

    def wait_for_initialization(self):
```
As we have layer_pre_hook and layer_post_hook, can we have a plugin manager initialization hook?
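Such a hook could be as simple as the sketch below; all names (`register_init_hook`, `run_init_hooks`) are assumptions for illustration rather than an existing TensorRT-LLM API:

```python
from typing import Callable, List


class PluginManagerSketch:
    """Sketch: a generic initialization hook on a plugin/connector manager,
    analogous to layer_pre_hook / layer_post_hook. Names are hypothetical."""

    def __init__(self) -> None:
        self._init_hooks: List[Callable[[], None]] = []

    def register_init_hook(self, hook: Callable[[], None]) -> None:
        self._init_hooks.append(hook)

    def run_init_hooks(self) -> None:
        # Called once by the executor after resources are created; a FlexKV
        # connector could block here until its scheduler reports ready.
        for hook in self._init_hooks:
            hook()
```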
```diff
             prev_device_step_time = "N/A"  # Handle first iteration
         else:
-            prev_device_step_time = f"{prev_device_step_time}ms"
+            prev_device_step_time = f"{prev_device_step_time:.3f} ms"
```
I argue that this change is not related to the purpose of this merge request.
```python
                              tp_size=tensorrt_llm.mpi_world_size(),
                              gpus_per_node=tensorrt_llm.default_gpus_per_node(),
                              rank=tensorrt_llm.mpi_rank())
        executor_config.mapping = mapping
```
I think the code change should keep in mind that other types of connector exist too. Overwriting `executor_config.mapping` here may not be the intent of other connectors. Maybe you can guard this exclusively for the FlexKV connector?
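A guard along those lines could look like the sketch below; it assumes a `use_flexkv` flag is visible at this point and that `executor_config` is resolved per the import issue flagged earlier, neither of which is guaranteed by the current diff:

```python
def maybe_propagate_mapping(executor_config, mapping, use_flexkv):
    # Sketch: only the FlexKV path overwrites executor_config.mapping, so
    # other connector implementations keep their current behavior. The
    # names mirror the surrounding diff and are assumptions about wiring.
    if use_flexkv:
        executor_config.mapping = mapping
```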
PR_Github #31039 [ run ] completed with state
Description
FlexKV is a distributed KV store and multi-level cache management system developed by Tencent Cloud's TACO team in collaboration with the community, designed for large-scale LLM inference scenarios. FlexKV leverages multi-level caching to enable inference engines to achieve higher throughput and lower latency.
In our case, when integrated with FlexKV, we can achieve the following improvement: ISL=8K, OSL=100, batch_size=8: TTFT decreases by 16.7%, and QPM increases by 28.9%.
Co-authored-by: peaceforeverCN