
Commit eeddeaa

Merge branch 'main' into user/o-stoner/visual-gen-enable-cachedit
2 parents 81ea0eb + 93d99f1 commit eeddeaa

598 files changed: +29097 −10266 lines changed


.github/workflows/blossom-ci.yml

Lines changed: 4 additions & 0 deletions
@@ -191,6 +191,7 @@ jobs:
   "litaotju",
   "liyuhannnnn",
   "lkomali",
+  "longcheng-nv",
   "longlee0622",
   "lowsfer",
   "lucaslie",
@@ -293,6 +294,7 @@ jobs:
   "tcherckez-nvidia",
   "thorjohnsen",
   "tianyuxbear",
+  "tianyuz-nv",
   "tiffany940107",
   "tijyojwad",
   "timlee0212",
@@ -332,11 +334,13 @@ jobs:
   "xueweilnvidia",
   "xupinjie",
   "xuwchen",
+  "xwang233",
   "xxi-nv",
   "yali-arch",
   "yechank-nvidia",
   "yibinl-nvidia",
   "yifeizhang-c",
+  "YihuiLu512",
   "yihwang-nv",
   "yijingl-nvidia",
   "yilin-void",

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@ tensorrt_llm/scripts
 *docs/source/_cpp_gen*
 docs/source/**/*.rst
 !docs/source/examples/index.rst
-!docs/source/deployment-guide/config_table.rst
 !docs/source/_includes/note_sections.rst
 *.swp
 

CODING_GUIDELINES.md

Lines changed: 145 additions & 30 deletions
@@ -345,33 +345,22 @@ char const * const errStr = getErrorStr(status);
 ----
 
 ## Python Coding Guidelines
+Code should adhere to [PEP 8](https://peps.python.org/pep-0008/#fn-hi), unless otherwise noted.
+
 #### Python Standard
 1. The code developed for TensorRT-LLM should conform to Python 3.8+.
 
-#### Indentation
-1. Indent code with 4 spaces. Do not use tabs.
-
-#### Imports
-1. Always maintain the namespace when importing, even if only one class or function from a module is used.
-
-For example instead of:
+#### Formatting
 
-```python
-from package.subpackage.foo import SomeClass
-SomeClass()
-```
-or
-```python
-import package
-package.subpackage.foo.SomeClass()
-```
+1. Indent code with 4 spaces. Do not use tabs.
+2. Code formatting is largely handled by the automatic tooling. Do not override it unless it substantially improves readability.
+3. Note we have "legacy" files and "new" files that are formatted by different toolchains, see <pyproject.toml>. This results in somewhat different formatting between the two classes of files. Most notably legacy files are 80 characters wide while new files are 100.
 
-Do:
 
-```python
-from package.subpackage import foo
-foo.SomeClass()
-```
+#### Imports
+1. The linter will have opinions on import ordering. Please follow them.
+2. Do not use wildcard imports.
+3. Despite the prohibition on wildcard imports, keep `__all__` updated to keep the public interface clearly documented.
 
 #### Naming
 
@@ -385,26 +374,29 @@ foo.SomeClass()
 3. Functions and Methods
    - snake_case: `def my_awesome_function():`
 
-4. Local Variables
+4. Local Variables or Mutable Global Variables
    - snake_case: `my_variable = ...`
-   - prefix `k` for variable names that start with a number: `k_99th_percentile = ...`
+   - Single-letter variables may also be uppercase, e.g. `N`, `T`.
+   - Variables should not start with a number, but if you must, prefix with `k`, e.g. `k_99th_percentile = ...`
 
-5. Global Variables
-   - upper snake_case and prefix `G`: `G_MY_GLOBAL = ...`
+5. Constants (any scope)
+   - UPPER\_SNAKE\_CASE: `MY_CONSTANT = ...`
 
-6. Constants
-   - upper snake_case: `MY_CONSTANT = ...`
+Variables and functions not part of a class’s or module’s public interface should be prefixed with an underscore. Double underscores are permitted only if necessary to avoid name conflicts with inherited classes, and even then you should pursue alternatives.
 
 ##### Identifier Guidelines
 1. Avoid shadowing variables declared in an outer scope.
-2. Initialize all externally visible memberes of a class in the constructor.
+2. Initialize all externally visible members of a class in the constructor.
+3. For variables referencing “container” type objects that could live explicitly on the host or a GPU, e.g. referencing a Tensor, consider appending `_host` or `_device`/`_cuda` suffixes if the location is ambiguous. Particularly if copies of the data exist in both locations.
 
 #### Comments
 
 1. For interfaces that may be used outside a file, prefer docstrings over comments.
 2. Comments should be reserved for code within a function, or interfaces that are local to a file.
+3. Avoid overcommenting. Reserve comments for things that need explaining, or breaking up long sections of code into functional parts. But in that case, consider helper functions.
+4. For arguments to functions in the public interface to a file, documentation of Tensor-like arguments should include the expected dimensions, e.g. `[batch, seq_len, hdim]`, and the allowed dtype options if dtype is constrained.
 
-### Pydantic Guidelines
+#### Pydantic Guidelines
 
 When defining any user-facing configuration classes (particularly `LlmArgs` or any class used in its fields), **always** use Pydantic classes rather than dataclasses or vanilla classes.
 
@@ -445,7 +437,7 @@ When defining any user-facing configuration classes (particularly `LlmArgs` or a
 ##### Classes and Functions
 Use the [Google style](https://google.github.io/styleguide/pyguide.html), which can be parsed by Sphinx.
 
-##### Attributes and Variables 
+##### Attributes and Variables
 Attributes and variables can be documented inline. Attribute docstrings will be rendered under the docstring for the class. For example:
 ```python
 class MyClass:
@@ -460,6 +452,9 @@ y = 2
 """<type>: Description of 'y'"""
 ```
 
+However, attribute docstrings are relatively rare and not expected. Externally called functions should have docstrings, and their arguments should be documented. Class initializer arguments especially should be documented.
+
+
 #### Avoid Reflection
 Avoid using reflection when functionality can be easily achieved without reflection.
 
@@ -524,6 +519,126 @@ else:
     f.read()
 ```
 
+Except in exceptional circumstances, use the built-in exception types. For which type to use when, see [https://docs.python.org/3/library/exceptions.html](https://docs.python.org/3/library/exceptions.html). Use exceptions for error handling, not return values. And despite the example above, prefer isinstance() to duck typing where possible.
+
+#### Static Typing
+
+1. Static type checking at pre-commit time is opt-in by submodule PICs. This is highly recommended because static type checking eliminates an entire class of bugs and makes your code more readable and maintainable overall.
+2. The presubmit system currently uses mypy. However, many developers use pyright variants in their editors, so the code also has some `#pyright:` annotations. As we don’t currently enforce pyright, maintaining these is best effort. But if you notice they are broken, please fix them.
+3. Do not use `typing.Any` if you can avoid it. Similarly, avoid bypassing the type checker with `# type: ignore` annotations.
+4. Always annotate functions. Make the return type `None` if the function does not return anything (if you leave it empty, the type checker will infer the return type as `Any`).
+5. Annotate class members and other variables when necessary. Always annotate `dataclass` and `NamedTuple` members.
+
+```py
+class Foo:
+    def __init__(self, x: int) -> None:
+        self.x = x  # inferred as int, no extra annotation required
+        self.y: Optional[int] = None  # annotation required to prevent NoneType from being inferred
+```
+
+6. Prefer using the built-in types `list`, `dict`, and `tuple` to the legacy `typing.List`, `typing.Dict`, and `typing.Tuple`. Similarly, use the `|` syntax instead of `typing.Union`.
+
+```py
+# Instead of
+def foo(x: List[int], y: Union[int, float]) -> None:
+    pass
+
+# Do:
+def foo(x: list[int], y: int | float) -> None:
+    pass
+```
+
+7. Prefer specifying argument types in `Callable`s.
+
+```py
+# Type checks, but not the best style
+def foo(c: Callable[..., int]) -> None:
+    c(42)
+
+# Best practice.
+def foo(c: Callable[[int], int]) -> None:
+    c(42)
+```
+
+8. Don’t annotate variables where it is obvious/not necessary.
+
+```py
+x: int = 42  # Not required
+```
+
+9. Prefer `Literal` to `str` when a fixed set of values is expected.
+
+```py
+# Works:
+def f(backend: str = "pytorch") -> None: pass
+
+# But this is preferred:
+def f(backend: Literal["pytorch", "tensorrt"] = "pytorch") -> None: pass
+```
+
+10. Use `@overload` when a return type depends on an input type. If the return type can be expressed using the input type, you can alternatively use a `TypeVar`.
+
+```py
+@overload
+def foo(a: str) -> int:
+    pass
+
+@overload
+def foo(a: float) -> float:
+    pass
+
+def foo(a: str | float) -> int | float:
+    if isinstance(a, str):
+        return 42
+    return 42.0
+
+def bar(a: float) -> None: pass
+
+bar(foo(1.0))  # This will type check thanks to @overload
+
+# In this example, the return type can be expressed as
+T = TypeVar("T")
+def baz(x: T) -> dict[str, T]:
+    return {"key": x}
+```
+
+11. Use a bounded TypeVar only when the type parameter appears in both input and return positions to preserve specific type information; if it appears only in the parameters, use the bound type directly.
+
+```py
+class Foo:
+    def f(self) -> None: pass
+
+class Bar(Foo): pass
+
+# Instead of:
+# T = TypeVar("T", bound=Foo)
+# def func(x: T) -> None:
+#     x.f()
+
+# We can just do:
+def func(x: Foo) -> None:
+    x.f()
+
+# Here, using a bound type var is actually useful. We prevent
+# func2 from losing type information.
+# def func2(x: Foo) -> Foo:
+#     return x
+# x = func2(Bar())  # Return type is Foo
+
+T = TypeVar("T", bound=Foo)
+def func2(x: T) -> T:
+    return x
+x = func2(Bar())  # Return type is Bar
+```
+
+12. Use `typing.Protocol` for duck typing. Prefer it when
+    * You need an interface that third-party or unrelated classes can satisfy without inheriting from a base class.
+    * You want to type-check that an object has specific methods/attributes without coupling to a class hierarchy.
+
+Do not use Protocol when a shared base class or ABC already exists and implementations naturally inherit from it — use the ABC directly. Also do not use it when you only need a union of concrete types — use Union or a type alias instead.
+
+Note that TypeVars can also be bound to `Protocol`s. Use this feature to specify the expected interface for an argument to a generic function if duck typing is desired.
+
 ## Documentation Guidelines
 
 #### CLI Options in Documentation
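The new Static Typing guidance (item 12) mentions binding a TypeVar to a `Protocol` but carries no snippet of its own. A minimal sketch of what that looks like — the `SupportsArea`/`largest`/`Square` names are illustrative, not from the repository:

```python
from typing import Protocol, TypeVar


class SupportsArea(Protocol):
    """Any object with an `area() -> float` method satisfies this interface."""

    def area(self) -> float: ...


# Binding the TypeVar to the Protocol documents the expected duck-typed
# interface while preserving the concrete element type in the return.
T = TypeVar("T", bound=SupportsArea)


def largest(shapes: list[T]) -> T:
    # Works for any unrelated classes that happen to implement area().
    return max(shapes, key=lambda s: s.area())


class Square:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side * self.side
```

With this binding, `largest([Square(1.0), Square(3.0)])` type-checks as returning `Square` rather than the bare protocol type, which is the information a plainly typed parameter would lose.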

cpp/CMakeLists.txt

Lines changed: 1 addition & 2 deletions
@@ -166,9 +166,8 @@ setup_cuda_architectures()
 set(CUDA_DRV_LIB CUDA::cuda_driver)
 set(CUDA_NVML_LIB CUDA::nvml)
 set(CUDA_RT_LIB CUDA::cudart_static)
-set(NVPTX_LIB CUDA::nvptxcompiler_static)
 
-set(CUDA_TOOLKIT_COMPONENTS cudart_static cuda_driver nvml nvptxcompiler_static)
+set(CUDA_TOOLKIT_COMPONENTS cudart_static cuda_driver nvml)
 
 if(CUBLAS_DYNAMIC_LINKING)
   set(CUBLAS_LIB CUDA::cublas)

cpp/cmake/modules/cuda_configuration.cmake

Lines changed: 17 additions & 16 deletions
@@ -1,5 +1,5 @@
 #
-# SPDX-FileCopyrightText: Copyright (c) 1993-2022 NVIDIA CORPORATION &
+# SPDX-FileCopyrightText: Copyright (c) 1993-2026 NVIDIA CORPORATION &
 # AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may not
@@ -363,21 +363,6 @@ function(setup_cuda_architectures)
       ${CMAKE_CUDA_ARCHITECTURES_ORIG}
       PARENT_SCOPE)
 
-  set(ARCHITECTURES_WITH_KERNELS
-      80
-      86
-      89
-      90
-      100
-      103
-      120)
-  foreach(CUDA_ARCH IN LISTS ARCHITECTURES_WITH_KERNELS)
-    if(NOT ${CUDA_ARCH} IN_LIST CMAKE_CUDA_ARCHITECTURES_ORIG)
-      add_definitions("-DEXCLUDE_SM_${CUDA_ARCH}")
-      message(STATUS "Excluding SM ${CUDA_ARCH}")
-    endif()
-  endforeach()
-
   # -a suffix supported from Hopper (90)
   set(CMAKE_CUDA_MIN_ARCHITECTURE_HAS_ACCEL 90)
   set(CMAKE_CUDA_MIN_ARCHITECTURE_HAS_ACCEL
@@ -452,6 +437,22 @@ function(setup_cuda_architectures)
     endif()
   endforeach()
 
+  set(ARCHITECTURES_WITH_KERNELS
+      80
+      86
+      89
+      90
+      100
+      103
+      120)
+  foreach(CUDA_ARCH IN LISTS ARCHITECTURES_WITH_KERNELS)
+    if(NOT ${CUDA_ARCH} IN_LIST CMAKE_CUDA_ARCHITECTURES_ORIG
+       AND NOT ${CUDA_ARCH} IN_LIST CMAKE_CUDA_ARCHITECTURES_NORMALIZED_LIST)
+      add_definitions("-DEXCLUDE_SM_${CUDA_ARCH}")
+      message(STATUS "Excluding SM ${CUDA_ARCH}")
+    endif()
+  endforeach()
+
   # Apply suffixes based on architecture capabilities
   set(CMAKE_CUDA_ARCHITECTURES_NORMALIZED)
   set(CMAKE_CUDA_ARCHITECTURES_FAMILIES)
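The CMake change above moves the `EXCLUDE_SM_*` pass after architecture normalization and widens the membership test to consult both lists. A rough Python sketch of the new exclusion predicate — the function itself is illustrative, only the list of architectures and the two-list check come from the diff:

```python
# SM versions that have dedicated kernels, per ARCHITECTURES_WITH_KERNELS.
ARCHITECTURES_WITH_KERNELS = [80, 86, 89, 90, 100, 103, 120]


def excluded_sms(cuda_architectures_orig: list[int],
                 cuda_architectures_normalized: list[int]) -> list[int]:
    """Return the SM versions whose kernels get compiled out (EXCLUDE_SM_*).

    An architecture is now excluded only if it appears in neither the
    originally requested list nor the normalized list; previously only
    the original list was consulted.
    """
    return [
        arch for arch in ARCHITECTURES_WITH_KERNELS
        if arch not in cuda_architectures_orig
        and arch not in cuda_architectures_normalized
    ]
```

The practical effect is that an architecture present only in the normalized list (e.g. produced from a family or `-a` spelling of the request) no longer gets its kernels excluded.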

cpp/include/tensorrt_llm/batch_manager/llmRequest.h

Lines changed: 27 additions & 0 deletions
@@ -866,6 +866,7 @@ class GenericLlmRequest
         mPrepopulatedPromptLenDraft = 0;
         mContextChunkSizeTarget = mPromptLen;
         mContextChunkSizeDraft = mPromptLen;
+        mEstimatedReusableTokens = 0;
         mSeqSlot.reset();
     }
 
@@ -1135,6 +1136,23 @@ class GenericLlmRequest
         }
     }
 
+    /// @brief Get the estimated number of reusable tokens from the KV cache.
+    /// @details Set by the capacity scheduler so the micro batch scheduler can
+    ///          account for cached tokens in its token budget. For subsequent
+    ///          context chunks, this returns 0 because contextRemainingLength
+    ///          already reflects the advancement from setPrepopulatedPromptLen.
+    [[nodiscard]] SizeType32 getEstimatedReusableTokens() const noexcept
+    {
+        return mEstimatedReusableTokens;
+    }
+
+    /// @brief Set the estimated number of reusable tokens. Const because
+    ///        the field is mutable (it's a scheduling cache, not request state).
+    void setEstimatedReusableTokens(SizeType32 estimatedReusableTokens) const noexcept
+    {
+        mEstimatedReusableTokens = estimatedReusableTokens;
+    }
+
     void setDraftTokens(std::shared_ptr<VecTokens> const& draftTokens)
     {
         mDraftTokens = draftTokens;
@@ -1964,6 +1982,15 @@ class GenericLlmRequest
     SizeType32 mPrepopulatedPromptLenTarget{0};
     SizeType32 mPrepopulatedPromptLenDraft{0};
 
+    // Estimated number of reusable tokens from the KV cache radix tree.
+    // Set by the capacity scheduler (during getNeededBlocksOneStep /
+    // getRemainingBlocksToCompletion) so that the micro batch scheduler
+    // can account for cached tokens when computing the token budget.
+    // Marked mutable because it is a cache/estimate set during const
+    // capacity-scheduler queries. Reset to 0 after addSequence sets
+    // the authoritative mPrepopulatedPromptLen and advances context position.
+    mutable SizeType32 mEstimatedReusableTokens{0};
+
     SizeType32 mMaxSentTokenLen;
 
     std::optional<TensorPtr> mEmbeddingBias{std::nullopt};
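The lifecycle described in the doc comments above — an estimate written during const capacity-scheduler queries, consumed by the micro batch scheduler, then reset once the authoritative prepopulated length is known — can be paraphrased in a toy Python model. All class and method names here are illustrative, not the actual TensorRT-LLM scheduler API:

```python
class Request:
    """Toy model of the mEstimatedReusableTokens lifecycle."""

    def __init__(self, prompt_len: int) -> None:
        self.prompt_len = prompt_len
        self.prepopulated_prompt_len = 0    # authoritative, set on addSequence
        self.estimated_reusable_tokens = 0  # scheduler's cached estimate

    def capacity_scheduler_estimate(self, kv_cache_matches: int) -> None:
        # Corresponds to setEstimatedReusableTokens() during the
        # (logically const) capacity-scheduler query.
        self.estimated_reusable_tokens = kv_cache_matches

    def micro_batch_token_budget(self) -> int:
        # The micro batch scheduler discounts tokens it expects to
        # reuse from the KV cache when computing the token budget.
        return self.prompt_len - self.estimated_reusable_tokens

    def add_sequence(self, actual_reused: int) -> None:
        # Once the authoritative value is known, the estimate is reset,
        # mirroring mEstimatedReusableTokens = 0 in the C++ change.
        self.prepopulated_prompt_len = actual_reused
        self.estimated_reusable_tokens = 0
```

This mirrors why the field is `mutable` in the header: it is a scheduling hint cached on the request, not part of the request's logical state.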

cpp/include/tensorrt_llm/executor/executor.h

Lines changed: 1 addition & 1 deletion
@@ -1533,7 +1533,7 @@ class ExecutorConfig
     static constexpr SizeType32 kDefaultRequestStatsMaxIterations = 0;
 
     explicit ExecutorConfig(SizeType32 maxBeamWidth = 1, SchedulerConfig schedulerConfig = SchedulerConfig(),
-        KvCacheConfig kvCacheConfig = KvCacheConfig(), bool enableChunkedContext = true, bool normalizeLogProbs = true,
+        KvCacheConfig kvCacheConfig = KvCacheConfig(), bool enableChunkedContext = true, bool normalizeLogProbs = false,
         SizeType32 iterStatsMaxIterations = kDefaultIterStatsMaxIterations,
         SizeType32 requestStatsMaxIterations = kDefaultRequestStatsMaxIterations,
         BatchingType batchingType = BatchingType::kINFLIGHT, std::optional<SizeType32> maxBatchSize = std::nullopt,

cpp/kernels/fmha_v2/setup.py

Lines changed: 2 additions & 2 deletions
@@ -6792,11 +6792,11 @@ def enumerate_kernels():
                                 head_size_v=512)
     enumerate_qmma_kernels(specs, sm=120)
     enumerate_qmma_flash_kernels(specs, sm=120, dtype='e4m3_fp32')
-    # Add bf16 output MLA kernels.
+    # Add bf16 output kernels for e4m3 input (MLA and standard head sizes).
     enumerate_qmma_flash_kernels(specs,
                                  sm=120,
                                  dtype='e4m3_fp32',
-                                 head_sizes=[192, 576],
+                                 head_sizes=[128, 192, 576],
                                  output_dtype="bf16")
 
 if 'ENABLE_HMMA_FP32' in os.environ:

0 commit comments