
Commit eeddeaa

Merge branch 'main' into user/o-stoner/visual-gen-enable-cachedit
2 parents 81ea0eb + 93d99f1 commit eeddeaa

598 files changed: +29097 −10266 lines changed


.github/workflows/blossom-ci.yml

Lines changed: 4 additions & 0 deletions
@@ -191,6 +191,7 @@ jobs:
   "litaotju",
   "liyuhannnnn",
   "lkomali",
+  "longcheng-nv",
   "longlee0622",
   "lowsfer",
   "lucaslie",
@@ -293,6 +294,7 @@ jobs:
   "tcherckez-nvidia",
   "thorjohnsen",
   "tianyuxbear",
+  "tianyuz-nv",
   "tiffany940107",
   "tijyojwad",
   "timlee0212",
@@ -332,11 +334,13 @@ jobs:
   "xueweilnvidia",
   "xupinjie",
   "xuwchen",
+  "xwang233",
   "xxi-nv",
   "yali-arch",
   "yechank-nvidia",
   "yibinl-nvidia",
   "yifeizhang-c",
+  "YihuiLu512",
   "yihwang-nv",
   "yijingl-nvidia",
   "yilin-void",

.gitignore

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@ tensorrt_llm/scripts
 *docs/source/_cpp_gen*
 docs/source/**/*.rst
 !docs/source/examples/index.rst
-!docs/source/deployment-guide/config_table.rst
 !docs/source/_includes/note_sections.rst
 *.swp
 

CODING_GUIDELINES.md

Lines changed: 145 additions & 30 deletions
@@ -345,33 +345,22 @@ char const * const errStr = getErrorStr(status);
 ----
 
 ## Python Coding Guidelines
+Code should adhere to [PEP 8](https://peps.python.org/pep-0008/#fn-hi), unless otherwise noted.
+
 #### Python Standard
 1. The code developed for TensorRT-LLM should conform to Python 3.8+.
 
-#### Indentation
-1. Indent code with 4 spaces. Do not use tabs.
-
-#### Imports
-1. Always maintain the namespace when importing, even if only one class or function from a module is used.
-
-For example instead of:
+#### Formatting
 
-```python
-from package.subpackage.foo import SomeClass
-SomeClass()
-```
-or
-```python
-import package
-package.subpackage.foo.SomeClass()
-```
+1. Indent code with 4 spaces. Do not use tabs.
+2. Code formatting is largely handled by the automatic tooling. Do not override it unless it substantially improves readability.
+3. Note we have "legacy" files and "new" files that are formatted by different toolchains, see <pyproject.toml>. This results in somewhat different formatting between the two classes of files. Most notably legacy files are 80 characters wide while new files are 100.
 
-Do:
 
-```python
-from package.subpackage import foo
-foo.SomeClass()
-```
+#### Imports
+1. The linter will have opinions on import ordering. Please follow them.
+2. Do not use wildcard imports.
+3. Despite the prohibition on wildcard imports, keep `__all__` updated to keep the public interface clearly documented.
 
 #### Naming
 
@@ -385,26 +374,29 @@ foo.SomeClass()
 3. Functions and Methods
    - snake_case: `def my_awesome_function():`
 
-4. Local Variables
+4. Local Variables or Mutable Global Variables
    - snake_case: `my_variable = ...`
-   - prefix `k` for variable names that start with a number: `k_99th_percentile = ...`
+   - Single-letter variables may also be uppercase, e.g. `N`, `T`.
+   - Variables should not start with a number, but if you must, prefix with `k`, e.g. `k_99th_percentile = ...`
 
-5. Global Variables
-   - upper snake_case and prefix `G`: `G_MY_GLOBAL = ...`
+5. Constants (any scope)
+   - UPPER\_SNAKE\_CASE: `MY_CONSTANT = ...`
 
-6. Constants
-   - upper snake_case: `MY_CONSTANT = ...`
+Variables and functions not part of a class’s or module’s public interface should be prefixed with an underscore. Double underscores are permitted only if necessary to avoid name conflicts with inherited classes, and even then you should pursue alternatives.
 
 ##### Identifier Guidelines
 1. Avoid shadowing variables declared in an outer scope.
-2. Initialize all externally visible memberes of a class in the constructor.
+2. Initialize all externally visible members of a class in the constructor.
+3. For variables referencing “container” type objects that could live explicitly on the host or a GPU, e.g. referencing a Tensor, consider appending `_host` or `_device`/`_cuda` suffixes if the location is ambiguous. Particularly if copies of the data exist in both locations.
 
 #### Comments
 
 1. For interfaces that may be used outside a file, prefer docstrings over comments.
 2. Comments should be reserved for code within a function, or interfaces that are local to a file.
+3. Avoid overcommenting. Reserve comments for things that need explaining, or breaking up long sections of code into functional parts. But in that case, consider helper functions.
+4. For arguments to functions in the public interface to a file, documentation of Tensor-like arguments should include the expected dimensions, e.g. `[batch, seq_len, hdim]`, and the allowed dtype options if dtype is constrained.
 
-### Pydantic Guidelines
+#### Pydantic Guidelines
 
 When defining any user-facing configuration classes (particularly `LlmArgs` or any class used in its fields), **always** use Pydantic classes rather than dataclasses or vanilla classes.
 
@@ -445,7 +437,7 @@ When defining any user-facing configuration classes (particularly `LlmArgs` or a
 ##### Classes and Functions
 Use the [Google style](https://google.github.io/styleguide/pyguide.html), which can be parsed by Sphinx.
 
-##### Attributes and Variables 
+##### Attributes and Variables
 Attributes and variables can be documented inline. Attribute docstrings will be rendered under the docstring for the class. For example:
 ```python
 class MyClass:
@@ -460,6 +452,9 @@ y = 2
 """<type>: Description of 'y'"""
 ```
 
+However, attribute docstrings are relatively rare and not expected. Externally called functions should have docstrings, and their arguments should be documented. Class initializer arguments especially should be documented.
+
+
 #### Avoid Reflection
 Avoid using reflection when functionality can be easily achieved without reflection.
 
@@ -524,6 +519,126 @@ else:
     f.read()
 ```
 
+Except in exceptional circumstances, use the built-in exception types. For which type to use when, see [https://docs.python.org/3/library/exceptions.html](https://docs.python.org/3/library/exceptions.html). Use exceptions for error handling, not return values. And despite the example above, prefer isinstance() to duck typing where possible.
+
+#### Static Typing
+
+1. Static type checking at pre-commit time is opt-in by submodule PICs. This is highly recommended because static type checking eliminates an entire class of bugs and makes your code more readable and maintainable overall.
+2. The presubmit system currently uses mypy. However, many developers use pyright variants in their editors, so the code also has some `#pyright:` annotations. As we don’t currently enforce pyright, maintaining these is best effort. But if you notice they are broken, please fix them.
+3. Do not use `typing.Any` if you can avoid it. Similarly, avoid bypassing the type checker with `# type: ignore` annotations.
+4. Always annotate functions. Make the return type `None` if the function does not return anything (if you leave it empty, the type checker will infer the return type as `Any`).
+5. Annotate class members and other variables when necessary. Always annotate `dataclass` and `NamedTuple` members.
+
+```py
+class Foo:
+    def __init__(self, x: int) -> None:
+        self.x = x  # inferred as int, no extra annotation required
+        self.y: Optional[int] = None  # annotation required to prevent NoneType from being inferred
+```
+
+6. Prefer using the built-in types `list`, `dict`, and `tuple` to the legacy `typing.List`, `typing.Dict`, and `typing.Tuple`. Similarly, use the `|` syntax instead of `typing.Union`.
+
+```py
+# Instead of
+def foo(x: List[int], y: Union[int, float]) -> None:
+    pass
+
+# Do:
+def foo(x: list[int], y: int | float) -> None:
+    pass
+```
+
+7. Prefer specifying argument types in `Callable`s.
+
+```py
+# Type checks, but not the best style
+def foo(c: Callable[..., int]) -> None:
+    c(42)
+
+# Best practice.
+def foo(c: Callable[[int], int]) -> None:
+    c(42)
+```
+
+8. Don’t annotate variables where it is obvious/not necessary.
+
+```py
+x: int = 42  # Not required
+```
+
+9. Prefer `Literal` to `str` when a fixed set of values is expected.
+
+```py
+# Works:
+def f(backend: str = "pytorch") -> None: pass
+
+# But this is preferred:
+def f(backend: Literal["pytorch", "tensorrt"] = "pytorch") -> None: pass
+```
+
+10. Use `@overload` when a return type depends on an input type. If the return type can be expressed using the input type, you can alternatively use a `TypeVar`.
+
+```py
+@overload
+def foo(a: str) -> int:
+    pass
+
+@overload
+def foo(a: float) -> float:
+    pass
+
+def foo(a: str | float) -> int | float:
+    if isinstance(a, str):
+        return 42
+    return 42.0
+
+def bar(a: float) -> None: pass
+
+bar(foo(1.0))  # This will type check thanks to @overload
+
+# In this example, the return type can be expressed as
+T = TypeVar("T")
+def baz(x: T) -> dict[str, T]:
+    return {"key": x}
+```
+
+11. Use a bounded TypeVar only when the type parameter appears in both input and return positions to preserve specific type information; if it appears only in the parameters, use the bound type directly.
+
+```py
+class Foo:
+    def f(self) -> None: pass
+
+class Bar(Foo): pass
+
+# Instead of:
+# T = TypeVar("T", bound=Foo)
+# def func(x: T) -> None:
+#     x.f()
+
+# We can just do:
+def func(x: Foo) -> None:
+    x.f()
+
+# Here, using a bound type var is actually useful. We prevent
+# func2 from losing type information.
+# def func2(x: Foo) -> Foo:
+#     return x
+# x = func2(Bar())  # Return type is Foo
+
+T = TypeVar("T", bound=Foo)
+def func2(x: T) -> T:
+    return x
+x = func2(Bar())  # Return type is Bar
+```
+
+12. Use `typing.Protocol` for duck typing. Prefer it when
+    * You need an interface that third-party or unrelated classes can satisfy without inheriting from a base class.
+    * You want to type-check that an object has specific methods/attributes without coupling to a class hierarchy.
+
+Do not use Protocol when a shared base class or ABC already exists and implementations naturally inherit from it — use the ABC directly. Also do not use it when you only need a union of concrete types — use Union or a type alias instead.
+
+Note that TypeVars can also be bound to `Protocol`s. Use this feature to specify the expected interface for an argument to a generic function if duck typing is desired.
+
 ## Documentation Guidelines
 
 #### CLI Options in Documentation
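The new Static Typing guidance (item 12) mentions binding a TypeVar to a `Protocol` but carries no snippet of its own. A minimal sketch of what that looks like — the `SupportsArea`/`largest`/`Square` names are illustrative, not from the repository:

```python
from typing import Protocol, TypeVar


class SupportsArea(Protocol):
    """Any object with an `area() -> float` method satisfies this interface."""

    def area(self) -> float: ...


# Binding the TypeVar to the Protocol documents the expected duck-typed
# interface while preserving the concrete element type in the return.
T = TypeVar("T", bound=SupportsArea)


def largest(shapes: list[T]) -> T:
    # Works for any unrelated classes that happen to implement area().
    return max(shapes, key=lambda s: s.area())


class Square:
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side * self.side
```

With this binding, `largest([Square(1.0), Square(3.0)])` type-checks as returning `Square` rather than the bare protocol type, which is the information a plainly typed parameter would lose.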

cpp/CMakeLists.txt

Lines changed: 1 addition & 2 deletions
@@ -166,9 +166,8 @@ setup_cuda_architectures()
 set(CUDA_DRV_LIB CUDA::cuda_driver)
 set(CUDA_NVML_LIB CUDA::nvml)
 set(CUDA_RT_LIB CUDA::cudart_static)
-set(NVPTX_LIB CUDA::nvptxcompiler_static)
 
-set(CUDA_TOOLKIT_COMPONENTS cudart_static cuda_driver nvml nvptxcompiler_static)
+set(CUDA_TOOLKIT_COMPONENTS cudart_static cuda_driver nvml)
 
 if(CUBLAS_DYNAMIC_LINKING)
   set(CUBLAS_LIB CUDA::cublas)

cpp/cmake/modules/cuda_configuration.cmake

Lines changed: 17 additions & 16 deletions
@@ -1,5 +1,5 @@
 #
-# SPDX-FileCopyrightText: Copyright (c) 1993-2022 NVIDIA CORPORATION &
+# SPDX-FileCopyrightText: Copyright (c) 1993-2026 NVIDIA CORPORATION &
 # AFFILIATES. All rights reserved. SPDX-License-Identifier: Apache-2.0
 #
 # Licensed under the Apache License, Version 2.0 (the "License"); you may not
@@ -363,21 +363,6 @@ function(setup_cuda_architectures)
       ${CMAKE_CUDA_ARCHITECTURES_ORIG}
       PARENT_SCOPE)
 
-  set(ARCHITECTURES_WITH_KERNELS
-      80
-      86
-      89
-      90
-      100
-      103
-      120)
-  foreach(CUDA_ARCH IN LISTS ARCHITECTURES_WITH_KERNELS)
-    if(NOT ${CUDA_ARCH} IN_LIST CMAKE_CUDA_ARCHITECTURES_ORIG)
-      add_definitions("-DEXCLUDE_SM_${CUDA_ARCH}")
-      message(STATUS "Excluding SM ${CUDA_ARCH}")
-    endif()
-  endforeach()
-
   # -a suffix supported from Hopper (90)
   set(CMAKE_CUDA_MIN_ARCHITECTURE_HAS_ACCEL 90)
   set(CMAKE_CUDA_MIN_ARCHITECTURE_HAS_ACCEL
@@ -452,6 +437,22 @@ function(setup_cuda_architectures)
     endif()
   endforeach()
 
+  set(ARCHITECTURES_WITH_KERNELS
+      80
+      86
+      89
+      90
+      100
+      103
+      120)
+  foreach(CUDA_ARCH IN LISTS ARCHITECTURES_WITH_KERNELS)
+    if(NOT ${CUDA_ARCH} IN_LIST CMAKE_CUDA_ARCHITECTURES_ORIG
+       AND NOT ${CUDA_ARCH} IN_LIST CMAKE_CUDA_ARCHITECTURES_NORMALIZED_LIST)
+      add_definitions("-DEXCLUDE_SM_${CUDA_ARCH}")
+      message(STATUS "Excluding SM ${CUDA_ARCH}")
+    endif()
+  endforeach()
+
   # Apply suffixes based on architecture capabilities
   set(CMAKE_CUDA_ARCHITECTURES_NORMALIZED)
   set(CMAKE_CUDA_ARCHITECTURES_FAMILIES)
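The CMake change above moves the `EXCLUDE_SM_*` pass after architecture normalization and widens the membership test to consult both lists. A rough Python sketch of the new exclusion predicate — the function itself is illustrative, only the list of architectures and the two-list check come from the diff:

```python
# SM versions that have dedicated kernels, per ARCHITECTURES_WITH_KERNELS.
ARCHITECTURES_WITH_KERNELS = [80, 86, 89, 90, 100, 103, 120]


def excluded_sms(cuda_architectures_orig: list[int],
                 cuda_architectures_normalized: list[int]) -> list[int]:
    """Return the SM versions whose kernels get compiled out (EXCLUDE_SM_*).

    An architecture is now excluded only if it appears in neither the
    originally requested list nor the normalized list; previously only
    the original list was consulted.
    """
    return [
        arch for arch in ARCHITECTURES_WITH_KERNELS
        if arch not in cuda_architectures_orig
        and arch not in cuda_architectures_normalized
    ]
```

The practical effect is that an architecture present only in the normalized list (e.g. produced from a family or `-a` spelling of the request) no longer gets its kernels excluded.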

cpp/include/tensorrt_llm/batch_manager/llmRequest.h

Lines changed: 27 additions & 0 deletions
@@ -866,6 +866,7 @@ class GenericLlmRequest
         mPrepopulatedPromptLenDraft = 0;
         mContextChunkSizeTarget = mPromptLen;
         mContextChunkSizeDraft = mPromptLen;
+        mEstimatedReusableTokens = 0;
         mSeqSlot.reset();
     }
 
@@ -1135,6 +1136,23 @@ class GenericLlmRequest
         }
     }
 
+    /// @brief Get the estimated number of reusable tokens from the KV cache.
+    /// @details Set by the capacity scheduler so the micro batch scheduler can
+    ///          account for cached tokens in its token budget. For subsequent
+    ///          context chunks, this returns 0 because contextRemainingLength
+    ///          already reflects the advancement from setPrepopulatedPromptLen.
+    [[nodiscard]] SizeType32 getEstimatedReusableTokens() const noexcept
+    {
+        return mEstimatedReusableTokens;
+    }
+
+    /// @brief Set the estimated number of reusable tokens. Const because
+    ///        the field is mutable (it's a scheduling cache, not request state).
+    void setEstimatedReusableTokens(SizeType32 estimatedReusableTokens) const noexcept
+    {
+        mEstimatedReusableTokens = estimatedReusableTokens;
+    }
+
     void setDraftTokens(std::shared_ptr<VecTokens> const& draftTokens)
     {
         mDraftTokens = draftTokens;
@@ -1964,6 +1982,15 @@ class GenericLlmRequest
     SizeType32 mPrepopulatedPromptLenTarget{0};
     SizeType32 mPrepopulatedPromptLenDraft{0};
 
+    // Estimated number of reusable tokens from the KV cache radix tree.
+    // Set by the capacity scheduler (during getNeededBlocksOneStep /
+    // getRemainingBlocksToCompletion) so that the micro batch scheduler
+    // can account for cached tokens when computing the token budget.
+    // Marked mutable because it is a cache/estimate set during const
+    // capacity-scheduler queries. Reset to 0 after addSequence sets
+    // the authoritative mPrepopulatedPromptLen and advances context position.
+    mutable SizeType32 mEstimatedReusableTokens{0};
+
     SizeType32 mMaxSentTokenLen;
 
     std::optional<TensorPtr> mEmbeddingBias{std::nullopt};
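The lifecycle described in the doc comments above — an estimate written during const capacity-scheduler queries, consumed by the micro batch scheduler, then reset once the authoritative prepopulated length is known — can be paraphrased in a toy Python model. All class and method names here are illustrative, not the actual TensorRT-LLM scheduler API:

```python
class Request:
    """Toy model of the mEstimatedReusableTokens lifecycle."""

    def __init__(self, prompt_len: int) -> None:
        self.prompt_len = prompt_len
        self.prepopulated_prompt_len = 0    # authoritative, set on addSequence
        self.estimated_reusable_tokens = 0  # scheduler's cached estimate

    def capacity_scheduler_estimate(self, kv_cache_matches: int) -> None:
        # Corresponds to setEstimatedReusableTokens() during the
        # (logically const) capacity-scheduler query.
        self.estimated_reusable_tokens = kv_cache_matches

    def micro_batch_token_budget(self) -> int:
        # The micro batch scheduler discounts tokens it expects to
        # reuse from the KV cache when computing the token budget.
        return self.prompt_len - self.estimated_reusable_tokens

    def add_sequence(self, actual_reused: int) -> None:
        # Once the authoritative value is known, the estimate is reset,
        # mirroring mEstimatedReusableTokens = 0 in the C++ change.
        self.prepopulated_prompt_len = actual_reused
        self.estimated_reusable_tokens = 0
```

This mirrors why the field is `mutable` in the header: it is a scheduling hint cached on the request, not part of the request's logical state.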

cpp/include/tensorrt_llm/executor/executor.h

Lines changed: 1 addition & 1 deletion
@@ -1533,7 +1533,7 @@ class ExecutorConfig
     static constexpr SizeType32 kDefaultRequestStatsMaxIterations = 0;
 
     explicit ExecutorConfig(SizeType32 maxBeamWidth = 1, SchedulerConfig schedulerConfig = SchedulerConfig(),
-        KvCacheConfig kvCacheConfig = KvCacheConfig(), bool enableChunkedContext = true, bool normalizeLogProbs = true,
+        KvCacheConfig kvCacheConfig = KvCacheConfig(), bool enableChunkedContext = true, bool normalizeLogProbs = false,
         SizeType32 iterStatsMaxIterations = kDefaultIterStatsMaxIterations,
         SizeType32 requestStatsMaxIterations = kDefaultRequestStatsMaxIterations,
         BatchingType batchingType = BatchingType::kINFLIGHT, std::optional<SizeType32> maxBatchSize = std::nullopt,

cpp/kernels/fmha_v2/setup.py

Lines changed: 2 additions & 2 deletions
@@ -6792,11 +6792,11 @@ def enumerate_kernels():
                                 head_size_v=512)
     enumerate_qmma_kernels(specs, sm=120)
     enumerate_qmma_flash_kernels(specs, sm=120, dtype='e4m3_fp32')
-    # Add bf16 output MLA kernels.
+    # Add bf16 output kernels for e4m3 input (MLA and standard head sizes).
     enumerate_qmma_flash_kernels(specs,
                                  sm=120,
                                  dtype='e4m3_fp32',
-                                 head_sizes=[192, 576],
+                                 head_sizes=[128, 192, 576],
                                  output_dtype="bf16")
 
 if 'ENABLE_HMMA_FP32' in os.environ:

0 commit comments