
Commit 6345074

Authored by dominicshanshan, yunruis, sunnyqgg, brb-nv, and Superjomn
[None][chore] Weekly mass integration of release/1.1 -- rebase (NVIDIA#9522)
Signed-off-by: yunruis <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: qgai <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Yan Chunwei <[email protected]>
Signed-off-by: Junyi Xu <[email protected]>
Signed-off-by: Simeng Liu <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Jin Li <[email protected]>
Signed-off-by: Ivy Zhang <[email protected]>
Signed-off-by: Vincent Zhang <[email protected]>
Signed-off-by: peaceh <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: leslie-fang25 <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: junq <[email protected]>
Co-authored-by: yunruis <[email protected]>
Co-authored-by: sunnyqgg <[email protected]>
Co-authored-by: brb-nv <[email protected]>
Co-authored-by: Yan Chunwei <[email protected]>
Co-authored-by: JunyiXu-nv <[email protected]>
Co-authored-by: Simeng Liu <[email protected]>
Co-authored-by: Guoming Zhang <[email protected]>
Co-authored-by: Jin Li <[email protected]>
Co-authored-by: Ivy Zhang <[email protected]>
Co-authored-by: Vincent Zhang <[email protected]>
Co-authored-by: peaceh-nv <[email protected]>
Co-authored-by: Michal Guzek <[email protected]>
Co-authored-by: Chang Liu <[email protected]>
Co-authored-by: Leslie Fang <[email protected]>
Co-authored-by: Shunkangz <[email protected]>
Co-authored-by: Shunkang <[email protected]>
Co-authored-by: QI JUN <[email protected]>
1 parent ae0124e commit 6345074

37 files changed: +582, -120 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co
 
 TensorRT LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
 
-[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/release/1.1/docs/source/developer-guide/overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
 
 TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.
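
The README paragraph touched by this hunk describes the LLM API in prose; a concrete call sequence makes it easier to picture. The sketch below follows the quick-start guide linked in that paragraph; the model name and sampling values are illustrative placeholders, not part of this commit.

```python
from tensorrt_llm import LLM, SamplingParams

# Any Hugging Face-style checkpoint id or local path works here; TinyLlama is just a small example.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# generate() batches the prompts and runs them through the PyTorch-native runtime.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

The same entry point is what scales to the multi-GPU and multi-node setups the paragraph mentions, for example by passing a tensor-parallel size to the LLM constructor as described in the quick-start guide.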

cpp/kernels/fmha_v2/setup.py

Lines changed: 10 additions & 0 deletions
@@ -6398,6 +6398,16 @@ def enumerate_kernels():
             and kspec.cross_mha == False
             and kspec.flash_attention == True
             and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
+        # Gemma3 VL support.
+        or (kspec.sm == 100
+            and kspec.dtype in ['fp16', 'bf16', 'fp16_fp32', 'e4m3', 'e4m3_fp32']
+            and kspec.head_size == 72
+            and kspec.head_size_v == 0
+            and kspec.sage_block_sizes is None
+            and kspec.version == 2
+            and kspec.cross_mha == False
+            and kspec.flash_attention == True
+            and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
         # Deepseek MLA (generation 576/512 paged)
         or (kspec.sm in [90, 100, 120]
             and kspec.dtype in ['bf16', 'e4m3_fp32']

cpp/tensorrt_llm/common/opUtils.cpp

Lines changed: 94 additions & 12 deletions
@@ -179,16 +179,24 @@ class PerCudaCtxPerThreadSingletonCreator
     PerCudaCtxPerThreadSingletonCreator(CreatorFunc creator, DeleterFunc deleter)
         : mCreator{std::move(creator)}
         , mDeleter{std::move(deleter)}
+        , mObservers{new std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>()}
     {
     }
 
+    ~PerCudaCtxPerThreadSingletonCreator()
+    {
+        std::lock_guard<std::mutex> lk{mMutex};
+        delete mObservers;
+        mObservers = nullptr;
+    }
+
     std::shared_ptr<T> operator()()
     {
         std::lock_guard<std::mutex> lk{mMutex};
         CUcontext ctx{getCurrentCudaCtx()};
         std::thread::id thread = std::this_thread::get_id();
         auto const key = std::make_tuple(ctx, thread);
-        std::shared_ptr<T> result = mObservers[key].lock();
+        std::shared_ptr<T> result = (*mObservers)[key].lock();
         if (result == nullptr)
         {
             TLLM_LOG_TRACE("creating singleton instance for CUDA context %lu and thread %lu", ctx, thread);
@@ -202,6 +210,11 @@ class PerCudaCtxPerThreadSingletonCreator
                 }
                 mDeleter(obj);
 
+                if (mObservers == nullptr)
+                {
+                    return;
+                }
+
                 // Clears observer to avoid growth of mObservers, in case users creates/destroys cuda contexts
                 // frequently.
                 std::shared_ptr<T> observedObjHolder; // Delay destroy to avoid dead lock.
@@ -210,17 +223,18 @@ class PerCudaCtxPerThreadSingletonCreator
                 // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
                 // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
                 // atomic operation, and the observer may be changed to observe another instance.
-                if (mObservers.find(key) == mObservers.end())
+                auto it = mObservers->find(key);
+                if (it == mObservers->end())
                 {
                     return;
                 }
-                observedObjHolder = mObservers.at(key).lock();
+                observedObjHolder = it->second.lock();
                 if (observedObjHolder == nullptr)
                 {
-                    mObservers.erase(key);
+                    mObservers->erase(it);
                 }
             }};
-        mObservers.at(key) = result;
+        (*mObservers)[key] = result;
     }
     else
     {
@@ -235,24 +249,78 @@ class PerCudaCtxPerThreadSingletonCreator
     mutable std::mutex mMutex;
     // CUDA resources are per-context and per-thread.
     using CacheKey = std::tuple<CUcontext, std::thread::id>;
-    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>> mObservers;
+    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>* mObservers;
+};
+
+// Structure to hold memory information
+struct MemoryInfo
+{
+    size_t free_mb;
+    size_t total_mb;
+    float free_percent;
 };
 
+// Helper function to get current memory information
+MemoryInfo getMemoryInfo()
+{
+    size_t free_mem = 0, total_mem = 0;
+    TLLM_CUDA_CHECK(cudaMemGetInfo(&free_mem, &total_mem));
+
+    size_t const free_mb = free_mem / (1024 * 1024);
+    size_t const total_mb = total_mem / (1024 * 1024);
+    float const free_percent = (total_mem > 0) ? (static_cast<float>(free_mem) / total_mem * 100.0f) : 0.0f;
+
+    return {free_mb, total_mb, free_percent};
+}
+
+// Helper function to log current memory usage
+void logMemoryUsage(char const* operation, CUcontext ctx)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_LOG_DEBUG("%s: Context=%p, Free Memory=%zu MB (%.1f%%), Total=%zu MB", operation, ctx, mem.free_mb,
+        mem.free_percent, mem.total_mb);
+}
+
+// Helper function to throw
+void throwCublasErrorWithMemInfo(char const* operation, CUcontext ctx, cublasStatus_t status)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_THROW(
+        "Failed to create %s. "
+        "Status: %d, Context: %p, Free Memory: %zu MB (%.1f%%), Total: %zu MB. "
+        "Consider reducing kv_cache_config.free_gpu_memory_fraction.",
+        operation, status, ctx, mem.free_mb, mem.free_percent, mem.total_mb);
+}
+
 } // namespace
 
 std::shared_ptr<cublasHandle_t> getCublasHandle()
 {
     static PerCudaCtxPerThreadSingletonCreator<cublasHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasHandle_t>(new cublasHandle_t);
-            TLLM_CUDA_CHECK(cublasCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublas handle", ctx);
+
+            auto handle = std::make_unique<cublasHandle_t>();
+            auto status = cublasCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublas handle", ctx, status);
+            }
+
             return handle;
         },
         [](cublasHandle_t* handle)
        {
-            TLLM_CUDA_CHECK(cublasDestroy(*handle));
+            auto status = cublasDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublas handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
         });
     return creator();
 }
@@ -262,14 +330,28 @@ std::shared_ptr<cublasLtHandle_t> getCublasLtHandle()
     static PerCudaCtxPerThreadSingletonCreator<cublasLtHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasLtHandle_t>(new cublasLtHandle_t);
-            TLLM_CUDA_CHECK(cublasLtCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublasLt handle", ctx);
+
+            auto handle = std::make_unique<cublasLtHandle_t>();
+            auto status = cublasLtCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublasLt handle", ctx, status);
+            }
+
             return handle;
        },
        [](cublasLtHandle_t* handle)
        {
-            TLLM_CUDA_CHECK(cublasLtDestroy(*handle));
+            auto status = cublasLtDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublasLt handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
        });
    return creator();
 }
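
The comments inside this hunk carry the reasoning: handles are cached per (CUDA context, thread) under weak references so they die with their last user, and the deleter prunes stale map entries; the commit also turns mObservers into a heap pointer so the deleter can bail out with a nullptr check after the static creator is torn down. As an illustration of the caching pattern only (not the TensorRT-LLM C++ code), here is a Python sketch with invented class and key names:

```python
import threading
import weakref


class Handle:
    """Stand-in for a cublas-style handle object."""

    def __init__(self, name: str):
        self.name = name


class PerContextPerThreadCache:
    """Hand out one shared object per (context, thread) key.

    Entries are held weakly, so a handle is destroyed as soon as every caller
    drops its reference, mirroring the weak_ptr map in opUtils.cpp.
    """

    def __init__(self, creator):
        self._creator = creator
        self._lock = threading.Lock()
        self._observers = weakref.WeakValueDictionary()

    def get(self, context) -> Handle:
        key = (context, threading.get_ident())
        with self._lock:
            obj = self._observers.get(key)
            if obj is None:
                obj = self._creator()
                self._observers[key] = obj
            return obj


cache = PerContextPerThreadCache(lambda: Handle("cublas"))
a = cache.get("ctx0")
b = cache.get("ctx0")
assert a is b  # same context + same thread -> same handle instance
```

In the C++ version the stale-entry cleanup has to be done by the custom deleter under the mutex, which is exactly the code path the nullptr guard above protects during static destruction.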

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp

Lines changed: 3 additions & 3 deletions
@@ -177,13 +177,13 @@ std::vector<CutlassTileConfig> get_candidate_tiles(
     {
         if (sm == 89 || sm >= 120)
         {
-            return {CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128,
-                CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
+            return {CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
                 CutlassTileConfig::CtaShape64x128x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
                 CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
-                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};
+                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64,
+                CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128};
         }
         else
         {

cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ FmhaDispatcher::FmhaDispatcher(MHARunnerFixedParams fixedParams)
     // TRTLLM-GEN only supports power of 2 head sizes.
     // The exception will fall back to fmha v2.
     // Please update fmha_v2/setup.py if you want to add more supported head sizes.
-    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80)
+    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80 && fixedParams.headSize != 72)
 {
     if (mUseTllmGen)
     {
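
The comment in this constructor states the rule being extended: TRTLLM-GEN kernels cover only power-of-two head sizes, so head size 80 and, with this commit, head size 72 (Gemma3 VL) are routed to fmha v2 instead, which is why the matching head-size-72 kernels were added to fmha_v2/setup.py earlier in this diff. Note that the dispatcher itself only special-cases the specific head sizes it knows about; the sketch below is purely illustrative of the power-of-two criterion the comment describes.

```python
def is_power_of_two(n: int) -> bool:
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


# Conceptually: power-of-two head sizes are TRTLLM-GEN capable, the rest need the fmha v2 path.
for head_size in (64, 72, 80, 128):
    backend = "TRTLLM-GEN capable" if is_power_of_two(head_size) else "needs fmha v2 fallback"
    print(head_size, "->", backend)
```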

docs/source/blogs/H100vsA100.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
 
 <sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1</sub>
 
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)
 
 Stay tuned for a highlight on Llama coming soon!

docs/source/blogs/H200launch.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
 
 <sup>*(1) Largest batch supported on given TP configuration by power of 2.*</sup> <sup>*(2) TP = Tensor Parallelism*</sup>
 
-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
 
 ### H200 vs H100

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### Triton Inference Server

docs/source/features/disagg-serving.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### trtllm-serve
100100

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ Welcome to TensorRT LLM's Documentation!
    features/ray-orchestrator.md
    features/torch_compile_and_piecewise_cuda_graph.md
 
+
 .. toctree::
    :maxdepth: 2
    :caption: Developer Guide
