
Commit 6345074

Authored by dominicshanshan, yunruis, sunnyqgg, brb-nv, and Superjomn
[None][chore] Weekly mass integration of release/1.1 -- rebase (NVIDIA#9522)
Signed-off-by: yunruis <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: qgai <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]>
Signed-off-by: Yan Chunwei <[email protected]>
Signed-off-by: Junyi Xu <[email protected]>
Signed-off-by: Simeng Liu <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Jin Li <[email protected]>
Signed-off-by: Ivy Zhang <[email protected]>
Signed-off-by: Vincent Zhang <[email protected]>
Signed-off-by: peaceh <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: Michal Guzek <[email protected]>
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
Signed-off-by: leslie-fang25 <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: junq <[email protected]>
Co-authored-by: yunruis <[email protected]>
Co-authored-by: sunnyqgg <[email protected]>
Co-authored-by: brb-nv <[email protected]>
Co-authored-by: Yan Chunwei <[email protected]>
Co-authored-by: JunyiXu-nv <[email protected]>
Co-authored-by: Simeng Liu <[email protected]>
Co-authored-by: Guoming Zhang <[email protected]>
Co-authored-by: Jin Li <[email protected]>
Co-authored-by: Ivy Zhang <[email protected]>
Co-authored-by: Vincent Zhang <[email protected]>
Co-authored-by: peaceh-nv <[email protected]>
Co-authored-by: Michal Guzek <[email protected]>
Co-authored-by: Chang Liu <[email protected]>
Co-authored-by: Leslie Fang <[email protected]>
Co-authored-by: Shunkangz <[email protected]>
Co-authored-by: Shunkang <[email protected]>
Co-authored-by: QI JUN <[email protected]>
1 parent ae0124e commit 6345074

37 files changed: +582, -120 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ Serverless TensorRT LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.co
 
 TensorRT LLM is an open-sourced library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, [FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/), INT4 [AWQ](https://arxiv.org/abs/2306.00978), INT8 [SmoothQuant](https://arxiv.org/abs/2211.10438), ...), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
 
-[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch/arch_overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
+[Architected on PyTorch](https://github.com/NVIDIA/TensorRT-LLM/blob/release/1.1/docs/source/developer-guide/overview.md), TensorRT LLM provides a high-level Python [LLM API](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#llm-api) that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server).
 
 TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](./tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs.
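
The README paragraph touched by this hunk describes the LLM API in prose; a concrete call sequence makes it easier to picture. The sketch below follows the quick-start guide linked in that paragraph; the model name and sampling values are illustrative placeholders, not part of this commit.

```python
from tensorrt_llm import LLM, SamplingParams

# Any Hugging Face-style checkpoint id or local path works here; TinyLlama is just a small example.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# generate() batches the prompts and runs them through the PyTorch-native runtime.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

The same entry point is what scales to the multi-GPU and multi-node setups the paragraph mentions, for example by passing a tensor-parallel size to the LLM constructor as described in the quick-start guide.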

cpp/kernels/fmha_v2/setup.py

Lines changed: 10 additions & 0 deletions
@@ -6398,6 +6398,16 @@ def enumerate_kernels():
             and kspec.cross_mha == False
             and kspec.flash_attention == True
             and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
+        # Gemma3 VL support.
+        or (kspec.sm == 100
+            and kspec.dtype in ['fp16', 'bf16', 'fp16_fp32', 'e4m3', 'e4m3_fp32']
+            and kspec.head_size == 72
+            and kspec.head_size_v == 0
+            and kspec.sage_block_sizes is None
+            and kspec.version == 2
+            and kspec.cross_mha == False
+            and kspec.flash_attention == True
+            and kspec.input_layout != InputLayout.SEPARATE_Q_K_V)
         # Deepseek MLA (generation 576/512 paged)
         or (kspec.sm in [90, 100, 120]
             and kspec.dtype in ['bf16', 'e4m3_fp32']

cpp/tensorrt_llm/common/opUtils.cpp

Lines changed: 94 additions & 12 deletions
@@ -179,16 +179,24 @@ class PerCudaCtxPerThreadSingletonCreator
     PerCudaCtxPerThreadSingletonCreator(CreatorFunc creator, DeleterFunc deleter)
         : mCreator{std::move(creator)}
         , mDeleter{std::move(deleter)}
+        , mObservers{new std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>()}
     {
     }
 
+    ~PerCudaCtxPerThreadSingletonCreator()
+    {
+        std::lock_guard<std::mutex> lk{mMutex};
+        delete mObservers;
+        mObservers = nullptr;
+    }
+
     std::shared_ptr<T> operator()()
     {
         std::lock_guard<std::mutex> lk{mMutex};
         CUcontext ctx{getCurrentCudaCtx()};
         std::thread::id thread = std::this_thread::get_id();
         auto const key = std::make_tuple(ctx, thread);
-        std::shared_ptr<T> result = mObservers[key].lock();
+        std::shared_ptr<T> result = (*mObservers)[key].lock();
         if (result == nullptr)
         {
             TLLM_LOG_TRACE("creating singleton instance for CUDA context %lu and thread %lu", ctx, thread);
@@ -202,6 +210,11 @@ class PerCudaCtxPerThreadSingletonCreator
                 }
                 mDeleter(obj);
 
+                if (mObservers == nullptr)
+                {
+                    return;
+                }
+
                 // Clears observer to avoid growth of mObservers, in case users creates/destroys cuda contexts
                 // frequently.
                 std::shared_ptr<T> observedObjHolder; // Delay destroy to avoid dead lock.
@@ -210,17 +223,18 @@ class PerCudaCtxPerThreadSingletonCreator
                 // thread just before we lock mMutex. We can't infer that the observer is stale from the fact that
                 // obj is destroyed, because shared_ptr ref-count checking and observer removing are not in one
                 // atomic operation, and the observer may be changed to observe another instance.
-                if (mObservers.find(key) == mObservers.end())
+                auto it = mObservers->find(key);
+                if (it == mObservers->end())
                 {
                     return;
                 }
-                observedObjHolder = mObservers.at(key).lock();
+                observedObjHolder = it->second.lock();
                 if (observedObjHolder == nullptr)
                 {
-                    mObservers.erase(key);
+                    mObservers->erase(it);
                 }
             }};
-        mObservers.at(key) = result;
+        (*mObservers)[key] = result;
     }
     else
     {
@@ -235,24 +249,78 @@ class PerCudaCtxPerThreadSingletonCreator
     mutable std::mutex mMutex;
     // CUDA resources are per-context and per-thread.
     using CacheKey = std::tuple<CUcontext, std::thread::id>;
-    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>> mObservers;
+    std::unordered_map<CacheKey, std::weak_ptr<T>, hash<CacheKey>>* mObservers;
+};
+
+// Structure to hold memory information
+struct MemoryInfo
+{
+    size_t free_mb;
+    size_t total_mb;
+    float free_percent;
 };
 
+// Helper function to get current memory information
+MemoryInfo getMemoryInfo()
+{
+    size_t free_mem = 0, total_mem = 0;
+    TLLM_CUDA_CHECK(cudaMemGetInfo(&free_mem, &total_mem));
+
+    size_t const free_mb = free_mem / (1024 * 1024);
+    size_t const total_mb = total_mem / (1024 * 1024);
+    float const free_percent = (total_mem > 0) ? (static_cast<float>(free_mem) / total_mem * 100.0f) : 0.0f;
+
+    return {free_mb, total_mb, free_percent};
+}
+
+// Helper function to log current memory usage
+void logMemoryUsage(char const* operation, CUcontext ctx)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_LOG_DEBUG("%s: Context=%p, Free Memory=%zu MB (%.1f%%), Total=%zu MB", operation, ctx, mem.free_mb,
+        mem.free_percent, mem.total_mb);
+}
+
+// Helper function to throw
+void throwCublasErrorWithMemInfo(char const* operation, CUcontext ctx, cublasStatus_t status)
+{
+    auto const mem = getMemoryInfo();
+    TLLM_THROW(
+        "Failed to create %s. "
+        "Status: %d, Context: %p, Free Memory: %zu MB (%.1f%%), Total: %zu MB. "
+        "Consider reducing kv_cache_config.free_gpu_memory_fraction.",
+        operation, status, ctx, mem.free_mb, mem.free_percent, mem.total_mb);
+}
+
 } // namespace
 
 std::shared_ptr<cublasHandle_t> getCublasHandle()
 {
     static PerCudaCtxPerThreadSingletonCreator<cublasHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasHandle_t>(new cublasHandle_t);
-            TLLM_CUDA_CHECK(cublasCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublas handle", ctx);
+
+            auto handle = std::make_unique<cublasHandle_t>();
+            auto status = cublasCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublas handle", ctx, status);
+            }
+
             return handle;
         },
         [](cublasHandle_t* handle)
        {
-            TLLM_CUDA_CHECK(cublasDestroy(*handle));
+            auto status = cublasDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublas handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
         });
     return creator();
 }
@@ -262,14 +330,28 @@ std::shared_ptr<cublasLtHandle_t> getCublasLtHandle()
     static PerCudaCtxPerThreadSingletonCreator<cublasLtHandle_t> creator(
         []() -> auto
         {
-            auto handle = std::unique_ptr<cublasLtHandle_t>(new cublasLtHandle_t);
-            TLLM_CUDA_CHECK(cublasLtCreate(handle.get()));
+            CUcontext ctx = getCurrentCudaCtx();
+            logMemoryUsage("Creating cublasLt handle", ctx);
+
+            auto handle = std::make_unique<cublasLtHandle_t>();
+            auto status = cublasLtCreate(handle.get());
+
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                throwCublasErrorWithMemInfo("cublasLt handle", ctx, status);
+            }
+
             return handle;
        },
        [](cublasLtHandle_t* handle)
        {
-            TLLM_CUDA_CHECK(cublasLtDestroy(*handle));
+            auto status = cublasLtDestroy(*handle);
+            if (status != CUBLAS_STATUS_SUCCESS)
+            {
+                TLLM_LOG_WARNING("Failed to destroy cublasLt handle. Status: %d", status);
+            }
             delete handle;
+            handle = nullptr;
        });
    return creator();
 }
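
The comments inside this hunk carry the reasoning: handles are cached per (CUDA context, thread) under weak references so they die with their last user, and the deleter prunes stale map entries; the commit also turns mObservers into a heap pointer so the deleter can bail out with a nullptr check after the static creator is torn down. As an illustration of the caching pattern only (not the TensorRT-LLM C++ code), here is a Python sketch with invented class and key names:

```python
import threading
import weakref


class Handle:
    """Stand-in for a cublas-style handle object."""

    def __init__(self, name: str):
        self.name = name


class PerContextPerThreadCache:
    """Hand out one shared object per (context, thread) key.

    Entries are held weakly, so a handle is destroyed as soon as every caller
    drops its reference, mirroring the weak_ptr map in opUtils.cpp.
    """

    def __init__(self, creator):
        self._creator = creator
        self._lock = threading.Lock()
        self._observers = weakref.WeakValueDictionary()

    def get(self, context) -> Handle:
        key = (context, threading.get_ident())
        with self._lock:
            obj = self._observers.get(key)
            if obj is None:
                obj = self._creator()
                self._observers[key] = obj
            return obj


cache = PerContextPerThreadCache(lambda: Handle("cublas"))
a = cache.get("ctx0")
b = cache.get("ctx0")
assert a is b  # same context + same thread -> same handle instance
```

In the C++ version the stale-entry cleanup has to be done by the custom deleter under the mutex, which is exactly the code path the nullptr guard above protects during static destruction.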

cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp

Lines changed: 3 additions & 3 deletions
@@ -177,13 +177,13 @@ std::vector<CutlassTileConfig> get_candidate_tiles(
     {
         if (sm == 89 || sm >= 120)
         {
-            return {CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128,
-                CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
+            return {CutlassTileConfig::CtaShape32x128x64_WarpShape32x32x64,
                 CutlassTileConfig::CtaShape64x128x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape64x64x128_WarpShape32x64x64,
                 CutlassTileConfig::CtaShape128x64x64_WarpShape64x32x64,
                 CutlassTileConfig::CtaShape128x256x64_WarpShape64x64x64,
-                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64};
+                CutlassTileConfig::CtaShape256x128x64_WarpShape64x64x64,
+                CutlassTileConfig::CtaShape16x256x128_WarpShape16x64x128};
         }
         else
         {

cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ FmhaDispatcher::FmhaDispatcher(MHARunnerFixedParams fixedParams)
     // TRTLLM-GEN only supports power of 2 head sizes.
     // The exception will fall back to fmha v2.
     // Please update fmha_v2/setup.py if you want to add more supported head sizes.
-    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80)
+    , mUseTllmGen(tensorrt_llm::common::isSM100Family() && fixedParams.headSize != 80 && fixedParams.headSize != 72)
 {
     if (mUseTllmGen)
     {
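
The comment in this constructor states the rule being extended: TRTLLM-GEN kernels cover only power-of-two head sizes, so head size 80 and, with this commit, head size 72 (Gemma3 VL) are routed to fmha v2 instead, which is why the matching head-size-72 kernels were added to fmha_v2/setup.py earlier in this diff. Note that the dispatcher itself only special-cases the specific head sizes it knows about; the sketch below is purely illustrative of the power-of-two criterion the comment describes.

```python
def is_power_of_two(n: int) -> bool:
    # A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
    return n > 0 and (n & (n - 1)) == 0


# Conceptually: power-of-two head sizes are TRTLLM-GEN capable, the rest need the fmha v2 path.
for head_size in (64, 72, 80, 128):
    backend = "TRTLLM-GEN capable" if is_power_of_two(head_size) else "needs fmha v2 fallback"
    print(head_size, "->", backend)
```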

docs/source/blogs/H100vsA100.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x
 
 <sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1</sub>
 
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html)
+The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html)
 
 Stay tuned for a highlight on Llama coming soon!

docs/source/blogs/H200launch.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ TensorRT LLM evaluation of the [new H200 GPU](https://nvidianews.nvidia.com/news
 
 <sup>*(1) Largest batch supported on given TP configuration by power of 2.*</sup> <sup>*(2) TP = Tensor Parallelism*</sup>
 
-Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).
+Additional Performance data is available on the [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference/ai-inference) page, & soon in [TensorRT LLM's Performance Documentation](https://nvidia.github.io/TensorRT-LLM/0.21.0/performance/perf-overview.html).
 
 ### H200 vs H100

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### Triton Inference Server

docs/source/features/disagg-serving.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ In the Dynamo workflow, requests are initially processed by pre- and post-proces
 
 Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
 
-For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/examples/trtllm.html).
+For more information on how to use Dynamo with TensorRT-LLM, please refer to [this documentation](https://docs.nvidia.com/dynamo/latest/backends/trtllm/README.html).
 
 ### trtllm-serve
100100

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ Welcome to TensorRT LLM's Documentation!
    features/ray-orchestrator.md
    features/torch_compile_and_piecewise_cuda_graph.md
 
+
 .. toctree::
    :maxdepth: 2
    :caption: Developer Guide
