
Commit 069b05c

[TRTLLM-9706] [doc] Update wide EP documents (#9724)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
1 parent 03f89d7 · commit 069b05c

6 files changed: +363 −169 lines changed


docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 9 additions & 5 deletions
@@ -30,7 +30,7 @@ In this blog, we share the configurations and procedures about how to reproduce
 - [Expected Result Format](#expected-result-format-3)
 - [Exploring more ISL/OSL combinations](#exploring-more-islosl-combinations)
 - [WIP: Enable more features by default](#wip-enable-more-features-by-default)
-- [Not supported: MLA chunked context support on Hopper](#not-supported-mla-chunked-context-support-on-hopper)
+- [MLA chunked context](#mla-chunked-context)
 - [Out of memory issues](#out-of-memory-issues)

@@ -69,8 +69,11 @@ For NVIDIA Hopper GPUs, it's recommended to use the FP8 version of the DeepSeek
 YOUR_MODEL_PATH=<YOUR_MODEL_PATH>
 cd $YOUR_MODEL_PATH

-## Download FP4 model for Blackwell GPUs
-git clone https://huggingface.co/nvidia/DeepSeek-R1-FP4
+## Download NVFP4 model for Blackwell GPUs
+git clone https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2
+
+## Or the 0528 version
+git clone https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2

 ## Download FP8 model for Hopper GPUs
 ## FP8 model also works for Blackwell, but FP4 has the best performance on Blackwell.
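
Editor's note, not part of the commit: these Hugging Face checkpoints store their weight files with Git LFS, so the clone commands above only pull real weights if Git LFS is set up first. A minimal sketch, assuming a Debian/Ubuntu host where Git LFS is not yet installed:

```shell
# Install and initialize Git LFS before cloning the checkpoints shown in the hunk above;
# without it, git clone fetches LFS pointer files instead of the actual weights.
sudo apt-get install -y git-lfs   # adjust for your package manager
git lfs install

# Then clone as above, e.g. the NVFP4 checkpoint for Blackwell GPUs:
git clone https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2
```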
@@ -402,9 +405,10 @@ Average request latency (ms): 181540.5739
 ## Exploring more ISL/OSL combinations

 To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
+
 ### WIP: Enable more features by default

-Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler and attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
+Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as attention dp. We're working on enabling these features by default, so that users can get good out-of-the-box performance on DeepSeek models.

 Note that, `max_batch_size` and `max_num_tokens` can easily affect the performance. The default values for them are already carefully designed and should deliver good performance on overall cases, however, you may still need to tune it for peak performance.

@@ -414,7 +418,7 @@ For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max

 ### MLA chunked context

-MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.
+MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.

 For more details on `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).

docs/source/deployment-guide/deployment-guide-for-kimi-k2-thinking-on-trtllm.md

Lines changed: 15 additions & 0 deletions
@@ -306,3 +306,18 @@ Run `bench.sh` to begin a serving benchmark.
 ```shell
 ./bench.sh
 ```
+
+## Troubleshooting
+
+Since Kimi K2 Thinking has a larger weight size than other models, it is possible to see host OOM issues such as the following:
+
+```log
+Loading weights: 100%|█████████████████████| 1408/1408 [03:43<00:00, 6.30it/s]
+0: [12/04/2025-18:38:28] [TRT-LLM] [RANK 0] [I] moe_load_balancer finalizing model...
+1: [nvl72136-T14:452151:0:452151] Caught signal 7 (Bus error: nonexistent physical address)
+1: ==== backtrace (tid: 452151) ====
+1: 0 /usr/local/ucx//lib/libucs.so.0(ucs_handle_error+0x2cc) [0xffff9638274c]
+1: 1 /usr/local/ucx//lib/libucs.so.0(+0x328fc) [0xffff963828fc]
+1: 2 /usr/local/ucx//lib/libucs.so.0(+0x32c78) [0xffff96382c78]
+```
+
+This can be addressed by mounting `tmpfs:/dev/shm:size=640G` when launching the Docker container, which increases the shared memory (shm) size that the container can access.
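
Editor's note, not part of the commit: as a concrete illustration of that workaround, here is a hedged sketch of a `docker run` launch with an enlarged `/dev/shm`. `<TRTLLM_IMAGE>` and the remaining flags stand in for whatever launch command this deployment guide uses, and `--shm-size=640g` is shown as a commonly used equivalent of the tmpfs mount described above.

```shell
# Sketch only: enlarge the container's /dev/shm so that host shared memory is not
# capped at Docker's small default while the large MoE weights are loaded.
# <TRTLLM_IMAGE> is a placeholder for the container image used elsewhere in this guide.
docker run --rm -it --gpus all \
  --tmpfs /dev/shm:rw,size=640g \
  <TRTLLM_IMAGE>

# An equivalent, often simpler knob:
#   docker run --rm -it --gpus all --shm-size=640g <TRTLLM_IMAGE>
```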
