
Commit 069b05c

[TRTLLM-9706] [doc] Update wide EP documents (#9724)
Signed-off-by: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
1 parent 03f89d7 · commit 069b05c

6 files changed: +363 −169 lines changed


docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 9 additions & 5 deletions
@@ -30,7 +30,7 @@ In this blog, we share the configurations and procedures about how to reproduce
 - [Expected Result Format](#expected-result-format-3)
 - [Exploring more ISL/OSL combinations](#exploring-more-islosl-combinations)
 - [WIP: Enable more features by default](#wip-enable-more-features-by-default)
-- [Not supported: MLA chunked context support on Hopper](#not-supported-mla-chunked-context-support-on-hopper)
+- [MLA chunked context](#mla-chunked-context)
 - [Out of memory issues](#out-of-memory-issues)

@@ -69,8 +69,11 @@ For NVIDIA Hopper GPUs, it's recommended to use the FP8 version of the DeepSeek
 YOUR_MODEL_PATH=<YOUR_MODEL_PATH>
 cd $YOUR_MODEL_PATH

-## Download FP4 model for Blackwell GPUs
-git clone https://huggingface.co/nvidia/DeepSeek-R1-FP4
+## Download NVFP4 model for Blackwell GPUs
+git clone https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2
+
+## Or the 0528 version
+git clone https://huggingface.co/nvidia/DeepSeek-R1-0528-NVFP4-v2

 ## Download FP8 model for Hopper GPUs
 ## FP8 model also works for Blackwell, but FP4 has the best performance on Blackwell.
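
Editor's note, not part of the commit: these Hugging Face checkpoints store their weight files with Git LFS, so the clone commands above only pull real weights if Git LFS is set up first. A minimal sketch, assuming a Debian/Ubuntu host where Git LFS is not yet installed:

```shell
# Install and initialize Git LFS before cloning the checkpoints shown in the hunk above;
# without it, git clone fetches LFS pointer files instead of the actual weights.
sudo apt-get install -y git-lfs   # adjust for your package manager
git lfs install

# Then clone as above, e.g. the NVFP4 checkpoint for Blackwell GPUs:
git clone https://huggingface.co/nvidia/DeepSeek-R1-NVFP4-v2
```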
@@ -402,9 +405,10 @@ Average request latency (ms): 181540.5739
 ## Exploring more ISL/OSL combinations

 To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
+
 ### WIP: Enable more features by default

-Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler and attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
+Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as attention dp. We're working on enabling these features by default, so that users can get good out-of-the-box performance on DeepSeek models.

 Note that, `max_batch_size` and `max_num_tokens` can easily affect the performance. The default values for them are already carefully designed and should deliver good performance on overall cases, however, you may still need to tune it for peak performance.

@@ -414,7 +418,7 @@ For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max

 ### MLA chunked context

-MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.
+MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.

 For more details on `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).

docs/source/deployment-guide/deployment-guide-for-kimi-k2-thinking-on-trtllm.md

Lines changed: 15 additions & 0 deletions
@@ -306,3 +306,18 @@ Run `bench.sh` to begin a serving benchmark.
 ```shell
 ./bench.sh
 ```
+
+## Troubleshooting
+
+Since Kimi K2 Thinking has a larger weight size than other models, it is possible to see host OOM issues such as the following:
+
+```log
+Loading weights: 100%|█████████████████████| 1408/1408 [03:43<00:00, 6.30it/s]
+0: [12/04/2025-18:38:28] [TRT-LLM] [RANK 0] [I] moe_load_balancer finalizing model...
+1: [nvl72136-T14:452151:0:452151] Caught signal 7 (Bus error: nonexistent physical address)
+1: ==== backtrace (tid: 452151) ====
+1: 0 /usr/local/ucx//lib/libucs.so.0(ucs_handle_error+0x2cc) [0xffff9638274c]
+1: 1 /usr/local/ucx//lib/libucs.so.0(+0x328fc) [0xffff963828fc]
+1: 2 /usr/local/ucx//lib/libucs.so.0(+0x32c78) [0xffff96382c78]
+```
+
+This can be addressed by mounting `tmpfs:/dev/shm:size=640G` when launching the Docker container, which increases the shared memory (shm) size that the container can access.
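
Editor's note, not part of the commit: as a concrete illustration of that workaround, here is a hedged sketch of a `docker run` launch with an enlarged `/dev/shm`. `<TRTLLM_IMAGE>` and the remaining flags stand in for whatever launch command this deployment guide uses, and `--shm-size=640g` is shown as a commonly used equivalent of the tmpfs mount described above.

```shell
# Sketch only: enlarge the container's /dev/shm so that host shared memory is not
# capped at Docker's small default while the large MoE weights are loaded.
# <TRTLLM_IMAGE> is a placeholder for the container image used elsewhere in this guide.
docker run --rm -it --gpus all \
  --tmpfs /dev/shm:rw,size=640g \
  <TRTLLM_IMAGE>

# An equivalent, often simpler knob:
#   docker run --rm -it --gpus all --shm-size=640g <TRTLLM_IMAGE>
```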
