docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md
- [Reproducing steps](#reproducing-steps)
- [B200 min-latency](#b200-min-latency)
- [Expected Results](#expected-results)
- [B200 max-throughput with FP8 KV](#b200-max-throughput-for-r1-0528-with-fp8-kv-cache)
- [Benchmark](#benchmark)
- [Expected Result Format](#expected-result-format)
- [B200 max-throughput with FP16 KV](#b200-max-throughput-for-r1-with-fp16-kv-cache)
- [Benchmark](#benchmark)
- [Expected Result Format](#expected-result-format)
- [H200 min-latency](#h200-min-latency)
Total Token Throughput (tokens/sec): 414.0461
Total Latency (ms): 74561.7520
Average request latency (ms): 7456.1219
```
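As a rough sanity check, summary metrics like these can be related to one another. The helper below is a hypothetical sketch (the function name and the assumption that requests complete sequentially are ours, not the benchmark harness's exact accounting):

```python
def summarize(num_requests: int, tokens_per_request: int,
              per_request_latency_ms: float) -> dict:
    """Relate benchmark summary metrics, assuming sequential requests.

    Illustrative only; not the exact formulas used by the harness.
    """
    total_latency_ms = per_request_latency_ms * num_requests
    total_tokens = num_requests * tokens_per_request
    return {
        "total_token_throughput_tps": total_tokens / (total_latency_ms / 1000.0),
        "total_latency_ms": total_latency_ms,
        "avg_request_latency_ms": total_latency_ms / num_requests,
    }

# Made-up inputs in the same ballpark as the report above
# (10 requests, ISL 1K + OSL 2K = ~3072 tokens each):
m = summarize(num_requests=10, tokens_per_request=3072,
              per_request_latency_ms=7456.12)
```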
### B200 max-throughput for R1-0528 with FP8 KV cache

Our evaluation found that FP8 KV cache does not introduce an obvious accuracy drop compared to BF16 KV cache (see [Precision strategy](./tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md#precision-strategy)), so the latest [DeepSeek-R1-0528-FP4](https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4) checkpoint enables FP8 KV cache by default.

We are seeing a meaningful speedup with FP8 KV cache, so we have refreshed the numbers here. The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.

Note that the exact command to reproduce the numbers can change as the API and options are refactored; the options and numbers here are a reference at the given commit.
Per User Output Throughput [w/ ctx] (tps/user): 6.1806
Per GPU Output Throughput (tps/gpu): 5393.2755
```
### B200 max-throughput for R1 with FP16 KV cache

Our benchmark results are based on **Batch = 3072, ISL = 1K, OSL = 2K, num_requests = 49152 from synthetic dataset**.

The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.

Note that the exact command to reproduce the numbers can change as the API and options are refactored; the options and numbers here are a reference at the given commit.
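For scale, the synthetic workload above can be sized directly from its parameters. A back-of-envelope calculation (assuming 1K/2K mean 1024/2048 tokens), not harness output:

```python
num_requests, isl, osl, batch = 49152, 1024, 2048, 3072

total_input_tokens = num_requests * isl    # tokens prefilled across the run
total_output_tokens = num_requests * osl   # tokens generated across the run
waves = num_requests // batch              # full batches of requests

print(total_input_tokens, total_output_tokens, waves)
```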