
Commit 03f1a6a

Update DeepSeek R1 perf numbers to latest release/0.20 results (NVIDIA#5235)
1 parent 64b7f04 commit 03f1a6a

1 file changed: +87 -23 lines changed


docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 87 additions & 23 deletions
@@ -18,7 +18,10 @@ In this blog, we share the configurations and procedures about how to reproduce
 - [Reproducing steps](#reproducing-steps)
   - [B200 min-latency](#b200-min-latency)
     - [Expected Results](#expected-results)
-  - [B200 max-throughput](#b200-max-throughput)
+  - [B200 max-throughput with FP8 KV](#b200-max-throughput-for-r1-0528-with-fp8-kv-cache)
+    - [Benchmark](#benchmark)
+    - [Expected Result Format](#expected-result-format)
+  - [B200 max-throughput with FP16 KV](#b200-max-throughput-for-r1-with-fp16-kv-cache)
     - [Benchmark](#benchmark)
     - [Expected Result Format](#expected-result-format)
   - [H200 min-latency](#h200-min-latency)
@@ -181,9 +184,68 @@ Total Token Throughput (tokens/sec): 414.0461
 Total Latency (ms): 74561.7520
 Average request latency (ms): 7456.1219
 ```
+### B200 max-throughput for R1-0528 with FP8 KV cache
+
+Our evaluation found that FP8 KV cache does not introduce an obvious accuracy drop compared to BF16 KV cache (see [Precision strategy](./tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md#precision-strategy)), so the latest [DeepSeek-R1-0528-FP4](https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4) checkpoint enables FP8 KV cache by default.
 
-### B200 max-throughput
-Our benchmark results are based on **Batch = 3072, ISL = 1K, OSL = 2K, num_requests = 49152 from synthetic dataset**
+We see a meaningful speedup from the FP8 KV cache, so we are refreshing the numbers here. The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
+
+!! Note that the exact command to reproduce the numbers can change as the API/options are refactored; the options and numbers here are a reference at the given commit.
+
+#### Benchmark
+```bash
+cat >./extra-llm-api-config.yml <<EOF
+pytorch_backend_config:
+  use_cuda_graph: true
+  cuda_graph_padding_enabled: true
+  cuda_graph_batch_sizes:
+  - 896
+  - 512
+  - 256
+  - 128
+  - 64
+  - 32
+  - 16
+  - 8
+  - 4
+  - 2
+  - 1
+  print_iter_log: true
+  kv_cache_dtype: fp8
+enable_attention_dp: true
+EOF
+trtllm-bench --model nvidia/DeepSeek-R1-0528-FP4 \
+    throughput \
+    --dataset ${YOUR_DATA_PATH} \
+    --backend pytorch \
+    --tp 8 --ep 8 \
+    --extra_llm_api_options ./extra-llm-api-config.yml \
+    --max_batch_size 896 \
+    --max_num_tokens 2048 \
+    --kv_cache_free_gpu_mem_fraction 0.93 \
+    --concurrency 7168 \
+    --num_requests 114688
+```
+#### Expected Result Format
+```
+===========================================================
+= PERFORMANCE OVERVIEW
+===========================================================
+Request Throughput (req/sec): 21.0675
+Total Output Throughput (tokens/sec): 43146.2042
+Total Token Throughput (tokens/sec): 65100.6376
+Total Latency (ms): 5443839.8140
+Average request latency (ms): 332826.9898
+Per User Output Throughput [w/ ctx] (tps/user): 6.1806
+Per GPU Output Throughput (tps/gpu): 5393.2755
+```
+
+### B200 max-throughput for R1 with FP16 KV cache
+Our benchmark results are based on **Batch = 3072, ISL = 1K, OSL = 2K, num_requests = 49152 from synthetic dataset**.
+
+The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
+
+!! Note that the exact command to reproduce the numbers can change as the API/options are refactored; the options and numbers here are a reference at the given commit.
 
 #### Benchmark
 To do the benchmark, run the following command:
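
A note on how the FP8-KV run above is sized, which may help when adapting the flags: with `enable_attention_dp: true` and `--tp 8`, each of the 8 ranks appears to serve its own batch, so `--concurrency` looks to be chosen as `--max_batch_size` times the number of ranks, and `--num_requests` as a whole number of concurrency waves. This is our reading of the flag values, not a documented rule; the arithmetic can be checked directly:

```bash
# Hedged sanity check of how the FP8-KV benchmark flags relate to each other (our interpretation):
#   concurrency  = max_batch_size x ranks -> 896 * 8   = 7168
#   num_requests = concurrency x 16 waves -> 7168 * 16 = 114688
python3 -c 'print(896 * 8, 7168 * 16)'   # expected output: 7168 114688
```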
@@ -201,20 +263,21 @@ python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
 YOUR_DATA_PATH=./dataset.txt
 
 cat >./extra-llm-api-config.yml <<EOF
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
-print_iter_log: true
+pytorch_backend_config:
+  use_cuda_graph: true
+  cuda_graph_padding_enabled: true
+  cuda_graph_batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
+  print_iter_log: ${PRINT_ITER_LOG}
 enable_attention_dp: true
 EOF
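One detail of the refreshed config in this hunk: `print_iter_log` now takes its value from the `${PRINT_ITER_LOG}` environment variable. Because the heredoc delimiter (`EOF`) is unquoted, the shell expands the variable when the YAML file is written, so it must be set before the `cat > ./extra-llm-api-config.yml` block runs; a minimal sketch:

```bash
# ${PRINT_ITER_LOG} is substituted by the shell at file-creation time (unquoted EOF heredoc),
# so export it before generating extra-llm-api-config.yml.
export PRINT_ITER_LOG=false   # set to true to enable per-iteration logging during the run
```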

@@ -239,12 +302,13 @@ The perf might be different from different datasets and machines
 ===========================================================
 = PERFORMANCE OVERVIEW
 ===========================================================
-Request Throughput (req/sec): 17.3885
-Total Output Throughput (tokens/sec): 35611.5942
-Per User Output Throughput (tokens/sec/user): 11.6701
-Per GPU Output Throughput (tokens/sec/gpu): 4451.4493
-Total Latency (ms): 2826700.0758
-Average request latency (ms): 176064.1921
+Request Throughput (req/sec): 17.7657
+Total Output Throughput (tokens/sec): 36384.0838
+Total Token Throughput (tokens/sec): 54576.1257
+Total Latency (ms): 2766684.9197
+Average request latency (ms): 172321.7206
+Per User Output Throughput [w/ ctx] (tps/user): 11.9263
+Per GPU Output Throughput (tps/gpu): 4548.0105
 ```
 
 ### H200 min-latency
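
As a quick consistency check on both B200 max-throughput tables above, the reported `Per GPU Output Throughput (tps/gpu)` equals `Total Output Throughput` divided by the 8 GPUs used in these runs:

```bash
# Total Output Throughput / 8 GPUs reproduces the reported per-GPU figures.
python3 -c 'print(round(43146.2042 / 8, 4))'   # FP8  KV run -> 5393.2755
python3 -c 'print(round(36384.0838 / 8, 4))'   # FP16 KV run -> 4548.0105
```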
