@@ -23,13 +23,18 @@ The performance numbers below were collected using the steps described in this d
 
 Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
 
+*(NEW for v1.0) RTX 6000 Pro Blackwell Server Edition Benchmarks:*
+
+RTX 6000 Pro Blackwell Server Edition data is now included in this performance overview. RTX 6000 systems can benefit from enabling pipeline parallelism (PP) for LLM workloads, so several new benchmarks at various TP x PP combinations have been added for this GPU; that data is presented in a separate table for each model.
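+
+For example, the two-GPU TP1 x PP2 Llama 3.3 70B FP4 configuration in the tables below corresponds roughly to a `trtllm-bench` launch like the following (an illustrative sketch; `$dataset_file` and `$llm_options` are the variables described under "Reproducing Benchmarked Results"):
+
+```shell
+trtllm-bench --tp 1 --pp 2 --model nvidia/Llama-3.3-70B-Instruct-FP4 throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
+```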
+
+
 ### Hardware
 The following GPU variants were used for testing:
 - H100 SXM 80GB (DGX H100)
 - H200 SXM 141GB (DGX H200)
-- GH200 96GB HBM3 (480GB LPDDR5X)
 - B200 180GB (DGX B200)
 - GB200 192GB (GB200 NVL72)
+- RTX 6000 Pro Blackwell Server Edition
 
 Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.
 
@@ -38,128 +43,203 @@ Other hardware variants may have different TDP, memory bandwidth, core count, or
 ```text
 nvidia/Llama-3.3-70B-Instruct-FP4
 nvidia/Llama-3.1-405B-Instruct-FP4
+nvidia/Qwen3-235B-A22B-FP4
+nvidia/Qwen3-30B-A3B-FP4
+nvidia/DeepSeek-R1-0528-FP4
 ```
 
-#### Llama 3.3 70B FP4
-
-| | GPU: | B200 | GB200 |
-| :-----------------------------| :---| :----------| :--------------|
-| | TP Size | 1 | 1 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 10,613.84 | 11,100.97 |
-| 128, 2048 | | 9,445.51 | 10,276.05 |
-| 128, 4096 | | 6,276.85 | 7,351.12 |
-| 500, 2000 | | 6,983.27 | 8,194.30 |
-| 1000, 1000 | | 6,434.29 | 7,401.80 |
-| 1000, 2000 | | 6,725.03 | 6,478.72 |
-| 1024, 2048 | | 6,546.61 | 7,922.88 |
-| 2048, 128 | | 1,330.35 | 1,418.47 |
-| 2048, 2048 | | 4,528.48 | 5,326.77 |
-| 5000, 500 | | 1,427.44 | 1,502.44 |
-| 20000, 2000 | | 636.36 | 732.43 |
-
-#### Llama 3.1 405B FP4
-
-| | GPU: | B200 | GB200 |
-| :-----------------------------| :---| :---------| :--------------|
-| | TP Size | 4 | 4 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 6,218.89 | 6,598.97 |
-| 128, 2048 | | 7,178.10 | 7,497.40 |
-| 128, 4096 | | 5,890.89 | 5,898.19 |
-| 500, 2000 | | 5,844.37 | 6,198.33 |
-| 1000, 1000 | | 4,958.53 | 5,243.35 |
-| 1000, 2000 | | 4,874.16 | 4,905.51 |
-| 1024, 2048 | | 4,833.19 | 4,686.38 |
-| 2048, 128 | | 737.95 | 761.58 |
-| 2048, 2048 | | 4,024.02 | 4,326.56 |
-| 5000, 500 | | 1,032.40 | 1,078.87 |
-| 20000, 2000 | | 667.39 | 649.95 |
-
 ### FP8 Models
 
 ```text
 nvidia/Llama-3.1-8B-Instruct-FP8
 nvidia/Llama-3.3-70B-Instruct-FP8
 nvidia/Llama-3.1-405B-Instruct-FP8
 nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
+nvidia/Qwen3-235B-A22B-FP8
 ```
 
-#### Llama 3.1 8B FP8
-
-| | GPU: | GH200 | H100 | H200 |
-| :-----------------------------| :---| :--------------| :-----------------| :------------------|
-| | TP Size | 1 | 1 | 1 |
-| ISL, OSL | | | | |
-| | | | | |
-| 128, 128 | | 27,304.25 | 26,401.48 | 27,027.80 |
-| 128, 2048 | | 24,045.60 | 21,413.21 | 23,102.25 |
-| 128, 4096 | | 15,409.85 | 13,541.54 | 17,396.83 |
-| 500, 2000 | | 20,123.88 | 17,571.01 | 19,759.16 |
-| 1000, 1000 | | 16,352.99 | 14,991.62 | 17,162.49 |
-| 1000, 2000 | | 15,705.82 | 13,505.23 | 16,227.11 |
-| 1024, 2048 | | 16,102.52 | 13,165.91 | 16,057.66 |
-| 2048, 128 | | 3,573.85 | 3,275.55 | 3,390.69 |
-| 2048, 2048 | | 10,767.05 | 9,462.43 | 11,822.14 |
-| 5000, 500 | | 3,584.74 | 3,276.47 | 3,758.08 |
-| 20000, 2000 | | 1,393.31 | 1,340.69 | 1,705.68 |
-
-#### Llama 3.3 70B FP8
-
-| | GPU: | H100 | H200 |
-| :-----------------------------| :---| :-----------------| :------------------|
-| | TP Size | 2 | 2 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 6,092.28 | 6,327.98 |
-| 128, 2048 | | 5,892.94 | 7,467.36 |
-| 128, 4096 | | 3,828.46 | 5,526.42 |
-| 500, 2000 | | 4,654.74 | 6,639.15 |
-| 1000, 1000 | | 4,181.06 | 4,773.33 |
-| 1000, 2000 | | 3,708.93 | 5,790.36 |
-| 1024, 2048 | | 3,785.04 | 5,480.44 |
-| 2048, 128 | | 723.40 | 747.55 |
-| 2048, 2048 | | 2,785.53 | 3,775.80 |
-| 5000, 500 | | 865.55 | 978.28 |
-| 20000, 2000 | | 411.85 | 609.42 |
-
-#### Llama 3.1 405B FP8
-| | GPU: | H100 | H200 |
-| :-----------------------------| :---| :-----------------| :------------------|
-| | TP Size | 8 | 8 |
-| Runtime Input/Output Lengths | | | |
-| | | | |
-| 128, 128 | | | 3,705.18 |
-| 128, 2048 | | 4,517.39 | 4,715.13 |
-| 128, 4096 | | 2,910.31 | 4,475.91 |
-| 500, 2000 | | 3,664.62 | 4,804.10 |
-| 1000, 1000 | | 2,955.50 | 3,208.25 |
-| 1000, 2000 | | 2,884.69 | 3,630.29 |
-| 1024, 2048 | | 3,237.41 | 3,609.50 |
-| 2048, 128 | | 433.47 | 441.35 |
-| 2048, 2048 | | 2,216.55 | 2,840.86 |
-| 5000, 500 | | 579.05 | 645.26 |
-| 20000, 2000 | | 363.27 | 509.87 |
-
-#### Llama 4 Maverick FP8
-
-Note: Performance for Llama 4 on sequence lengths less than 8,192 tokens is affected by an issue introduced in v0.21. To reproduce the Llama 4 performance noted here, please use v0.20
-
-| | GPU | H200 | H100 |
-| :-----------------------------| :---| :------------------| :-----------------|
-| | TP Size | 8 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 2048 | | 27,543.87 | |
-| 128, 4096 | | 18,541.01 | 11,163.12 |
-| 500, 2000 | | 21,117.34 | |
-| 1000, 2000 | | | 10,556.00 |
-| 1024, 2048 | | 16,859.45 | 11,584.33 |
-| 2048, 128 | | 4,364.06 | 3,832.38 |
-| 2048, 2048 | | 12,800.89 | |
-| 5000, 500 | | 5,128.60 | |
+#### Llama 4 Scout
+
+| Sequence Length (ISL/OSL) | B200<br />TP1 (FP4) | GB200<br />TP1 (FP4) | H200<br />TP4 (FP8) | H100<br />TP4 (FP8) |
+| ---| ---| ---| ---| ---|
+| 128/2048 | 14,699 | 15,238 | 34,316 | 15,130 |
+| 128/4096 | 8,932 | 9,556 | 21,332 | 8,603 |
+| 500/2000 | 11,977 | 11,795 | 24,630 | 12,399 |
+| 1000/1000 | 10,591 | 7,738 | 21,636 | 12,129 |
+| 1000/2000 | 9,356 | 8,581 | 18,499 | 9,838 |
+| 2048/128 | 3,137 | 3,295 | 3,699 | 3,253 |
+| 2048/2048 | 7,152 | 7,464 | 14,949 | 7,972 |
+| 5000/500 | 2,937 | 3,107 | 4,605 | 3,342 |
+| 20000/2000 | 1,644 | 1,767 | 2,105 | |
+
+RTX 6000 Pro Blackwell Server Edition
+
+| Sequence Length (ISL/OSL) | **4 GPUs**<br />TP2,PP2 (FP4) | **8 GPUs**<br />TP4,PP2 (FP4) |
+| ---| ---| ---|
+| 128/2048 | 12,321 | 21,035 |
+| 128/4096 | 7,643 | 13,421 |
+| 1000/1000 | 9,476 | 15,781 |
+| 1000/2000 | 8,919 | 16,434 |
+| 2048/128 | 2,615 | 2,941 |
+| 2048/2048 | 6,208 | 10,410 |
+| 5000/500 | 2,662 | |
+
+#### Llama 3.3 70B
+
+| Sequence Length (ISL/OSL) | B200<br />TP1 (FP4) | GB200<br />TP1 (FP4) | H200<br />TP1 (FP8) | H100<br />TP2 (FP8) |
+| ---| ---| ---| ---| ---|
+| 128/2048 | 9,922 | 11,309 | 4,336 | 6,651 |
+| 128/4096 | 6,831 | 7,849 | 2,872 | 4,199 |
+| 500/2000 | 7,762 | 9,028 | 3,666 | 5,222 |
+| 1000/1000 | 7,007 | 7,326 | 2,909 | 4,205 |
+| 1000/2000 | 6,271 | 6,513 | 2,994 | 4,146 |
+| 2048/128 | 1,339 | 1,450 | 442 | 762 |
+| 2048/2048 | 4,783 | 5,646 | 2,003 | 3,082 |
+| 5000/500 | 1,459 | 1,602 | 566 | 898 |
+| 20000/2000 | 665 | 755 | 283 | 437 |
+
+RTX 6000 Pro Blackwell Server Edition
+
+| Sequence Length (ISL/OSL) | **1 GPU**<br />TP1,PP1 (FP4) | **2 GPUs**<br />TP1,PP2 (FP4) | **4 GPUs**<br />TP1,PP4 (FP4) | **8 GPUs**<br />TP1,PP8 (FP4) |
+| ---| ---| ---| ---| ---|
+| 128/2048 | 2,422 | 4,993 | 7,922 | 9,833 |
+| 128/4096 | 1,349 | 2,893 | 4,978 | 7,352 |
+| 500/2000 | 1,856 | 4,114 | 6,939 | 9,435 |
+| 1000/1000 | 1,787 | 3,707 | 5,961 | 8,166 |
+| 1000/2000 | 1,594 | 2,993 | 5,274 | 6,943 |
+| 2048/128 | 393 | 813 | 1,511 | 2,495 |
+| 2048/2048 | 1,074 | 2,336 | 3,870 | 6,078 |
+| 5000/500 | 401 | 812 | 1,511 | 2,491 |
+| 20000/2000 | 142 | 319 | 630 | 1,148 |
+
+#### Qwen3-235B-A22B
+
+| Sequence Length (ISL/OSL) | B200<br />TP8 (FP4) | H200<br />TP8 (FP8) | H100<br />TP8 (FP8) |
+| ---| ---| ---| ---|
+| 128/2048 | 66,057 | 42,821 | 19,658 |
+| 128/4096 | 39,496 | 26,852 | 12,447 |
+| 500/2000 | 57,117 | 28,026 | 18,351 |
+| 1000/1000 | 42,391 | 23,789 | 14,898 |
+| 1000/2000 | 34,105 | 22,061 | 15,136 |
+| 2048/128 | 7,329 | 3,331 | |
+| 2048/2048 | 26,854 | 16,672 | 9,924 |
+| 5000/500 | 8,190 | 3,623 | 3,225 |
+| 20000/2000 | 4,453 | 1,876 | |
+
+RTX 6000 Pro Blackwell Server Edition
+
+| Sequence Length (ISL/OSL) | **8 GPUs**<br />TP2,PP4 (FP4) |
+| ---| ---|
+| 128/2048 | 12,494 |
+| 128/4096 | 7,715 |
+| 500/2000 | 11,157 |
+| 1000/1000 | 10,697 |
+| 1000/2000 | 10,109 |
+| 2048/128 | 3,181 |
+| 2048/2048 | 6,712 |
+| 5000/500 | 3,173 |
+
+#### Qwen3-30B-A3B
+
+| Sequence Length (ISL/OSL) | B200<br />TP1 (FP4) |
+| ---| ---|
+| 128/2048 | 37,844 |
+| 128/4096 | 24,953 |
+| 500/2000 | 27,817 |
+| 1000/1000 | 25,828 |
+| 1000/2000 | 22,051 |
+| 2048/128 | 6,251 |
+| 2048/2048 | 17,554 |
+| 5000/500 | 6,142 |
+| 20000/2000 | 2,944 |
+
+RTX 6000 Pro Blackwell Server Edition
+
+| Sequence Length (ISL/OSL) | **1 GPU**<br />TP1,PP1 (FP4) | **2 GPUs**<br />TP2,PP1 (FP4) | **4 GPUs**<br />TP4,PP1 (FP4) | **8 GPUs**<br />TP8,PP1 (FP4) |
+| ---| ---| ---| ---| ---|
+| 128/2048 | 12,540 | 22,744 | 35,715 | 52,676 |
+| 128/4096 | 7,491 | 15,049 | 28,139 | 33,895 |
+| 500/2000 | 10,695 | 17,266 | 26,175 | 44,088 |
+| 1000/1000 | 9,910 | 16,431 | 24,046 | 31,785 |
+| 1000/2000 | 8,378 | 13,323 | 25,131 | 28,881 |
+| 2048/128 | 3,257 | 3,785 | 4,311 | 4,798 |
+| 2048/2048 | 5,908 | 10,679 | 18,134 | 22,391 |
+| 5000/500 | 2,530 | 3,799 | 5,212 | 5,965 |
+| 20000/2000 | 871 | 1,558 | 2,551 | |
+
+#### DeepSeek R1
+
+| Sequence Length (ISL/OSL) | B200<br />TP8 (FP4) |
+| ---| ---|
+| 128/2048 | 62,599 |
+| 128/4096 | 44,046 |
+| 1000/1000 | 37,634 |
+| 1000/2000 | 40,538 |
+| 2048/128 | 5,026 |
+| 2048/2048 | 28,852 |
+
+#### Llama 4 Maverick
+
+| Sequence Length (ISL/OSL) | B200<br />TP8 (FP4) | H200<br />TP8 (FP8) | H100<br />TP8 (FP8) |
+| ---| ---| ---| ---|
+| 128/2048 | 112,676 | 40,572 | 10,829 |
+| 128/4096 | 68,170 | 24,616 | 6,744 |
+| 500/2000 | | 37,835 | 10,108 |
+| 1000/1000 | 79,617 | 31,782 | 9,677 |
+| 1000/2000 | 63,766 | 34,734 | 9,151 |
+| 2048/128 | 18,088 | 7,307 | |
+| 2048/2048 | 52,195 | 20,957 | 6,916 |
+| 5000/500 | | 8,456 | 3,457 |
+| 20000/2000 | 12,678 | 4,106 | |
+
+RTX 6000 Pro Blackwell Server Edition
+
+| Sequence Length (ISL/OSL) | **8 GPUs**<br />TP4,PP2 (FP4) |
+| ---| ---|
+| 128/2048 | 19,146 |
+| 128/4096 | 12,165 |
+| 500/2000 | 17,870 |
+| 1000/1000 | 15,954 |
+| 1000/2000 | 12,456 |
+| 2048/128 | 4,463 |
+| 2048/2048 | 10,727 |
+| 5000/500 | 4,613 |
+
+#### Llama 3.1 405B
+
+| Sequence Length (ISL/OSL) | B200<br />TP4 (FP4) | GB200<br />TP4 (FP4) | H200<br />TP8 (FP8) | H100<br />TP8 (FP8) |
+| ---| ---| ---| ---| ---|
+| 128/2048 | 8,020 | 8,151 | 5,348 | 4,340 |
+| 128/4096 | 6,345 | 6,608 | 4,741 | 3,116 |
+| 500/2000 | 6,244 | 6,540 | 4,724 | 3,994 |
+| 1000/1000 | 5,209 | 5,389 | 3,330 | 2,919 |
+| 1000/2000 | 4,933 | 5,135 | 3,722 | 2,895 |
+| 2048/128 | 749 | 797 | 456 | 453 |
+| 2048/2048 | 4,212 | 4,407 | 2,948 | 2,296 |
+| 5000/500 | 1,048 | 1,112 | 650 | 610 |
+| 20000/2000 | 672 | 739 | 505 | 345 |
+
+RTX 6000 Pro Blackwell Server Edition
+
+| Sequence Length (ISL/OSL) | **8 GPUs**<br />TP1,PP8 (FP4) |
+| ---| ---|
+| 128/2048 | 2,981 |
+| 1000/1000 | 2,369 |
+| 1000/2000 | 1,931 |
+| 2048/128 | 579 |
+| 2048/2048 | 1,442 |
+
+#### Llama 3.1 8B
+
+| Sequence Length (ISL/OSL) | H200<br />TP1 (FP8) | H100<br />TP1 (FP8) |
+| ---| ---| ---|
+| 128/2048 | 26,221 | 22,714 |
+| 128/4096 | 18,027 | 14,325 |
+| 500/2000 | 20,770 | 17,660 |
+| 1000/1000 | 17,744 | 15,220 |
+| 1000/2000 | 16,828 | 13,899 |
+| 2048/128 | 3,538 | 3,450 |
+| 2048/2048 | 12,194 | 9,305 |
+| 5000/500 | 3,902 | 3,459 |
+| 20000/2000 | 1,804 | 1,351 |
+
 
 ## Reproducing Benchmarked Results
 
@@ -185,6 +265,7 @@ Starting with v0.19, testing was performed using the PyTorch backend - this work
 | `$osl` | Benchmark output sequence length. |
 | `$tp_size` | Tensor parallel mapping degree to run the benchmark with. |
 | `$pp_size` | Pipeline parallel mapping degree to run the benchmark with. |
+| `$ep_size` | Expert parallel mapping degree to run the benchmark with. |
 | `$model_name` | HuggingFace model name, e.g. meta-llama/Llama-2-7b-hf, or the path to a local weights directory. |
 | `$dataset_file` | Location of the dataset file generated by `prepare_dataset.py`. |
 | `$num_requests` | The number of requests to generate for dataset generation. |
@@ -231,14 +312,43 @@ To run the benchmark with the generated data set, simply use the `trtllm-bench t
 run an offline maximum throughput scenario such that all requests are queued in rapid succession. You simply need to provide
 a model name (HuggingFace reference or path to a local model), a [generated dataset](#preparing-a-dataset), and a file containing any desired extra options to the LLMApi (details in [tensorrt_llm/llmapi/llm_args.py: LlmArgs](../../../tensorrt_llm/llmapi/llm_args.py)).
 
+For dense / non-MoE models:
+
 ```shell
-trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
+trtllm-bench --tp $tp_size --pp $pp_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
+```
+
+`llm_options.yml`
+```yaml
+cuda_graph_config:
+  enable_padding: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
+  - 512
+  - 1024
+  - 2048
+  - 4096
+  - 8192
 ```
 
-The data collected for the v0.21 benchmarks was run with the following file:
+For MoE models:
+
+```shell
+trtllm-bench --tp $tp_size --pp $pp_size --ep $ep_size --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
+```
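+
+For example, a DeepSeek R1 FP4 run on eight GPUs might combine TP8 with expert parallelism roughly as follows (an illustrative sketch; the EP degree and the `$dataset_file`/`$llm_options` placeholders are assumptions, not necessarily the exact configuration used for the tables above):
+
+```shell
+trtllm-bench --tp 8 --pp 1 --ep 8 --model nvidia/DeepSeek-R1-0528-FP4 throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
+```
+
+Pair this with the MoE `llm_options.yml` shown below, which additionally enables attention data parallelism.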
 
 `llm_options.yml`
 ```yaml
+enable_attention_dp: true
 cuda_graph_config:
   enable_padding: true
   batch_sizes: