
Commit 5e6a1ad

Port prepare dataset to trtllm-bench.
Squashed sub-commits:

- Add MacOSX DS_Store to gitignore.
- Update imports.
- Update click group.
- Updates to CLI.
- Rename.
- Add name.
- Renamed real dataset command.
- Change to group.
- Add docstring.
- Remove pass_obj.
- Fix context subscription.
- Updates to output.
- Updates to remove stdout.
- Add deprecation flag.
- Code clean up.
- Fix generator call.
- Update prepare_dataset in docs.
- Update examples.
- Update testing for trtllm-bench dataset.
- Remove trtllm-bench dataset from run_ex.
- Add missed __init__.py
- Re-add check for dataset subcommand.
- Fix execution of trtllm-bench dataset.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
1 parent 4180417 commit 5e6a1ad

File tree

22 files changed: +738 -126 lines


.gitignore

Lines changed: 3 additions & 0 deletions

````diff
@@ -79,3 +79,6 @@ compile_commands.json
 
 # Enroot sqsh files
 enroot/tensorrt_llm.devel.sqsh
+
+# MacOSX Files
+.DS_Store
````

benchmarks/cpp/prepare_dataset.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -49,7 +49,7 @@ def validate_tokenizer(self):
         return self
 
 
-@click.group()
+@click.group(deprecated=True)
 @click.option(
     "--tokenizer",
     required=True,
````
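The change above marks the legacy CLI group as deprecated while leaving it functional. A minimal sketch of this click pattern (the option and subcommand names here are illustrative; only the `deprecated=True` flag is taken from the diff):

```python
import click

# Marking a whole click group as deprecated surfaces a deprecation notice
# while keeping its commands callable during a migration window.
@click.group(deprecated=True)
@click.option("--tokenizer", required=True, help="Tokenizer name or path.")
@click.pass_context
def cli(ctx, tokenizer):
    """Legacy dataset-preparation entry point."""
    ctx.ensure_object(dict)
    ctx.obj["tokenizer"] = tokenizer

@cli.command("token-norm-dist")
@click.option("--num-requests", type=int, default=100)
@click.pass_context
def token_norm_dist(ctx, num_requests):
    # A real implementation would emit the synthetic dataset here.
    click.echo(f"{ctx.obj['tokenizer']}: {num_requests} requests")
```

Running any subcommand still works, but click reports the group as deprecated in `--help` output.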

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 11 additions & 10 deletions

````diff
@@ -248,13 +248,13 @@ To do the benchmark, run the following command:
 
 ```bash
 # generate synthetic dataset
-python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
-    --stdout \
-    --tokenizer nvidia/DeepSeek-R1-FP4 \
+trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
+    dataset \
+    --output dataset.txt \
     token-norm-dist \
     --input-mean 1024 --output-mean 2048 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 49152 > dataset.txt
+    --num-requests 49152
 
 YOUR_DATA_PATH=./dataset.txt
 
@@ -350,13 +350,14 @@ To do the benchmark, run the following command:
 
 ```bash
 # generate synthetic dataset
-python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
-    --stdout \
-    --tokenizer deepseek-ai/DeepSeek-R1 \
+trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
+    dataset \
+    --output dataset.txt \
     token-norm-dist \
     --input-mean 1024 --output-mean 2048 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 5120 > dataset.txt
+    --num-requests 5120
+
 YOUR_DATA_PATH=./dataset.txt
 
 cat >./extra-llm-api-config.yml<<EOF
@@ -401,7 +402,7 @@ Average request latency (ms): 181540.5739
 
 ## Exploring more ISL/OSL combinations
 
-To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
+To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use the `trtllm-bench dataset` subcommand to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
 ### WIP: Enable more features by default
 
 Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler and attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
@@ -414,7 +415,7 @@ For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max
 
 ### MLA chunked context
 
-MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.
+MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.
 
 For more details on `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
````

docs/source/developer-guide/perf-analysis.md

Lines changed: 6 additions & 4 deletions

````diff
@@ -72,10 +72,12 @@ Say we want to profile iterations 100 to 150 on a `trtllm-bench`/`trtllm-serve`
 #!/bin/bash
 
 # Prepare dataset for the benchmark
-python3 benchmarks/cpp/prepare_dataset.py \
-    --tokenizer=${MODEL_PATH} \
-    --stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
-    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
+trtllm-bench --model ${MODEL_PATH} \
+    dataset \
+    --output dataset.txt \
+    token-norm-dist \
+    --num-requests=${NUM_SAMPLES} \
+    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0
 
 # Benchmark and profile
 TLLM_PROFILE_START_STOP=100-150 nsys profile \
````
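`TLLM_PROFILE_START_STOP=100-150` restricts capture to an iteration window. A rough sketch of how such a range spec can be parsed and checked per iteration (a hypothetical helper, not TensorRT LLM's actual implementation):

```python
import os

def profile_window(spec: str) -> range:
    """Parse a 'START-STOP' iteration spec such as '100-150' into an
    inclusive range of iteration indices to profile."""
    start, stop = (int(part) for part in spec.split("-"))
    return range(start, stop + 1)

# The runtime would consult the window once per iteration to decide
# whether the profiler should be active.
window = profile_window(os.environ.get("TLLM_PROFILE_START_STOP", "100-150"))
captured = [it for it in (99, 100, 125, 150, 151) if it in window]  # -> [100, 125, 150]
```

With stdev-0 datasets every request has the same shape, so a fixed iteration window like this samples steady-state behavior.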

docs/source/developer-guide/perf-benchmarking.md

Lines changed: 11 additions & 10 deletions

````diff
@@ -150,7 +150,7 @@ directory. For example, to generate a synthetic dataset of 1000 requests with a
 128/128 for [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), run:
 
 ```shell
-python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt
+trtllm-bench --model meta-llama/Llama-3.1-8B dataset --output /tmp/synthetic_128_128.txt token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000
 ```
 
 ### Running with the PyTorch Workflow
@@ -231,13 +231,13 @@ The PyTorch workflow supports benchmarking with LoRA (Low-Rank Adaptation) adapt
 
 **Preparing LoRA Dataset**
 
-Use `prepare_dataset.py` with LoRA-specific options to generate requests with LoRA metadata:
+Use `trtllm-bench dataset` with LoRA-specific options to generate requests with LoRA metadata:
 
 ```shell
-python3 benchmarks/cpp/prepare_dataset.py \
-    --stdout \
+trtllm-bench \
+    --model /path/to/tokenizer \
+    dataset \
     --rand-task-id 0 1 \
-    --tokenizer /path/to/tokenizer \
     --lora-dir /path/to/loras \
     token-norm-dist \
     --num-requests 100 \
@@ -308,17 +308,18 @@ Each subdirectory should contain the LoRA adapter files for that specific task.
 To benchmark multi-modal models with PyTorch workflow, you can follow the similar approach as above.
 
 First, prepare the dataset:
-```python
-python ./benchmarks/cpp/prepare_dataset.py \
-    --tokenizer Qwen/Qwen2-VL-2B-Instruct \
-    --stdout \
+```bash
+trtllm-bench \
+    --model Qwen/Qwen2-VL-2B-Instruct \
     dataset \
+    --output mm_data.jsonl
+    real-dataset
     --dataset-name lmms-lab/MMMU \
     --dataset-split test \
     --dataset-image-key image \
     --dataset-prompt-key question \
     --num-requests 10 \
-    --output-len-dist 128,5 > mm_data.jsonl
+    --output-len-dist 128,5
 ```
 It will download the media files to `/tmp` directory and prepare the dataset with their paths. Note that the `prompt` fields are texts and not tokenized ids. This is due to the fact that
 the `prompt` and the media (image/video) are processed by a preprocessor for multimodal files.
````
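The `token-norm-dist` generator used throughout these examples draws per-request input/output token counts from a normal distribution (a stdev of 0 yields fixed lengths). A rough stdlib sketch of the idea, with hypothetical field names — the real record layout is defined by trtllm-bench:

```python
import json
import random

def synth_requests(num_requests, input_mean, input_stdev,
                   output_mean, output_stdev, vocab_size=32000, seed=0):
    """Yield JSONL-style records with normally distributed token lengths."""
    rng = random.Random(seed)
    for task_id in range(num_requests):
        # Sample lengths; clamp so every request has at least one token.
        isl = max(1, round(rng.gauss(input_mean, input_stdev)))
        osl = max(1, round(rng.gauss(output_mean, output_stdev)))
        yield {
            "task_id": task_id,
            "input_ids": [rng.randrange(vocab_size) for _ in range(isl)],
            "output_tokens": osl,
        }

# With stdev 0 every request gets exactly 128 input and 128 output tokens.
lines = [json.dumps(record) for record in synth_requests(3, 128, 0, 128, 0)]
```

Writing `lines` to a file, one record per line, mirrors what `--output` produces: a JSONL dataset the `throughput`/`latency` subcommands can consume.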

docs/source/legacy/performance/perf-analysis.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -66,10 +66,10 @@ Say we want to profile iterations 100 to 150 on a trtllm-bench/trtllm-serve run,
 #!/bin/bash
 
 # Prepare dataset for the benchmark
-python3 benchmarks/cpp/prepare_dataset.py \
-    --tokenizer=${MODEL_PATH} \
-    --stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
-    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
+trtllm-bench \
+    --model=${MODEL_PATH} dataset \
+    --output /tmp/dataset.txt token-norm-dist --num-requests=${NUM_SAMPLES} \
+    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0
 
 # Benchmark and profile
 TLLM_PROFILE_START_STOP=100-150 nsys profile \
````

docs/source/legacy/performance/perf-benchmarking.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -110,7 +110,7 @@ of 128:128.
 To run the benchmark from start to finish, run the following commands:
 
 ```shell
-python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt
+trtllm-bench --tokenizer meta-llama/Llama-3.1-8B dataset --output /tmp/synthetic_128_128.txt token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000
 trtllm-bench --model meta-llama/Llama-3.1-8B build --dataset /tmp/synthetic_128_128.txt --quantization FP8
 trtllm-bench --model meta-llama/Llama-3.1-8B throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-3.1-8B/tp_1_pp_1
 ```
@@ -207,7 +207,7 @@ directory. For example, to generate a synthetic dataset of 1000 requests with a
 128/128 for [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), run:
 
 ```shell
-benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt
+trtllm-bench --tokenizer meta-llama/Llama-3.1-8B dataset --output /tmp/synthetic_128_128.txt token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000
 ```
 
 ### Building a Benchmark Engine
````

examples/llm-api/llm_mgmn_trtllm_bench.sh

Lines changed: 4 additions & 5 deletions

````diff
@@ -35,7 +35,6 @@
 # not supported in Slurm mode, you need to download the model and put it in
 # the LOCAL_MODEL directory.
 
-export prepare_dataset="$SOURCE_ROOT/benchmarks/cpp/prepare_dataset.py"
 export data_path="$WORKDIR/token-norm-dist.txt"
 
 echo "Preparing dataset..."
@@ -50,14 +49,14 @@ srun -l \
     --mpi=pmix \
     bash -c "
         $PROLOGUE
-        python3 $prepare_dataset \
-            --tokenizer=$LOCAL_MODEL \
-            --stdout token-norm-dist \
+        trtllm-bench --model=$LOCAL_MODEL dataset \
+            --output $data_path \
+            token-norm-dist \
             --num-requests=100 \
            --input-mean=128 \
            --output-mean=128 \
            --input-stdev=0 \
-           --output-stdev=0 > $data_path
+           --output-stdev=0
     "
 
 echo "Running benchmark..."
````

examples/llm-api/out_of_tree_example/readme.md

Lines changed: 11 additions & 1 deletion

````diff
@@ -42,7 +42,17 @@ Similar to the quickstart example, you can use the same CLI argument with `trtll
 
 Prepare the dataset:
 ```
-python ./benchmarks/cpp/prepare_dataset.py --tokenizer ./model_ckpt --stdout dataset --dataset-name lmms-lab/MMMU --dataset-split test --dataset-image-key image --dataset-prompt-key "question" --num-requests 100 --output-len-dist 128,5 > mm_data.jsonl
+trtllm-bench \
+    --model ./model_ckpt \
+    dataset \
+    --output mm_data.jsonl
+    real-dataset
+    --dataset-name lmms-lab/MMMU \
+    --dataset-split test \
+    --dataset-image-key image \
+    --dataset-prompt-key question \
+    --num-requests 10 \
+    --output-len-dist 128,5
 ```
 
````

examples/models/core/deepseek_v3/README.md

Lines changed: 9 additions & 7 deletions

````diff
@@ -138,12 +138,13 @@ To avoid OOM (out of memory) error, you need to adjust the values of "--max_batc
 #### ISL-64k-OSL-1024
 ```bash
 DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
-python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
-    --stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
+trtllm-bench --model ${DS_R1_NVFP4_MODEL_PATH} \
+    dataset \
+    --output /tmp/benchmarking_64k.txt \
     token-norm-dist \
     --input-mean 65536 --output-mean 1024 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 24 > /tmp/benchmarking_64k.txt
+    --num-requests 24
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
@@ -164,12 +165,13 @@ trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} t
 #### ISL-128k-OSL-1024
 ```bash
 DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
-python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
-    --stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
+trtllm-bench --model ${DS_R1_NVFP4_MODEL_PATH} \
+    dataset \
+    --output /tmp/benchmarking_128k.txt \
     token-norm-dist \
     --input-mean 131072 --output-mean 1024 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 4 > /tmp/benchmarking_128k.txt
+    --num-requests 4
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
@@ -336,7 +338,7 @@ curl http://localhost:8000/v1/completions \
 }'
 ```
 
-For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
+For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
 For DeepSeek-V3, use the model name `deepseek-ai/DeepSeek-V3`.
 
 ### Disaggregated Serving
````

0 commit comments