
Commit 5e6a1ad

Port prepare dataset to trtllm-bench.
Squashed sub-commits:

- Add MacOSX DS_Store to gitignore.
- Update imports.
- Update click group.
- Updates to CLI.
- Rename.
- Add name.
- Renamed real dataset command.
- Change to group.
- Add docstring.
- Remove pass_obj.
- Fix context subscription.
- Updates to output.
- Updates to remove stdout.
- Add deprecation flag.
- Code clean up.
- Fix generator call.
- Update prepare_dataset in docs.
- Update examples.
- Update testing for trtllm-bench dataset.
- Remove trtllm-bench dataset from run_ex.
- Add missed __init__.py
- Re-add check for dataset subcommand.
- Fix execution of trtllm-bench dataset.

Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
1 parent 4180417 commit 5e6a1ad

File tree

22 files changed: +738 -126 lines


.gitignore

Lines changed: 3 additions & 0 deletions

````diff
@@ -79,3 +79,6 @@ compile_commands.json
 
 # Enroot sqsh files
 enroot/tensorrt_llm.devel.sqsh
+
+# MacOSX Files
+.DS_Store
````

benchmarks/cpp/prepare_dataset.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -49,7 +49,7 @@ def validate_tokenizer(self):
         return self
 
 
-@click.group()
+@click.group(deprecated=True)
 @click.option(
     "--tokenizer",
     required=True,
````
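The change above marks the legacy CLI group as deprecated while leaving it functional. A minimal sketch of this click pattern (the option and subcommand names here are illustrative; only the `deprecated=True` flag is taken from the diff):

```python
import click

# Marking a whole click group as deprecated surfaces a deprecation notice
# while keeping its commands callable during a migration window.
@click.group(deprecated=True)
@click.option("--tokenizer", required=True, help="Tokenizer name or path.")
@click.pass_context
def cli(ctx, tokenizer):
    """Legacy dataset-preparation entry point."""
    ctx.ensure_object(dict)
    ctx.obj["tokenizer"] = tokenizer

@cli.command("token-norm-dist")
@click.option("--num-requests", type=int, default=100)
@click.pass_context
def token_norm_dist(ctx, num_requests):
    # A real implementation would emit the synthetic dataset here.
    click.echo(f"{ctx.obj['tokenizer']}: {num_requests} requests")
```

Running any subcommand still works, but click reports the group as deprecated in `--help` output.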

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 11 additions & 10 deletions

````diff
@@ -248,13 +248,13 @@ To do the benchmark, run the following command:
 
 ```bash
 # generate synthetic dataset
-python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
-    --stdout \
-    --tokenizer nvidia/DeepSeek-R1-FP4 \
+trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
+    dataset \
+    --output dataset.txt \
     token-norm-dist \
     --input-mean 1024 --output-mean 2048 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 49152 > dataset.txt
+    --num-requests 49152
 
 YOUR_DATA_PATH=./dataset.txt
 
@@ -350,13 +350,14 @@ To do the benchmark, run the following command:
 
 ```bash
 # generate synthetic dataset
-python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
-    --stdout \
-    --tokenizer deepseek-ai/DeepSeek-R1 \
+trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
+    dataset \
+    --output dataset.txt \
     token-norm-dist \
     --input-mean 1024 --output-mean 2048 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 5120 > dataset.txt
+    --num-requests 5120
+
 YOUR_DATA_PATH=./dataset.txt
 
 cat >./extra-llm-api-config.yml<<EOF
@@ -401,7 +402,7 @@ Average request latency (ms): 181540.5739
 
 ## Exploring more ISL/OSL combinations
 
-To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
+To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use the `trtllm-bench dataset` subcommand to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
 ### WIP: Enable more features by default
 
 Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler and attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.
@@ -414,7 +415,7 @@ For more details on `max_batch_size` and `max_num_tokens`, refer to [Tuning Max
 
 ### MLA chunked context
 
-MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.
+MLA currently supports the chunked context feature on both Hopper and Blackwell GPUs. You can use `--enable_chunked_context` to enable it. This feature is primarily designed to reduce TPOT (Time Per Output Token). The default chunk size is set to `max_num_tokens`. If you want to achieve a lower TPOT, you can appropriately reduce the chunk size. However, please note that this will also decrease overall throughput. Therefore, a trade-off needs to be considered.
 
 For more details on `max_num_tokens`, refer to [Tuning Max Batch Size and Max Num Tokens](../performance/performance-tuning-guide/tuning-max-batch-size-and-max-num-tokens.md).
````

docs/source/developer-guide/perf-analysis.md

Lines changed: 6 additions & 4 deletions

````diff
@@ -72,10 +72,12 @@ Say we want to profile iterations 100 to 150 on a `trtllm-bench`/`trtllm-serve`
 #!/bin/bash
 
 # Prepare dataset for the benchmark
-python3 benchmarks/cpp/prepare_dataset.py \
-    --tokenizer=${MODEL_PATH} \
-    --stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
-    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
+trtllm-bench --model ${MODEL_PATH} \
+    dataset \
+    --output dataset.txt \
+    token-norm-dist \
+    --num-requests=${NUM_SAMPLES} \
+    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0
 
 # Benchmark and profile
 TLLM_PROFILE_START_STOP=100-150 nsys profile \
````
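`TLLM_PROFILE_START_STOP=100-150` restricts capture to an iteration window. A rough sketch of how such a range spec can be parsed and checked per iteration (a hypothetical helper, not TensorRT LLM's actual implementation):

```python
import os

def profile_window(spec: str) -> range:
    """Parse a 'START-STOP' iteration spec such as '100-150' into an
    inclusive range of iteration indices to profile."""
    start, stop = (int(part) for part in spec.split("-"))
    return range(start, stop + 1)

# The runtime would consult the window once per iteration to decide
# whether the profiler should be active.
window = profile_window(os.environ.get("TLLM_PROFILE_START_STOP", "100-150"))
captured = [it for it in (99, 100, 125, 150, 151) if it in window]  # -> [100, 125, 150]
```

With stdev-0 datasets every request has the same shape, so a fixed iteration window like this samples steady-state behavior.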

docs/source/developer-guide/perf-benchmarking.md

Lines changed: 11 additions & 10 deletions

````diff
@@ -150,7 +150,7 @@ directory. For example, to generate a synthetic dataset of 1000 requests with a
 128/128 for [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), run:
 
 ```shell
-python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt
+trtllm-bench --model meta-llama/Llama-3.1-8B dataset --output /tmp/synthetic_128_128.txt token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000
 ```
 
 ### Running with the PyTorch Workflow
@@ -231,13 +231,13 @@ The PyTorch workflow supports benchmarking with LoRA (Low-Rank Adaptation) adapt
 
 **Preparing LoRA Dataset**
 
-Use `prepare_dataset.py` with LoRA-specific options to generate requests with LoRA metadata:
+Use `trtllm-bench dataset` with LoRA-specific options to generate requests with LoRA metadata:
 
 ```shell
-python3 benchmarks/cpp/prepare_dataset.py \
-    --stdout \
+trtllm-bench \
+    --model /path/to/tokenizer \
+    dataset \
     --rand-task-id 0 1 \
-    --tokenizer /path/to/tokenizer \
     --lora-dir /path/to/loras \
     token-norm-dist \
     --num-requests 100 \
@@ -308,17 +308,18 @@ Each subdirectory should contain the LoRA adapter files for that specific task.
 To benchmark multi-modal models with PyTorch workflow, you can follow the similar approach as above.
 
 First, prepare the dataset:
-```python
-python ./benchmarks/cpp/prepare_dataset.py \
-    --tokenizer Qwen/Qwen2-VL-2B-Instruct \
-    --stdout \
+```bash
+trtllm-bench \
+    --model Qwen/Qwen2-VL-2B-Instruct \
     dataset \
+    --output mm_data.jsonl
+    real-dataset
     --dataset-name lmms-lab/MMMU \
     --dataset-split test \
     --dataset-image-key image \
     --dataset-prompt-key question \
     --num-requests 10 \
-    --output-len-dist 128,5 > mm_data.jsonl
+    --output-len-dist 128,5
 ```
 It will download the media files to `/tmp` directory and prepare the dataset with their paths. Note that the `prompt` fields are texts and not tokenized ids. This is due to the fact that
 the `prompt` and the media (image/video) are processed by a preprocessor for multimodal files.
````
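The `token-norm-dist` generator used throughout these examples draws per-request input/output token counts from a normal distribution (a stdev of 0 yields fixed lengths). A rough stdlib sketch of the idea, with hypothetical field names — the real record layout is defined by trtllm-bench:

```python
import json
import random

def synth_requests(num_requests, input_mean, input_stdev,
                   output_mean, output_stdev, vocab_size=32000, seed=0):
    """Yield JSONL-style records with normally distributed token lengths."""
    rng = random.Random(seed)
    for task_id in range(num_requests):
        # Sample lengths; clamp so every request has at least one token.
        isl = max(1, round(rng.gauss(input_mean, input_stdev)))
        osl = max(1, round(rng.gauss(output_mean, output_stdev)))
        yield {
            "task_id": task_id,
            "input_ids": [rng.randrange(vocab_size) for _ in range(isl)],
            "output_tokens": osl,
        }

# With stdev 0 every request gets exactly 128 input and 128 output tokens.
lines = [json.dumps(record) for record in synth_requests(3, 128, 0, 128, 0)]
```

Writing `lines` to a file, one record per line, mirrors what `--output` produces: a JSONL dataset the `throughput`/`latency` subcommands can consume.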

docs/source/legacy/performance/perf-analysis.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -66,10 +66,10 @@ Say we want to profile iterations 100 to 150 on a trtllm-bench/trtllm-serve run,
 #!/bin/bash
 
 # Prepare dataset for the benchmark
-python3 benchmarks/cpp/prepare_dataset.py \
-    --tokenizer=${MODEL_PATH} \
-    --stdout token-norm-dist --num-requests=${NUM_SAMPLES} \
-    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
+trtllm-bench \
+    --model=${MODEL_PATH} dataset \
+    --output /tmp/dataset.txt token-norm-dist --num-requests=${NUM_SAMPLES} \
+    --input-mean=1000 --output-mean=1000 --input-stdev=0 --output-stdev=0
 
 # Benchmark and profile
 TLLM_PROFILE_START_STOP=100-150 nsys profile \
````

docs/source/legacy/performance/perf-benchmarking.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -110,7 +110,7 @@ of 128:128.
 To run the benchmark from start to finish, run the following commands:
 
 ```shell
-python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt
+trtllm-bench --tokenizer meta-llama/Llama-3.1-8B dataset --output /tmp/synthetic_128_128.txt token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000
 trtllm-bench --model meta-llama/Llama-3.1-8B build --dataset /tmp/synthetic_128_128.txt --quantization FP8
 trtllm-bench --model meta-llama/Llama-3.1-8B throughput --dataset /tmp/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-3.1-8B/tp_1_pp_1
 ```
@@ -207,7 +207,7 @@ directory. For example, to generate a synthetic dataset of 1000 requests with a
 128/128 for [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), run:
 
 ```shell
-benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt
+trtllm-bench --tokenizer meta-llama/Llama-3.1-8B dataset --output /tmp/synthetic_128_128.txt token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000
 ```
 
 ### Building a Benchmark Engine
````

examples/llm-api/llm_mgmn_trtllm_bench.sh

Lines changed: 4 additions & 5 deletions

````diff
@@ -35,7 +35,6 @@
 # not supported in Slurm mode, you need to download the model and put it in
 # the LOCAL_MODEL directory.
 
-export prepare_dataset="$SOURCE_ROOT/benchmarks/cpp/prepare_dataset.py"
 export data_path="$WORKDIR/token-norm-dist.txt"
 
 echo "Preparing dataset..."
@@ -50,14 +49,14 @@ srun -l \
     --mpi=pmix \
     bash -c "
         $PROLOGUE
-        python3 $prepare_dataset \
-            --tokenizer=$LOCAL_MODEL \
-            --stdout token-norm-dist \
+        trtllm-bench --model=$LOCAL_MODEL dataset \
+            --output $data_path \
+            token-norm-dist \
             --num-requests=100 \
            --input-mean=128 \
            --output-mean=128 \
            --input-stdev=0 \
-           --output-stdev=0 > $data_path
+           --output-stdev=0
     "
 
 echo "Running benchmark..."
````

examples/llm-api/out_of_tree_example/readme.md

Lines changed: 11 additions & 1 deletion

````diff
@@ -42,7 +42,17 @@ Similar to the quickstart example, you can use the same CLI argument with `trtll
 
 Prepare the dataset:
 ```
-python ./benchmarks/cpp/prepare_dataset.py --tokenizer ./model_ckpt --stdout dataset --dataset-name lmms-lab/MMMU --dataset-split test --dataset-image-key image --dataset-prompt-key "question" --num-requests 100 --output-len-dist 128,5 > mm_data.jsonl
+trtllm-bench \
+    --model ./model_ckpt \
+    dataset \
+    --output mm_data.jsonl
+    real-dataset
+    --dataset-name lmms-lab/MMMU \
+    --dataset-split test \
+    --dataset-image-key image \
+    --dataset-prompt-key question \
+    --num-requests 10 \
+    --output-len-dist 128,5
 ```
 
````

examples/models/core/deepseek_v3/README.md

Lines changed: 9 additions & 7 deletions

````diff
@@ -138,12 +138,13 @@ To avoid OOM (out of memory) error, you need to adjust the values of "--max_batc
 #### ISL-64k-OSL-1024
 ```bash
 DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
-python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
-    --stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
+trtllm-bench --model ${DS_R1_NVFP4_MODEL_PATH} \
+    dataset \
+    --output /tmp/benchmarking_64k.txt \
     token-norm-dist \
     --input-mean 65536 --output-mean 1024 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 24 > /tmp/benchmarking_64k.txt
+    --num-requests 24
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
@@ -164,12 +165,13 @@ trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} t
 #### ISL-128k-OSL-1024
 ```bash
 DS_R1_NVFP4_MODEL_PATH=/path/to/DeepSeek-R1
-python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
-    --stdout --tokenizer ${DS_R1_NVFP4_MODEL_PATH} \
+trtllm-bench --model ${DS_R1_NVFP4_MODEL_PATH} \
+    dataset \
+    --output /tmp/benchmarking_128k.txt \
     token-norm-dist \
     --input-mean 131072 --output-mean 1024 \
     --input-stdev 0 --output-stdev 0 \
-    --num-requests 4 > /tmp/benchmarking_128k.txt
+    --num-requests 4
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
@@ -336,7 +338,7 @@ curl http://localhost:8000/v1/completions \
 }'
 ```
 
-For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
+For DeepSeek-R1 FP4, use the model name `nvidia/DeepSeek-R1-FP4-v2`.
 For DeepSeek-V3, use the model name `deepseek-ai/DeepSeek-V3`.
 
 ### Disaggregated Serving
````

0 commit comments