[TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench #9250

FrankD412 merged 12 commits into NVIDIA:main from
Conversation
📝 Walkthrough

This pull request adds dataset preparation infrastructure to the TensorRT LLM benchmarking suite, introducing CLI commands to generate synthetic datasets and prepare real HuggingFace datasets, along with supporting data models and utilities. It also updates the main bench CLI to register the new prepare-dataset command and adds .DS_Store to .gitignore.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant PrepareDataset as prepare_dataset<br/>(CLI Group)
    participant RootArgs as RootArgs<br/>(Validation)
    participant Subcommands as Subcommands<br/>(real-dataset,<br/>token_norm_dist,<br/>token_unif_dist)
    participant Utils as Dataset Utils<br/>(Serialization &<br/>Sampling)
    participant Output as Output File

    User->>PrepareDataset: Run CLI with options
    PrepareDataset->>RootArgs: Parse CLI args & build model
    activate RootArgs
    RootArgs->>RootArgs: Load & validate tokenizer
    RootArgs-->>PrepareDataset: Return validated args
    deactivate RootArgs
    PrepareDataset->>Subcommands: Invoke subcommand<br/>(real-dataset or synthetic)
    activate Subcommands
    alt Real Dataset Path
        Subcommands->>Subcommands: Load HF dataset<br/>& DatasetConfig
        Subcommands->>Utils: Generate samples<br/>(text or multimodal)
    else Synthetic Dataset Path
        Subcommands->>Utils: Sample lengths from<br/>distribution
        Subcommands->>Utils: Generate random tokens
    end
    Subcommands->>Utils: Serialize Workload<br/>(TextSample/MultimodalSample)
    Utils->>Output: Write JSON to file
    Subcommands-->>User: Complete
    deactivate Subcommands
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 6
🧹 Nitpick comments (8)
tensorrt_llm/bench/dataset/prepare_dataset.py (1)
34-49: Improve exception chaining for better debugging.

The exception handling would benefit from using `raise ... from e` to preserve the original traceback context. Apply this diff:

```diff
 except EnvironmentError as e:
     raise ValueError(
         "Cannot find a tokenizer from the given string because of "
         f"{e}\nPlease set tokenizer to the directory that contains "
         "the tokenizer, or set to a model name in HuggingFace."
-    )
+    ) from e
```

tensorrt_llm/bench/dataset/prepare_synthetic_data.py (3)
15-38: Add stacklevel to warning for better debugging.

The warning should include `stacklevel=2` to point to the caller's code rather than this helper function. Apply this diff:

```diff
 if use_task_ids and not use_lora:
     warnings.warn(
-        "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.", UserWarning
+        "Task IDs require LoRA directory. Use --lora-dir or omit task IDs.",
+        UserWarning,
+        stacklevel=2,
     )
```
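A self-contained illustration (not code from this PR) of what `stacklevel=2` changes: the recorded warning is attributed to the caller's frame rather than to the helper's internals, which is what `-W error` filters and log tooling surface.

```python
import warnings

def check_flags(use_task_ids: bool, use_lora: bool):
    if use_task_ids and not use_lora:
        # stacklevel=2 attributes the warning to whoever called check_flags,
        # not to this line inside the helper.
        warnings.warn("Task IDs require a LoRA directory.", UserWarning, stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_flags(use_task_ids=True, use_lora=False)

assert len(caught) == 1
assert caught[0].category is UserWarning
# The reported location is the call site (this file), not an internal frame.
print(caught[0].filename, caught[0].lineno)
```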
50-52: Remove unused variable initializations.

These variables are immediately overwritten and never used in their initial state. Apply this diff:

```diff
-    input_ids = []
-    input_lens = []
-    output_lens = []
     input_lens = get_norm_dist_lengths(
```
110-112: Remove unused variable initializations.

These variables are immediately overwritten and never used in their initial state. Apply this diff:

```diff
-    input_ids = []
-    input_lens = []
-    output_lens = []
     input_lens = get_unif_dist_lengths(
```

tensorrt_llm/bench/dataset/prepare_real_data.py (2)
21-32: Use Click's BadParameter exception for better error reporting.

Using `click.BadParameter` provides better integration with Click's error handling and gives users clearer feedback about which parameter failed. Apply this diff:

```diff
 def validate_output_len_dist(ctx, param, value):
     """Validate the --output-len-dist option."""
     if value is None:
         return value
     m = re.match(r"(\d+),(\d+)", value)
     if m:
         return int(m.group(1)), int(m.group(2))
     else:
-        raise AssertionError(
+        raise click.BadParameter(
             "Incorrect specification for --output-len-dist. Correct format: "
             "--output-len-dist <output_len_mean>,<output_len_stdev>"
         )
```
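To sanity-check the `<mean>,<stdev>` format outside of Click, here is a minimal stdlib-only sketch (`parse_len_dist` is a hypothetical name; unlike the original pattern, it anchors the regex with `$` so trailing garbage is rejected):

```python
import re

def parse_len_dist(value: str) -> tuple[int, int]:
    # Mirrors the validator's format: "<mean>,<stdev>", both non-negative ints.
    # The trailing $ rejects inputs like "128,16abc" that an unanchored
    # re.match would silently accept.
    m = re.match(r"(\d+),(\d+)$", value)
    if m is None:
        raise ValueError(
            "Correct format: --output-len-dist <output_len_mean>,<output_len_stdev>"
        )
    return int(m.group(1)), int(m.group(2))

print(parse_len_dist("128,16"))  # → (128, 16)
```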
55-61: Use ValueError instead of AssertionError in validator.

Validators should raise `ValueError` for invalid data rather than `AssertionError`, which is typically reserved for internal invariants. Apply this diff:

```diff
 @model_validator(mode="after")
 def check_prompt(self) -> "DatasetConfig":
     if self.prompt_key and self.prompt:
-        raise AssertionError("--prompt-key and --prompt cannot be set at the same time.")
+        raise ValueError("--prompt-key and --prompt cannot be set at the same time.")
     if (not self.prompt_key) and (not self.prompt):
-        raise AssertionError("Either --prompt-key or --prompt must be set.")
+        raise ValueError("Either --prompt-key or --prompt must be set.")
     return self
```

tensorrt_llm/bench/dataset/utils.py (2)
29-40: Consider using Pydantic's standard patterns.

The custom `__init__` could be replaced with a `@model_validator(mode='after')` or `model_post_init` hook for better alignment with Pydantic conventions. Example using `@model_validator`:

```diff
 class Workload(BaseModel):
     metadata: dict
     samples: List[Union[TextSample, MultimodalSample]] = []

-    def __init__(self, **kwargs) -> None:
-        super().__init__(**kwargs)
-        self.setup_workload_name()
-
-    def setup_workload_name(self):
+    @model_validator(mode='after')
+    def setup_workload_name(self):
         # Keys to ignore
         ignore_keys = ["tokenizer"]
         # Create a string by concatenating keys and values with "__"
         workload_name = "__".join(
             f"{key}:{value}" for key, value in self.metadata.items() if key not in ignore_keys
         )
         self.metadata.setdefault("workload_name", workload_name)
+        return self
```
96-104: Add strict parameter to zip for safety.

Without `strict=True`, mismatched list lengths would be silently truncated, potentially hiding bugs. Apply this diff:

```diff
 def print_multimodal_dataset(multimodal_texts, multimodal_image_paths, output_lens):
-    for i, (text, image_paths) in enumerate(zip(multimodal_texts, multimodal_image_paths)):
+    for i, (text, image_paths) in enumerate(zip(multimodal_texts, multimodal_image_paths, strict=True)):
         d = {
             "task_id": i,
             "prompt": text,
             "media_paths": image_paths,
             "output_tokens": output_lens[i],
         }
         yield json.dumps(d, separators=(",", ":"), ensure_ascii=False)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- .gitignore (1 hunks)
- tensorrt_llm/bench/dataset/prepare_dataset.py (1 hunks)
- tensorrt_llm/bench/dataset/prepare_real_data.py (1 hunks)
- tensorrt_llm/bench/dataset/prepare_synthetic_data.py (1 hunks)
- tensorrt_llm/bench/dataset/utils.py (1 hunks)
- tensorrt_llm/commands/bench.py (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-18T08:42:02.640Z
Learnt from: samuellees
Repo: NVIDIA/TensorRT-LLM PR: 6974
File: tensorrt_llm/serve/scripts/benchmark_dataset.py:558-566
Timestamp: 2025-08-18T08:42:02.640Z
Learning: In TensorRT-LLM's RandomDataset (tensorrt_llm/serve/scripts/benchmark_dataset.py), when using --random-token-ids option, sequence length accuracy is prioritized over semantic correctness for benchmarking purposes. The encode/decode operations should use skip_special_tokens=True and add_special_tokens=False to ensure exact target token lengths.
Applied to files:
- tensorrt_llm/bench/dataset/prepare_dataset.py
- tensorrt_llm/bench/dataset/utils.py
- tensorrt_llm/bench/dataset/prepare_synthetic_data.py
🧬 Code graph analysis (4)
tensorrt_llm/commands/bench.py (1)
tensorrt_llm/bench/dataset/prepare_dataset.py (1)
- prepare_dataset (73-84)
tensorrt_llm/bench/dataset/prepare_dataset.py (4)
tensorrt_llm/bench/dataset/prepare_real_data.py (1)
- real_dataset (209-314)

tensorrt_llm/bench/dataset/prepare_synthetic_data.py (2)
- token_norm_dist (48-90)
- token_unif_dist (108-150)

tensorrt_llm/builder.py (1)
- default (45-50)

tensorrt_llm/bench/benchmark/__init__.py (1)
- checkpoint_path (75-76)
tensorrt_llm/bench/dataset/prepare_real_data.py (1)
tensorrt_llm/bench/dataset/utils.py (3)
- get_norm_dist_lengths (122-126)
- print_multimodal_dataset (96-104)
- print_text_dataset (79-93)
tensorrt_llm/bench/dataset/prepare_synthetic_data.py (1)
tensorrt_llm/bench/dataset/utils.py (5)
- gen_random_tokens (136-157)
- get_norm_dist_lengths (122-126)
- get_unif_dist_lengths (129-133)
- print_text_dataset (79-93)
- text_dataset_dump (43-57)
🪛 Ruff (0.14.5)
tensorrt_llm/bench/dataset/prepare_dataset.py
41-45: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
41-45: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/bench/dataset/utils.py
97-97: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
142-142: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
tensorrt_llm/bench/dataset/prepare_real_data.py
21-21: Unused function argument: ctx
(ARG001)
21-21: Unused function argument: param
(ARG001)
29-32: Avoid specifying long messages outside the exception class
(TRY003)
58-58: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Avoid specifying long messages outside the exception class
(TRY003)
109-115: Avoid specifying long messages outside the exception class
(TRY003)
150-150: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
247-247: Avoid specifying long messages outside the exception class
(TRY003)
268-268: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
tensorrt_llm/bench/dataset/prepare_synthetic_data.py
21-21: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
30-30: No explicit stacklevel keyword argument found
Set stacklevel=2
(B028)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
.gitignore (1)
88-90: ✓ Sensible addition to ignore macOS system files.

Adding `.DS_Store` to `.gitignore` is a standard best practice that prevents macOS Finder metadata from being committed. The comment clearly delineates the macOS-specific section, and the placement is logical.

tensorrt_llm/commands/bench.py (1)
9-9: LGTM! Clean integration of the new dataset preparation command.

The import and command registration follow the existing pattern consistently.
Also applies to: 61-61
tensorrt_llm/bench/dataset/prepare_dataset.py (2)
25-32: LGTM! Well-structured data model.

The RootArgs model cleanly captures all necessary configuration parameters for dataset preparation.

87-89: LGTM! Command registration follows Click conventions.

tensorrt_llm/bench/dataset/utils.py (2)
43-76: LGTM! Clean serialization functions.

Both dump functions properly handle directory creation and JSON serialization.
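As a hedged, stdlib-only sketch of the general pattern the review is approving (the function name and record fields here are illustrative, not the actual utils.py API): create the output directory if needed, then write one compact JSON object per line.

```python
import json
import os
import tempfile

def text_dataset_dump_sketch(samples, output_file):
    """Sketch of a dataset dump: ensure the parent directory exists,
    then write one compact JSON object per line (JSONL)."""
    os.makedirs(os.path.dirname(output_file), exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, separators=(",", ":"), ensure_ascii=False) + "\n")

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "datasets", "workload.jsonl")  # directory does not exist yet
text_dataset_dump_sketch([{"task_id": 0, "prompt": "hi", "output_tokens": 8}], path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
assert len(lines) == 1
assert json.loads(lines[0])["prompt"] == "hi"
```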
136-157: LGTM! Robust random token generation.

The function correctly handles EOS token exclusion and uses deterministic seeding for reproducibility. Based on learnings, this aligns with TensorRT-LLM's approach of prioritizing sequence length accuracy over semantic correctness for benchmarking.
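The general pattern being praised — deterministic seeding plus EOS exclusion with exact target lengths — can be sketched as follows (a hypothetical stand-in, not the actual `gen_random_tokens` implementation):

```python
import random

def gen_random_tokens_sketch(lengths, vocab_size, eos_id, seed=0):
    """Sample token IDs uniformly at random, skipping the EOS ID,
    reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG: same seed -> same dataset
    candidates = [t for t in range(vocab_size) if t != eos_id]
    return [[rng.choice(candidates) for _ in range(n)] for n in lengths]

samples = gen_random_tokens_sketch(lengths=[4, 2], vocab_size=10, eos_id=2, seed=42)
assert all(2 not in s for s in samples)                         # EOS never emitted
assert [len(s) for s in samples] == [4, 2]                      # exact target lengths
assert samples == gen_random_tokens_sketch([4, 2], 10, 2, 42)   # reproducible
```

Excluding EOS matters for benchmarking because an early EOS would let the model stop before the requested output length, skewing sequence-length accuracy.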
Force-pushed 09f8d9e to c0368cb (Compare)
/bot run

PR_Github #25125 [ run ] triggered by Bot. Commit:

PR_Github #25125 [ run ] completed with state
Force-pushed e3303e4 to 61ec730 (Compare)
/bot run

PR_Github #25237 [ run ] triggered by Bot. Commit:

PR_Github #25237 [ run ] completed with state

/bot run

PR_Github #25244 [ run ] triggered by Bot. Commit:

PR_Github #25244 [ run ] completed with state
Force-pushed 9fac95c to 2523e3d (Compare)
/bot run

PR_Github #25260 [ run ] triggered by Bot. Commit:

PR_Github #25260 [ run ] completed with state
Force-pushed 05962eb to 5e6a1ad (Compare)
PR_Github #26700 [ run ] completed with state

/bot run

PR_Github #26706 [ run ] triggered by Bot. Commit:

PR_Github #26703 [ run ] completed with state

PR_Github #26706 [ run ] completed with state

/bot run

PR_Github #26750 [ run ] triggered by Bot. Commit:

PR_Github #26750 [ run ] completed with state

/bot run

PR_Github #27171 [ run ] triggered by Bot. Commit:

PR_Github #27171 [ run ] completed with state
Commit messages (each signed off by Frank Di Natale <3429989+FrankD412@users.noreply.github.com>):

- Add MacOSX DS_Store to gitignore.
- Update imports.
- Update click group.
- Updates to CLI.
- Rename.
- Add name.
- Renamed real dataset command.
- Change to group.
- Add docstring.
- Remove pass_obj.
- Fix context subscription.
- Updates to output.
- Updates to remove stdout.
- Add deprecation flag.
- Code clean up.
- Fix generator call.
- Update prepare_dataset in docs.
- Update examples.
- Update testing for trtllm-bench dataset.
- Remove trtllm-bench dataset from run_ex.
- Add missed __init__.py
- Re-add check for dataset subcommand.
- Fix execution of trtllm-bench dataset.
/bot reuse-pipeline 20733

PR_Github #27338 Bot args parsing error: usage: /bot [-h]

/bot skip --comment "Passed previously, rebased to fix conflict in docs."

PR_Github #27341 [ skip ] triggered by Bot. Commit:

PR_Github #27341 [ skip ] completed with state
Summary by CodeRabbit
New Features
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill`

Kill all running builds associated with the pull request.
skip
`skip --comment COMMENT`

Skip testing for the latest commit on the pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.