
Conversation

@cjluo-nv (Collaborator) commented Oct 9, 2025

What does this PR do?

Type of change: Update default calibration dataset

Overview:

We now plan to use a mixture of cnn_dailymail and nvidia/Nemotron-Post-Training-Dataset-v2 as the default calibration dataset, as it shows overall the same or better model accuracy after PTQ, especially for AWQ tasks.

This PR also:

  1. Adds the nvidia/Nemotron-Post-Training-Dataset-v1 dataset, which includes reasoning, as an alternative.
  2. Adds a --calib_seq flag to the PTQ script so users can specify the maximum calibration sample sequence length (a sketch follows this list).
  3. Increases the maximum calibration batch size from 64 to 512 to accelerate PTQ when the batch size is auto-inferred.
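For illustration, a minimal argparse sketch of the new flag next to the existing sample-count option. Only --calib_seq and its default of 512 come from this PR; the --calib_size line and everything else here are assumed:

```python
import argparse

parser = argparse.ArgumentParser(description="PTQ calibration options (sketch)")
parser.add_argument("--calib_size", type=int, default=512,
                    help="Number of calibration samples (illustrative default).")
parser.add_argument("--calib_seq", type=int, default=512,
                    help="Maximum sequence length of each calibration sample.")
args = parser.parse_args()

# Per the review walkthrough, hf_ptq.py threads this value into the
# calibration dataloaders and batch-size estimation as max_sample_length.
print(f"Calibrating with {args.calib_size} samples of up to {args.calib_seq} tokens each")
```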

Testing

Benchmarked with GPQA and AIME, comparing cnn_dailymail calibration against Nemotron dataset calibration.

Summary by CodeRabbit

  • New Features
    • Default calibration dataset updated to Nemotron Post-Training Dataset v2, with optional v1 support.
    • Added CLI option to set calibration sequence length (calib_seq, default 512).
  • Scripts now accept CALIB_SEQ to pass the calibration sequence length through the command line.
    • Expanded calibration batch size handling to support larger batches for improved throughput.
  • Documentation
    • Changelog updated to reflect new datasets, default selection, and calib_seq option.

@cjluo-nv requested review from a team as code owners on October 9, 2025 19:43.

coderabbitai bot commented Oct 9, 2025

Walkthrough

Adds a new CLI option --calib_seq (default 512); introduces Nemotron post-training datasets as defaults; threads calib_seq into calibration dataloaders and batch-size estimation; updates shell scripts to pass calib_seq; and raises memory-based batch-size thresholds in dataset utilities.

Changes

  • PTQ calibration CLI and flow (examples/llm_ptq/hf_ptq.py): Adds public --calib_seq (int, default 512); uses it as max_sample_length for calibration dataloaders and when computing batch sizes; adjusts default datasets to include nemotron-post-training-dataset-v2 and emits combined warnings when multiple datasets are used; aligns calib_size to the dataset count.
  • Shell scripts, arg parsing and invocation (examples/llm_ptq/scripts/parser.sh, examples/llm_ptq/scripts/huggingface_example.sh): Adds calib_seq option parsing and DEFAULT_CALIB_SEQ=512 in parser.sh; initializes and echoes CALIB_SEQ; huggingface_example.sh conditionally appends --calib_seq=$CALIB_SEQ to the PTQ invocation when set.
  • Dataset utilities (modelopt/torch/utils/dataset_utils.py): Registers nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in SUPPORTED_DATASET_CONFIG with splits and a preprocess that joins sample["messages"] (see the sketch after this list); raises the memory-threshold logic in get_max_batch_size from 64 to 512 (fallback/limit adjustments).
  • Docs/changelog (CHANGELOG.rst): Documents the new Nemotron datasets and the calib_seq option for examples/llm_ptq.
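A rough sketch of what such a dataset registration could look like. The field names and split below are assumptions based on the walkthrough, not the exact SUPPORTED_DATASET_CONFIG schema:

```python
# Hypothetical shape of the new entries in
# modelopt/torch/utils/dataset_utils.py; the real schema may differ.
SUPPORTED_DATASET_CONFIG = {
    "nemotron-post-training-dataset-v2": {
        "dataset": "nvidia/Nemotron-Post-Training-Dataset-v2",
        "split": "train",  # assumed split name
        # Each sample carries a list of chat messages; the preprocess joins
        # their contents into a single calibration string.
        "preprocess": lambda sample: "\n".join(
            message["content"] for message in sample["messages"]
        ),
    },
}
```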

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Parser as parser.sh
  participant Example as huggingface_example.sh
  participant PTQ as hf_ptq.py
  participant DS as dataset_utils.py

  User->>Parser: provide options (may include calib_seq)
  Parser-->>Parser: parse, set CALIB_SEQ (default 512)
  Parser->>Example: export CALIB_SEQ
  Example->>PTQ: invoke hf_ptq.py (includes --calib_seq if set)
  PTQ->>PTQ: parse args (calib_seq) and fix calib_size to dataset count
  PTQ->>DS: get_dataset_dataloader(max_sample_length=calib_seq)
  DS-->>PTQ: dataloader (Nemotron datasets available)
  PTQ->>DS: get_max_batch_size(..., max_sample_length=calib_seq)
  DS-->>PTQ: batch size computed (memory thresholds up to 512)
  PTQ->>PTQ: run calibration / sparsification / quantization / export with calib_seq

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A hop through lines and token threads,
Calib set to five-twelve ahead.
Nemotron messages stitched in a row,
Scripts pass the number so pipelines know.
Bunny bounces—calibrate, then go! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title succinctly and accurately describes the primary change (using the Nemotron post-training dataset for calibration) and references the relevant ticket identifier.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch chenjiel/use_nemotron_dataset

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc222fb and 62f62f4.

📒 Files selected for processing (2)
  • CHANGELOG.rst (1 hunks)
  • examples/llm_ptq/hf_ptq.py (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (4)
examples/llm_ptq/hf_ptq.py (4)

299-307: Verify the performance impact of using two default datasets.

Defaulting to both cnn_dailymail and nemotron-post-training-dataset-v2 will increase calibration time compared to the previous single-dataset default. Additionally, the calib_size adjustment logic at lines 305-307 automatically extends or truncates calib_size to match the dataset count, which applies even when users explicitly provide datasets. For example, if a user specifies three datasets but only one calib_size value, the last value will be replicated for all remaining datasets.

Consider documenting this behavior change and the automatic adjustment logic to set clear expectations about calibration time and dataset handling.
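For illustration, a self-contained sketch of the replicate-or-truncate behavior described above (not the PR's exact code at lines 305-307):

```python
def align_calib_size(calib_size: list[int], num_datasets: int) -> list[int]:
    """Repeat the last value or truncate so the list matches the dataset count."""
    if len(calib_size) < num_datasets:
        calib_size = calib_size + [calib_size[-1]] * (num_datasets - len(calib_size))
    return calib_size[:num_datasets]

# Three datasets but one size given: the last value is replicated.
assert align_calib_size([512], 3) == [512, 512, 512]
# More sizes than datasets: the extras are dropped.
assert align_calib_size([256, 512, 128], 2) == [256, 512]
```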



358-358: LGTM!

Correctly threads args.calib_seq into the sparsity calibration dataloader, providing users explicit control over the maximum sample length for sparsification.



400-400: LGTM!

Appropriately uses args.calib_seq for batch size estimation, ensuring the calculated batch size reflects the specified maximum sequence length.
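To see why the sequence length belongs in this estimate, here is a toy sketch of the memory-budget idea. It is not the actual get_max_batch_size implementation, and all numbers are made up:

```python
def estimate_max_batch_size(free_mem_bytes: int, bytes_per_token: int,
                            max_sample_length: int, cap: int = 512) -> int:
    """Toy estimate: longer samples each need more activation memory, so
    fewer fit per batch; the cap mirrors the new 512 upper limit."""
    per_sample_bytes = bytes_per_token * max_sample_length
    return max(1, min(cap, free_mem_bytes // per_sample_bytes))

# Lengthening the samples shrinks the affordable batch size.
print(estimate_max_batch_size(2**30, 4096, 512))   # 512 (hits the cap)
print(estimate_max_batch_size(2**30, 4096, 4096))  # 64
```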



691-696: LGTM!

The new --calib_seq argument is well-defined with a clear help message and sensible default of 512.




@cjluo-nv force-pushed the chenjiel/use_nemotron_dataset branch from 86bcc39 to dc222fb on October 9, 2025 19:47.

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.37%. Comparing base (5b02483) to head (62f62f4).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/utils/dataset_utils.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #420   +/-   ##
=======================================
  Coverage   73.36%   73.37%           
=======================================
  Files         180      180           
  Lines       17919    17934   +15     
=======================================
+ Hits        13147    13159   +12     
- Misses       4772     4775    +3     

☔ View full report in Codecov by Sentry.

CHANGELOG.rst

- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to ``nemotron-post-training-dataset-v2`` if no dataset is specified.
- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
Contributor:

General question: what is the difference between calib_size and calib_seq?

Collaborator (author):

calib_size is the number of calibration samples.
calib_seq is the maximum sequence length of each calibration sample.
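In other words (illustrative values only):

```python
calib_size = 512  # how many samples are drawn for calibration
calib_seq = 512   # the most tokens any single sample may contain
max_total_tokens = calib_size * calib_seq  # upper bound on calibration tokens
```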

Contributor:

nit:
@cjluo-nv should we be more verbose and use --calib_seq_len for clarity?

Collaborator (author):

ACK. Personally I prefer shorter flags.

CHANGELOG.rst Outdated
- Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to ``nemotron-post-training-dataset-v2`` if no dataset is specified.
@kevalmorabia97 (Collaborator) commented Oct 10, 2025

Suggested change
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to ``nemotron-post-training-dataset-v2`` if no dataset is specified.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default changed from ``cnn_dailymail`` to ``nemotron-post-training-dataset-v2`` if no dataset is specified.

@cjluo-nv changed the title from "Use nemotron post training dataset for calibration" to "[OMNIML-2791] Use nemotron post training dataset for calibration" on Oct 13, 2025.
@cjluo-nv merged commit 32d168c into main on Oct 14, 2025 (30 of 32 checks passed).
@cjluo-nv deleted the chenjiel/use_nemotron_dataset branch on October 14, 2025 18:42.
