
Conversation

@cjluo-nv (Collaborator) commented Oct 9, 2025

What does this PR do?

Type of change: Update default calibration dataset

Overview:

We now plan to use a mixture of cnn_dailymail and nvidia/Nemotron-Post-Training-Dataset-v2 as the default calibration dataset, as it shows overall the same or better model accuracy after PTQ, especially for AWQ tasks.

This PR also:

  1. Adds the nvidia/Nemotron-Post-Training-Dataset-v1 dataset, which includes reasoning, as an alternative.
  2. Adds a --calib_seq flag to the PTQ script so users can specify the maximum calibration sample sequence length (a sketch follows this list).
  3. Increases the maximum calibration batch size from 64 to 512 to accelerate PTQ when the batch size is auto-inferred.
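For illustration, a minimal argparse sketch of the new flag next to the existing sample-count option. Only --calib_seq and its default of 512 come from this PR; the --calib_size line and everything else here are assumed:

```python
import argparse

parser = argparse.ArgumentParser(description="PTQ calibration options (sketch)")
parser.add_argument("--calib_size", type=int, default=512,
                    help="Number of calibration samples (illustrative default).")
parser.add_argument("--calib_seq", type=int, default=512,
                    help="Maximum sequence length of each calibration sample.")
args = parser.parse_args()

# Per the review walkthrough, hf_ptq.py threads this value into the
# calibration dataloaders and batch-size estimation as max_sample_length.
print(f"Calibrating with {args.calib_size} samples of up to {args.calib_seq} tokens each")
```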

Testing

Benchmarked with GPQA and AIME, comparing cnn_dailymail calibration against Nemotron dataset calibration.

Summary by CodeRabbit

  • New Features
    • Default calibration dataset updated to Nemotron Post-Training Dataset v2, with optional v1 support.
    • Added CLI option to set calibration sequence length (calib_seq, default 512).
  • Scripts now accept CALIB_SEQ to pass the calibration sequence length through the command line.
    • Expanded calibration batch size handling to support larger batches for improved throughput.
  • Documentation
    • Changelog updated to reflect new datasets, default selection, and calib_seq option.

@cjluo-nv requested review from a team as code owners on October 9, 2025 19:43.

coderabbitai bot commented Oct 9, 2025

Walkthrough

Adds a new CLI option --calib_seq (default 512); introduces Nemotron post-training datasets as defaults; threads calib_seq into calibration dataloaders and batch-size estimation; updates shell scripts to pass calib_seq; and raises memory-based batch-size thresholds in dataset utilities.

Changes

  • PTQ calibration CLI and flow (examples/llm_ptq/hf_ptq.py): Adds public --calib_seq (int, default 512); uses it as max_sample_length for calibration dataloaders and when computing batch sizes; adjusts default datasets to include nemotron-post-training-dataset-v2 and emits combined warnings when multiple datasets are used; aligns calib_size to the dataset count.
  • Shell scripts, arg parsing and invocation (examples/llm_ptq/scripts/parser.sh, examples/llm_ptq/scripts/huggingface_example.sh): Adds calib_seq option parsing and DEFAULT_CALIB_SEQ=512 in parser.sh; initializes and echoes CALIB_SEQ; huggingface_example.sh conditionally appends --calib_seq=$CALIB_SEQ to the PTQ invocation when set.
  • Dataset utilities (modelopt/torch/utils/dataset_utils.py): Registers nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in SUPPORTED_DATASET_CONFIG with splits and a preprocess that joins sample["messages"] (see the sketch after this list); raises the memory-threshold logic in get_max_batch_size from 64 to 512 (fallback/limit adjustments).
  • Docs/changelog (CHANGELOG.rst): Documents the new Nemotron datasets and the calib_seq option for examples/llm_ptq.
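A rough sketch of what such a dataset registration could look like. The field names and split below are assumptions based on the walkthrough, not the exact SUPPORTED_DATASET_CONFIG schema:

```python
# Hypothetical shape of the new entries in
# modelopt/torch/utils/dataset_utils.py; the real schema may differ.
SUPPORTED_DATASET_CONFIG = {
    "nemotron-post-training-dataset-v2": {
        "dataset": "nvidia/Nemotron-Post-Training-Dataset-v2",
        "split": "train",  # assumed split name
        # Each sample carries a list of chat messages; the preprocess joins
        # their contents into a single calibration string.
        "preprocess": lambda sample: "\n".join(
            message["content"] for message in sample["messages"]
        ),
    },
}
```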

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Parser as parser.sh
  participant Example as huggingface_example.sh
  participant PTQ as hf_ptq.py
  participant DS as dataset_utils.py

  User->>Parser: provide options (may include calib_seq)
  Parser-->>Parser: parse, set CALIB_SEQ (default 512)
  Parser->>Example: export CALIB_SEQ
  Example->>PTQ: invoke hf_ptq.py (includes --calib_seq if set)
  PTQ->>PTQ: parse args (calib_seq) and fix calib_size to dataset count
  PTQ->>DS: get_dataset_dataloader(max_sample_length=calib_seq)
  DS-->>PTQ: dataloader (Nemotron datasets available)
  PTQ->>DS: get_max_batch_size(..., max_sample_length=calib_seq)
  DS-->>PTQ: batch size computed (memory thresholds up to 512)
  PTQ->>PTQ: run calibration / sparsification / quantization / export with calib_seq

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A hop through lines and token threads,
Calib set to five-twelve ahead.
Nemotron messages stitched in a row,
Scripts pass the number so pipelines know.
Bunny bounces—calibrate, then go! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title succinctly and accurately describes the primary change (using the Nemotron post-training dataset for calibration) and references the relevant ticket identifier.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch chenjiel/use_nemotron_dataset

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc222fb and 62f62f4.

📒 Files selected for processing (2)
  • CHANGELOG.rst (1 hunks)
  • examples/llm_ptq/hf_ptq.py (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (4)
examples/llm_ptq/hf_ptq.py (4)

299-307: Verify the performance impact of using two default datasets.

Defaulting to both cnn_dailymail and nemotron-post-training-dataset-v2 will increase calibration time compared to the previous single-dataset default. Additionally, the calib_size adjustment logic at lines 305-307 automatically extends or truncates calib_size to match the dataset count, which applies even when users explicitly provide datasets. For example, if a user specifies three datasets but only one calib_size value, the last value will be replicated for all remaining datasets.

Consider documenting this behavior change and the automatic adjustment logic to set clear expectations about calibration time and dataset handling.
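For illustration, a self-contained sketch of the replicate-or-truncate behavior described above (not the PR's exact code at lines 305-307):

```python
def align_calib_size(calib_size: list[int], num_datasets: int) -> list[int]:
    """Repeat the last value or truncate so the list matches the dataset count."""
    if len(calib_size) < num_datasets:
        calib_size = calib_size + [calib_size[-1]] * (num_datasets - len(calib_size))
    return calib_size[:num_datasets]

# Three datasets but one size given: the last value is replicated.
assert align_calib_size([512], 3) == [512, 512, 512]
# More sizes than datasets: the extras are dropped.
assert align_calib_size([256, 512, 128], 2) == [256, 512]
```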



358-358: LGTM!

Correctly threads args.calib_seq into the sparsity calibration dataloader, providing users explicit control over the maximum sample length for sparsification.



400-400: LGTM!

Appropriately uses args.calib_seq for batch size estimation, ensuring the calculated batch size reflects the specified maximum sequence length.
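To see why the sequence length belongs in this estimate, here is a toy sketch of the memory-budget idea. It is not the actual get_max_batch_size implementation, and all numbers are made up:

```python
def estimate_max_batch_size(free_mem_bytes: int, bytes_per_token: int,
                            max_sample_length: int, cap: int = 512) -> int:
    """Toy estimate: longer samples each need more activation memory, so
    fewer fit per batch; the cap mirrors the new 512 upper limit."""
    per_sample_bytes = bytes_per_token * max_sample_length
    return max(1, min(cap, free_mem_bytes // per_sample_bytes))

# Lengthening the samples shrinks the affordable batch size.
print(estimate_max_batch_size(2**30, 4096, 512))   # 512 (hits the cap)
print(estimate_max_batch_size(2**30, 4096, 4096))  # 64
```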



691-696: LGTM!

The new --calib_seq argument is well-defined with a clear help message and sensible default of 512.




@cjluo-nv force-pushed the chenjiel/use_nemotron_dataset branch from 86bcc39 to dc222fb on October 9, 2025 19:47.

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.37%. Comparing base (5b02483) to head (62f62f4).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/utils/dataset_utils.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #420   +/-   ##
=======================================
  Coverage   73.36%   73.37%           
=======================================
  Files         180      180           
  Lines       17919    17934   +15     
=======================================
+ Hits        13147    13159   +12     
- Misses       4772     4775    +3     

☔ View full report in Codecov by Sentry.

CHANGELOG.rst

- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to ``nemotron-post-training-dataset-v2`` if no dataset is specified.
- Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
Contributor:

General question: what is the difference between calib_size and calib_seq?

Collaborator (author):

calib_size is the number of calibration samples.
calib_seq is the maximum sequence length of each calibration sample.
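In other words (illustrative values only):

```python
calib_size = 512  # how many samples are drawn for calibration
calib_seq = 512   # the most tokens any single sample may contain
max_total_tokens = calib_size * calib_seq  # upper bound on calibration tokens
```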

Contributor:

nit:
@cjluo-nv should we be more verbose and use --calib_seq_len for clarity?

Collaborator (author):

ACK. Personally I prefer shorter flags.

CHANGELOG.rst Outdated
- Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to ``nemotron-post-training-dataset-v2`` if no dataset is specified.
@kevalmorabia97 (Collaborator) commented Oct 10, 2025

Suggested change
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to ``nemotron-post-training-dataset-v2`` if no dataset is specified.
- Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default changed from ``cnn_dailymail`` to ``nemotron-post-training-dataset-v2`` if no dataset is specified.

@cjluo-nv changed the title from "Use nemotron post training dataset for calibration" to "[OMNIML-2791] Use nemotron post training dataset for calibration" on Oct 13, 2025.
@cjluo-nv merged commit 32d168c into main on Oct 14, 2025 (30 of 32 checks passed).
@cjluo-nv deleted the chenjiel/use_nemotron_dataset branch on October 14, 2025 18:42.
