Merged
1 change: 0 additions & 1 deletion .github/CODEOWNERS
@@ -50,6 +50,5 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
/examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
/examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
/examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
/examples/vlm_eval @NVIDIA/modelopt-examples-vlm-codeowners
/examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
/examples/windows @NVIDIA/modelopt-windows-codeowners
10 changes: 7 additions & 3 deletions CHANGELOG.rst
@@ -5,12 +5,17 @@ Model Optimizer Changelog (Linux)
^^^^^^^^^^^^^^^^^

**Deprecations**
- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strong typing. Use ``engine_precision`` instead.

**Bug Fixes**
- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strong typing. Use ``engine_precision`` instead.
- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Support for the ``build`` and ``benchmark`` tasks is removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
- The ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
- The ``int8_sq`` quantization format is deprecated in ``examples/vlm_ptq`` following the switch to TensorRT-LLM's torch backend. Please refer to previous releases if this quantization format is needed.
- Deprecated ``examples/vlm_eval`` as it depends on the deprecated TRT backend of TRT-LLM.

**New Features**

- ``high_precision_dtype`` defaults to fp16 in ONNX quantization, i.e., quantized output model weights are now FP16 by default.
- Upgrade TensorRT-LLM dependency to 1.1.0rc2.

0.35 (2025-09-04)
^^^^^^^^^^^^^^^^^
@@ -23,7 +28,6 @@ Model Optimizer Changelog (Linux)
**Bug Fixes**

- Fix attention head ranking logic for pruning Megatron Core GPT models.
- Upgrade TensorRT-LLM dependency to 1.1.0rc2.

**New Features**

2 changes: 1 addition & 1 deletion docs/source/getting_started/_installation_for_Linux.rst
@@ -18,7 +18,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
+-------------------------+-----------------------------+
| PyTorch | >=2.6 |
+-------------------------+-----------------------------+
| TensorRT-LLM (Optional) | 1.0.0rc6 |
| TensorRT-LLM (Optional) | 1.1.0rc2.post2 |
+-------------------------+-----------------------------+
| ONNX Runtime (Optional) | 1.22 |
+-------------------------+-----------------------------+
2 changes: 1 addition & 1 deletion examples/gpt-oss/requirements.txt
@@ -3,7 +3,7 @@ datasets
deepspeed
kernels>=0.9.0
peft>=0.17.0
torch >= 2.8.0
torch>2.7.1
trackio
transformers>=4.55.0
trl>=0.21.0
6 changes: 3 additions & 3 deletions examples/llm_eval/README.md
@@ -93,7 +93,7 @@ If `trust_remote_code` needs to be true, please append the command with the `--t
### TensorRT-LLM

```sh
python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<TRT LLM engine dir> --tasks <comma separated tasks> --batch_size <engine batch size>
python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <engine batch size>
```

## MMLU
@@ -140,7 +140,7 @@ python mmlu.py --model_name causal --model_path <HF model folder or model card>
### Evaluate the TensorRT-LLM engine

```bash
python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <built TensorRT-LLM folder>
python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <Quantized checkpoint dir>
```

## MT-Bench
@@ -163,7 +163,7 @@ bash run_fastchat.sh -h <HF model folder or model card> --quant_cfg MODELOPT_QUA
### Evaluate the TensorRT-LLM engine

```bash
bash run_fastchat.sh -h <HF model folder or model card> <built TensorRT-LLM folder>
bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>
```

### Judging the responses
8 changes: 4 additions & 4 deletions examples/llm_ptq/README.md
@@ -203,7 +203,7 @@ scripts/huggingface_example.sh --type llama --model $HF_PATH --quant w4a8_awq,fp
The above example performs `AutoQuantize`, where the layers less sensitive to quantization accuracy are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
are kept unquantized so that the effective number of bits is 4.8 (specified by `--auto_quantize_bits 4.8`).

The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `build,mmlu,benchmark,lm_eval,livecodebench` specified in the script [parser](./scripts/parser.sh). The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `quant,mmlu,lm_eval,livecodebench`, as specified in the script [parser](./scripts/parser.sh). The task combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma-separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).

> *If a GPU out-of-memory error is reported when running the scripts, please try editing the scripts and reducing the max batch size to save GPU memory.*

Expand Down Expand Up @@ -251,7 +251,7 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_

> *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*

> *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
> *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*

⚠️ Potential issue

🧩 Analysis chain

Ensure --low_memory_mode is implemented and gated as documented.

README states it’s only supported for FP8/NVFP4 with max calibration. Verify hf_ptq.py enforces this and that weight compression happens before calibration.


🏁 Script executed:

#!/usr/bin/env bash
# Confirm flag definition and usage constraints
rg -n -C2 --type=py 'low_memory_mode|--low_memory_mode' examples/llm_ptq/hf_ptq.py
rg -n -C3 --type=py 'calibrate\(|mtq\.calibrate|init_quantized_weights|NVFP4|FP8' examples/llm_ptq/hf_ptq.py

Length of output: 2446


Enforce --low_memory_mode gating and compress weights before calibration

README says low_memory_mode is only supported for FP8 / NVFP4 with max calibration. The flag is defined and referenced but there is no explicit enforcement or obvious compression-before-calibration:

  • examples/llm_ptq/hf_ptq.py:730–731 (parser.add_argument for --low_memory_mode) and 232 (if not args.low_memory_mode).
  • Calibration calls: 138–142 (mtq.calibrate(..., algorithm="max")) and 186–189 (mtq.calibrate / mtq.quantize branch).
  • init_quantized_weights context: 252–255 (used around model loading/calibration).

Action items:

  • Validate and error out if args.low_memory_mode is set while the chosen quant config is not FP8 or NVFP4, or the calibration algorithm is not "max" (see the sketch after this list).
  • Ensure weight compression path for low_memory_mode runs before any mtq.calibrate / mtq.quantize calls (move compression into the HF checkpoint load path used when args.low_memory_mode is True).
  • Add a small test or assertion exercising low_memory_mode to fail fast on unsupported configs.
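A minimal sketch of that gating, using the existing `args.low_memory_mode` / `args.qformat` names from hf_ptq.py; the `calib_algorithm` attribute is a hypothetical stand-in for however the script actually tracks the calibration algorithm:

```python
# Hedged sketch only: fail fast when --low_memory_mode is combined with an
# unsupported quantization format or calibration algorithm.
SUPPORTED_LOW_MEMORY_QFORMATS = {"fp8", "nvfp4"}  # per the README constraint


def validate_low_memory_mode(args) -> None:
    if not args.low_memory_mode:
        return
    if args.qformat not in SUPPORTED_LOW_MEMORY_QFORMATS:
        raise ValueError(
            f"--low_memory_mode supports only {sorted(SUPPORTED_LOW_MEMORY_QFORMATS)}, "
            f"got --qformat={args.qformat}"
        )
    # "calib_algorithm" is a hypothetical attribute; adjust to however the
    # script actually selects the calibration algorithm.
    if getattr(args, "calib_algorithm", "max") != "max":
        raise ValueError("--low_memory_mode requires max calibration")
```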

Contributor

Quick question: I remember we had some accuracy degradation with --low_memory_mode - is that correct?


#### Deepseek R1

@@ -301,15 +301,15 @@ with torch.inference_mode():
### Quantize and Export

```bash
python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_fmt hf --export_path <quantized_ckpt_path> --trust_remote_code
python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code
```
Comment on lines 301 to 305

⚠️ Potential issue

🧩 Analysis chain

Command examples: confirm defaults now export unified HF without extra flags.

Looks good; verify hf_ptq.py and huggingface_example.sh default to unified HF format and no longer accept --export_fmt.

Also applies to: 311-313


🏁 Script executed:

#!/usr/bin/env bash
# Check CLI surfaces for --export_fmt remnants and default export path
rg -n --type=sh --type=py 'export_fmt|export_hf_checkpoint|export_tensorrt_llm_checkpoint' examples/llm_ptq/{hf_ptq.py,scripts/huggingface_example.sh}

Length of output: 474


Remove deprecated --export_fmt from hf_ptq CLI and enforce HF-only export

examples/llm_ptq/hf_ptq.py still defines/uses --export_fmt (arg at L691; forced/warned at L750–751) and has related references at L40–41 and call sites at L594/L607 — remove the CLI flag and deprecation handling, and update docs/examples accordingly (examples/llm_ptq/scripts/huggingface_example.sh: no matches).

🤖 Prompt for AI Agents
In examples/llm_ptq/README.md around lines 301–305, update the docs and code to
remove the deprecated --export_fmt CLI flag and enforce HuggingFace-only export:
remove the --export_fmt argument definition in examples/llm_ptq/hf_ptq.py
(around L691), delete any deprecation warnings/handling (around L750–751) and
references at the top of the file (L40–41), update or remove call sites that
pass --export_fmt (L594 and L607), and adjust
examples/llm_ptq/scripts/huggingface_example.sh and README usage lines to call
hf_ptq without --export_fmt and to document only HF export (e.g., keep
--export_path and --trust_remote_code). Ensure any code paths or conditionals
that previously branched on export_fmt are simplified to HF-only export and
remove related tests or grep matches.
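For orientation, a minimal sketch of the HF-only export path that prompt describes. This is a hypothetical simplification, not the merged behavior: the diff below keeps `--export_fmt` as a deprecated, ignored flag that only emits a warning.

```python
# Hypothetical HF-only export path once --export_fmt is removed entirely.
# The import path is assumed from ModelOpt's public export API.
from modelopt.torch.export import export_hf_checkpoint


def export_model(full_model, export_path: str) -> None:
    # The unified Hugging Face checkpoint is the only export format, so there
    # is no branching on an export_fmt argument.
    export_hf_checkpoint(full_model, export_dir=export_path)
```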


### Hugging Face framework [Script](./scripts/huggingface_example.sh)

Alternatively, the framework script `huggingface_example.sh` also supports quantization and export:

```bash
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
```

### Deployment
117 changes: 55 additions & 62 deletions examples/llm_ptq/hf_ptq.py
@@ -89,28 +89,20 @@ def auto_quantize(
qformat_list = qformat.split(",")
assert qformat_list, "No quantization formats provided"
# Check if all provided quantization formats are supported
if args.export_fmt == "hf":
assert all(
qformat
in [
"fp8",
"int4_awq",
"nvfp4",
"nvfp4_awq",
"w4a8_awq",
"fp8_pb_wo",
"w4a8_mxfp4_fp8",
"nvfp4_mlp_only",
]
for qformat in qformat_list
), (
"One or more quantization formats provided are not supported for unified checkpoint export"
)
else:
assert all(
qformat in ["fp8", "int8_sq", "int4_awq", "w4a8_awq", "nvfp4", "nvfp4_awq"]
for qformat in qformat_list
), "One or more quantization formats provided are not supported for tensorrt llm export"
assert all(
qformat
in [

do you think we can pull this list of supported qformats as a variable and reuse in other places (in the auto quantize section)?

Collaborator Author

ACK. This PR is pretty large. I hope we can move the improvements to a follow-up.
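A possible shape for that follow-up, sketched here with a hypothetical module-level constant (the name matches the `ALLOWED_UNIFIED_HF_QFORMATS` suggestion in the review comment further down):

```python
# Hypothetical module-level constant for hf_ptq.py so the supported formats
# are defined once and reused by auto_quantize() and the main() export checks.
ALLOWED_UNIFIED_HF_QFORMATS = frozenset(
    {
        "fp8",
        "int4_awq",
        "nvfp4",
        "nvfp4_awq",
        "w4a8_awq",
        "fp8_pb_wo",
        "w4a8_mxfp4_fp8",
        "nvfp4_mlp_only",
    }
)


def check_unified_hf_qformats(qformat_list: list[str]) -> None:
    unsupported = [q for q in qformat_list if q not in ALLOWED_UNIFIED_HF_QFORMATS]
    assert not unsupported, (
        f"Quantization formats not supported for unified checkpoint export: {unsupported}"
    )
```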

"fp8",
"int4_awq",
"nvfp4",
"nvfp4_awq",
"w4a8_awq",
"fp8_pb_wo",
"w4a8_mxfp4_fp8",
"nvfp4_mlp_only",
]
for qformat in qformat_list
), "One or more quantization formats provided are not supported for unified checkpoint export"

def loss_func(output, data):
# For transformers AutoModelForCausalLM models, the outputs are wrapped in `CausalLMOutputWithPast`
@@ -219,27 +211,21 @@ def main(args):
"Quantization supports only one quantization format."
)

# Check arguments for unified_hf export format and set to default if unsupported arguments are provided
if args.export_fmt == "hf":
assert args.sparsity_fmt == "dense", (
f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
)

if not args.auto_quantize_bits:
assert (
args.qformat
in [
"int4_awq",
"fp8",
"nvfp4",
"nvfp4_awq",
"w4a8_awq",
"fp8_pb_wo",
"w4a8_mxfp4_fp8",
"nvfp4_mlp_only",
]
or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
), f"Quantization format {args.qformat} not supported for HF export path"
if not args.auto_quantize_bits:
assert (
args.qformat
in [
"int4_awq",
"fp8",
"nvfp4",
"nvfp4_awq",
"w4a8_awq",
"fp8_pb_wo",
"w4a8_mxfp4_fp8",
"nvfp4_mlp_only",
]
or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
), f"Quantization format {args.qformat} not supported for HF export path"

Comment on lines +214 to 229

⚠️ Potential issue

Assertion is ineffective; `or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES` is always True.

This currently lets unsupported qformats through. Tighten the check.

-    if not args.auto_quantize_bits:
-        assert (
-            args.qformat
-            in [
-                "int4_awq",
-                "fp8",
-                "nvfp4",
-                "nvfp4_awq",
-                "w4a8_awq",
-                "fp8_pb_wo",
-                "w4a8_mxfp4_fp8",
-                "nvfp4_mlp_only",
-            ]
-            or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-        ), f"Quantization format {args.qformat} not supported for HF export path"
+    if not args.auto_quantize_bits:
+        assert args.qformat in ALLOWED_UNIFIED_HF_QFORMATS, (
+            f"Quantization format {args.qformat} not supported for HF export path"
+        )

Note: If you intended to allow “KV-cache-only” quant, handle that as a separate branch instead of weakening this assert.

Committable suggestion skipped: line range outside the PR's diff.
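If KV-cache-only quantization is intended, a hedged sketch of the separate branch that note alludes to; the `"none"` sentinel and the helper name are assumptions, and `allowed_weight_qformats` / `kv_quant_cfg_choices` stand in for the constants used in hf_ptq.py:

```python
# Hypothetical tightened check: validate the weight qformat and the KV-cache
# qformat separately instead of OR-ing them into one always-true assertion.
def check_export_qformats(args, allowed_weight_qformats, kv_quant_cfg_choices) -> None:
    if args.auto_quantize_bits:
        return  # the auto-quantize path performs its own format checks
    if args.qformat in (None, "none"):  # assumed "KV-cache-only" sentinel
        assert args.kv_cache_qformat in kv_quant_cfg_choices, (
            f"KV cache quantization format {args.kv_cache_qformat} is not supported"
        )
    else:
        assert args.qformat in allowed_weight_qformats, (
            f"Quantization format {args.qformat} not supported for HF export path"
        )
```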

# If low memory mode is enabled, we compress the model while loading the HF checkpoint.
calibration_only = False
@@ -253,9 +239,6 @@
attn_implementation=args.attn_implementation,
)
else:
assert args.export_fmt == "hf", (
"Low memory mode is only supported for exporting HF checkpoint."
)
assert args.qformat in QUANT_CFG_CHOICES, (
f"Quantization format is not supported for low memory mode. Supported formats: {QUANT_CFG_CHOICES.keys()}"
)
@@ -600,34 +583,41 @@ def output_decode(generated_ids, input_shape):
setattr(model.config, "architectures", full_model_config.architectures)

start_time = time.time()
if args.export_fmt == "tensorrt_llm":
if (
model_type in ["t5", "bart", "whisper"]
or args.sparsity_fmt != "dense"
or "int8_sq" in args.qformat
):
warnings.warn(
"Still exporting TensorRT-LLM checkpoints for models not supported by the TensorRT-LLM torch runtime."
)

# Move meta tensor back to device before exporting.
remove_hook_from_module(model, recurse=True)

dtype = None
if "w4a8_awq" in args.qformat:
# TensorRT-LLM w4a8 only support fp16 as the dtype.
dtype = torch.float16

# For Gemma2-27B, TRT-LLM only works with bfloat16 as the dtype.
if model_type == "gemma2":
dtype = torch.bfloat16

export_tensorrt_llm_checkpoint(
model,
model_type,
dtype=dtype,
export_dir=export_path,
inference_tensor_parallel=args.inference_tensor_parallel,
inference_pipeline_parallel=args.inference_pipeline_parallel,
)
elif args.export_fmt == "hf":
else:
# Check arguments for unified_hf export format and set to default if unsupported arguments are provided
assert args.sparsity_fmt == "dense", (
f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
)

if args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1:
warnings.warn(
"Unified HF export format does not specify inference tensor parallel or pipeline parallel. "
"They will be set at deployment time."
)

export_hf_checkpoint(
full_model,
export_dir=export_path,
)
else:
raise NotImplementedError(f"{args.export_fmt} not supported")

# Restore default padding and export the tokenizer as well.
if tokenizer is not None:
@@ -710,9 +700,9 @@ def output_decode(generated_ids, input_shape):
parser.add_argument(
"--export_fmt",
required=False,
default="tensorrt_llm",
default="hf",
choices=["tensorrt_llm", "hf"],
help=("Checkpoint export format"),
help="Deprecated. Please avoid using this argument.",
)
parser.add_argument(
"--trust_remote_code",
@@ -767,6 +757,9 @@

args = parser.parse_args()

if args.export_fmt != "hf":
warnings.warn("Deprecated. --export_fmt forced to hf.")

args.dataset = args.dataset.split(",") if args.dataset else None
args.calib_size = [int(num_sample) for num_sample in args.calib_size.split(",")]
main(args)