Commit 146d1d9

Authored by cjluo-nv (Chenjie Luo) and kevalmorabia97 (Keval Morabia)
Deprecate TRTLLM-build in examples (#297)
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Chenjie Luo <[email protected]>
Signed-off-by: Keval Morabia <[email protected]>
Co-authored-by: Chenjie Luo <[email protected]>
Co-authored-by: Keval Morabia <[email protected]>
1 parent 5db7169 commit 146d1d9

32 files changed · +269 −1717 lines changed

.github/CODEOWNERS

Lines changed: 0 additions & 1 deletion
@@ -50,6 +50,5 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
-/examples/vlm_eval @NVIDIA/modelopt-examples-vlm-codeowners
 /examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
 /examples/windows @NVIDIA/modelopt-windows-codeowners

CHANGELOG.rst

Lines changed: 7 additions & 3 deletions
@@ -5,12 +5,17 @@ Model Optimizer Changelog (Linux)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**
-- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strongly typing. Use ``engine_precision`` instead.

-**Bug Fixes**
+- Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strongly typing. Use ``engine_precision`` instead.
+- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
+- ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
+- ``int8_sq`` quantization format is deprecated from the ``examples/vlm_ptq`` with respect to the TensorRT-LLM's torch backend switch. Please refer to the previous releases if this quantization format is needed.
+- Deprecated ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.

 **New Features**
+
 - ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
+- Upgrade TensorRT-LLM dependency to 1.1.0rc2.

 0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^
@@ -23,7 +28,6 @@ Model Optimizer Changelog (Linux)
 **Bug Fixes**

 - Fix attention head ranking logic for pruning Megatron Core GPT models.
-- Upgrade TensorRT-LLM dependency to 1.1.0rc2.

 **New Features**

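For readers migrating off the removed `build` and `benchmark` tasks, the sketch below illustrates the replacement flow under the new task list; the model argument is a placeholder, and the exact `trtllm-bench` invocation is deliberately left to the TensorRT-LLM documentation rather than guessed here.

```sh
# The "quant" task replaces the removed "build" and "benchmark" tasks and
# produces a unified Hugging Face checkpoint.
scripts/huggingface_example.sh --model <HF model folder or model card> --quant fp8 --tasks quant

# For throughput/latency numbers, run trtllm-bench (shipped with TensorRT-LLM)
# against the exported checkpoint; see the TensorRT-LLM docs for its arguments.
```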

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | PyTorch                 | >=2.6                       |
 +-------------------------+-----------------------------+
-| TensorRT-LLM (Optional) | 1.0.0rc6                    |
+| TensorRT-LLM (Optional) | 1.1.0rc2.post2              |
 +-------------------------+-----------------------------+
 | ONNX Runtime (Optional) | 1.22                        |
 +-------------------------+-----------------------------+
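As a rough installation sketch matching the updated pin (the `nvidia-modelopt` name comes from the surrounding docs; the PyPI package name `tensorrt-llm` and the NVIDIA extra index URL are assumptions that may vary by environment):

```sh
pip install nvidia-modelopt

# Optional: TensorRT-LLM pinned to the pre-release version listed in the table.
# Package name and extra index URL are assumptions; adjust for your setup.
pip install tensorrt-llm==1.1.0rc2.post2 --extra-index-url https://pypi.nvidia.com
```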

examples/gpt-oss/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ datasets
 deepspeed
 kernels>=0.9.0
 peft>=0.17.0
-torch >= 2.8.0
+torch>2.7.1
 trackio
 transformers>=4.55.0
 trl>=0.21.0

examples/llm_eval/README.md

Lines changed: 3 additions & 3 deletions
@@ -93,7 +93,7 @@ If `trust_remote_code` needs to be true, please append the command with the `--t
 ### TensorRT-LLM

 ```sh
-python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<TRT LLM engine dir> --tasks <comma separated tasks> --batch_size <engine batch size>
+python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <engine batch size>
 ```

 ## MMLU
@@ -140,7 +140,7 @@ python mmlu.py --model_name causal --model_path <HF model folder or model card>
 ### Evaluate the TensorRT-LLM engine

 ```bash
-python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <built TensorRT-LLM folder>
+python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <Quantized checkpoint dir>
 ```

 ## MT-Bench
@@ -163,7 +163,7 @@ bash run_fastchat.sh -h <HF model folder or model card> --quant_cfg MODELOPT_QUA
 ### Evaluate the TensorRT-LLM engine

 ```bash
-bash run_fastchat.sh -h <HF model folder or model card> <built TensorRT-LLM folder>
+bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>
 ```

 ### Judging the responses
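A filled-in sketch of the updated lm_eval command, with a hypothetical tokenizer, quantized checkpoint directory, task list, and batch size substituted for the placeholders above:

```sh
# Hypothetical values for illustration only; substitute your own model,
# quantized checkpoint directory, lm_eval tasks, and batch size.
python lm_eval_tensorrt_llm.py --model trt-llm \
    --model_args tokenizer=meta-llama/Llama-3.1-8B-Instruct,engine_dir=./llama-3.1-8b-fp8-hf \
    --tasks mmlu,gsm8k --batch_size 8
```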

examples/llm_ptq/README.md

Lines changed: 4 additions & 4 deletions
@@ -203,7 +203,7 @@ scripts/huggingface_example.sh --type llama --model $HF_PATH --quant w4a8_awq,fp
 The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
 are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).

-The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `build,mmlu,benchmark,lm_eval,livecodebench` specified in the script [parser](./scripts/parser.sh). The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).
+The example scripts above also have an additional flag `--tasks`, where the actual tasks run in the script can be customized. The allowed tasks are `quant,mmlu,lm_eval,livecodebench` specified in the script [parser](./scripts/parser.sh). The tasks combo can be specified with a comma-separated task list. Some tasks like mmlu can take a long time to run. To run lm_eval tasks, please also specify the `--lm_eval_tasks` flag with comma separated lm_eval tasks [here](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks).

 > *If GPU out-of-memory error is reported running the scripts, please try editing the scripts and reducing the max batch size to save GPU memory.*

@@ -251,7 +251,7 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_

 > *If a GPU OOM error occurs during model quantization despite sufficient memory, setting the --use_seq_device_map flag can help. This enforces sequential device mapping, distributing the model across GPUs and utilizing up to 80% of each GPU's memory.*

-> *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*
+> *You can add `--low_memory_mode` to the command to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*

 #### Deepseek R1

@@ -301,15 +301,15 @@ with torch.inference_mode():
 ### Quantize and Export

 ```bash
-python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_fmt hf --export_path <quantized_ckpt_path> --trust_remote_code
+python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code
 ```

 ### Hugging Face framework [Script](./scripts/huggingface_example.sh)

 Alternatively, the framework script `huggingface_example.sh` also supports quantize and export:

 ```bash
-scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
+scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
 ```

 ### Deployment
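A filled-in sketch of the updated quantize-and-export command, using a hypothetical model card and output directory; note that `--export_fmt hf` is simply dropped because unified Hugging Face export is now the default:

```sh
# Hypothetical model card and output path, for illustration only.
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct \
    --qformat fp8 --export_path ./llama-3.1-8b-fp8-hf --trust_remote_code
```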

examples/llm_ptq/hf_ptq.py

Lines changed: 55 additions & 62 deletions
@@ -89,28 +89,20 @@ def auto_quantize(
     qformat_list = qformat.split(",")
     assert qformat_list, "No quantization formats provided"
     # Check if all provided quantization formats are supported
-    if args.export_fmt == "hf":
-        assert all(
-            qformat
-            in [
-                "fp8",
-                "int4_awq",
-                "nvfp4",
-                "nvfp4_awq",
-                "w4a8_awq",
-                "fp8_pb_wo",
-                "w4a8_mxfp4_fp8",
-                "nvfp4_mlp_only",
-            ]
-            for qformat in qformat_list
-        ), (
-            "One or more quantization formats provided are not supported for unified checkpoint export"
-        )
-    else:
-        assert all(
-            qformat in ["fp8", "int8_sq", "int4_awq", "w4a8_awq", "nvfp4", "nvfp4_awq"]
-            for qformat in qformat_list
-        ), "One or more quantization formats provided are not supported for tensorrt llm export"
+    assert all(
+        qformat
+        in [
+            "fp8",
+            "int4_awq",
+            "nvfp4",
+            "nvfp4_awq",
+            "w4a8_awq",
+            "fp8_pb_wo",
+            "w4a8_mxfp4_fp8",
+            "nvfp4_mlp_only",
+        ]
+        for qformat in qformat_list
+    ), "One or more quantization formats provided are not supported for unified checkpoint export"

     def loss_func(output, data):
         # For transformers AutoModelForCausalLM models, the outputs are wrapped in `CausalLMOutputWithPast`
@@ -219,27 +211,21 @@ def main(args):
         "Quantization supports only one quantization format."
     )

-    # Check arguments for unified_hf export format and set to default if unsupported arguments are provided
-    if args.export_fmt == "hf":
-        assert args.sparsity_fmt == "dense", (
-            f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
-        )
-
-        if not args.auto_quantize_bits:
-            assert (
-                args.qformat
-                in [
-                    "int4_awq",
-                    "fp8",
-                    "nvfp4",
-                    "nvfp4_awq",
-                    "w4a8_awq",
-                    "fp8_pb_wo",
-                    "w4a8_mxfp4_fp8",
-                    "nvfp4_mlp_only",
-                ]
-                or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
-            ), f"Quantization format {args.qformat} not supported for HF export path"
+    if not args.auto_quantize_bits:
+        assert (
+            args.qformat
+            in [
+                "int4_awq",
+                "fp8",
+                "nvfp4",
+                "nvfp4_awq",
+                "w4a8_awq",
+                "fp8_pb_wo",
+                "w4a8_mxfp4_fp8",
+                "nvfp4_mlp_only",
+            ]
+            or args.kv_cache_qformat in KV_QUANT_CFG_CHOICES
+        ), f"Quantization format {args.qformat} not supported for HF export path"

     # If low memory mode is enabled, we compress the model while loading the HF checkpoint.
     calibration_only = False
@@ -253,9 +239,6 @@ def main(args):
             attn_implementation=args.attn_implementation,
         )
     else:
-        assert args.export_fmt == "hf", (
-            "Low memory mode is only supported for exporting HF checkpoint."
-        )
         assert args.qformat in QUANT_CFG_CHOICES, (
             f"Quantization format is not supported for low memory mode. Supported formats: {QUANT_CFG_CHOICES.keys()}"
         )
@@ -600,34 +583,41 @@ def output_decode(generated_ids, input_shape):
         setattr(model.config, "architectures", full_model_config.architectures)

     start_time = time.time()
-    if args.export_fmt == "tensorrt_llm":
+    if (
+        model_type in ["t5", "bart", "whisper"]
+        or args.sparsity_fmt != "dense"
+        or "int8_sq" in args.qformat
+    ):
+        warnings.warn(
+            "Still exporting TensorRT-LLM checkpoints for models not supported by the TensorRT-LLM torch runtime."
+        )
+
         # Move meta tensor back to device before exporting.
         remove_hook_from_module(model, recurse=True)

-        dtype = None
-        if "w4a8_awq" in args.qformat:
-            # TensorRT-LLM w4a8 only support fp16 as the dtype.
-            dtype = torch.float16
-
-        # For Gemma2-27B, TRT-LLM only works with bfloat16 as the dtype.
-        if model_type == "gemma2":
-            dtype = torch.bfloat16
-
         export_tensorrt_llm_checkpoint(
             model,
             model_type,
-            dtype=dtype,
             export_dir=export_path,
             inference_tensor_parallel=args.inference_tensor_parallel,
             inference_pipeline_parallel=args.inference_pipeline_parallel,
         )
-    elif args.export_fmt == "hf":
+    else:
+        # Check arguments for unified_hf export format and set to default if unsupported arguments are provided
+        assert args.sparsity_fmt == "dense", (
+            f"Sparsity format {args.sparsity_fmt} not supported by unified export api."
+        )
+
+        if args.inference_tensor_parallel != 1 or args.inference_pipeline_parallel != 1:
+            warnings.warn(
+                "Unified HF export format does not specify inference tensor parallel or pipeline parallel. "
+                "They will be set at deployment time."
+            )
+
         export_hf_checkpoint(
             full_model,
             export_dir=export_path,
         )
-    else:
-        raise NotImplementedError(f"{args.export_fmt} not supported")

     # Restore default padding and export the tokenizer as well.
     if tokenizer is not None:
@@ -710,9 +700,9 @@ def output_decode(generated_ids, input_shape):
     parser.add_argument(
         "--export_fmt",
         required=False,
-        default="tensorrt_llm",
+        default="hf",
         choices=["tensorrt_llm", "hf"],
-        help=("Checkpoint export format"),
+        help="Deprecated. Please avoid using this argument.",
     )
     parser.add_argument(
         "--trust_remote_code",
@@ -767,6 +757,9 @@ def output_decode(generated_ids, input_shape):

     args = parser.parse_args()

+    if args.export_fmt != "hf":
+        warnings.warn("Deprecated. --export_fmt forced to hf.")
+
     args.dataset = args.dataset.split(",") if args.dataset else None
     args.calib_size = [int(num_sample) for num_sample in args.calib_size.split(",")]
     main(args)
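Based on the argparse change above, passing anything other than `hf` to the deprecated `--export_fmt` flag now only triggers a warning; the export path itself no longer depends on the flag. A minimal sketch with the README's placeholders:

```sh
# Emits "Deprecated. --export_fmt forced to hf." and still writes a unified
# Hugging Face checkpoint, unless the model type, sparsity format, or int8_sq
# qformat falls back to the TensorRT-LLM checkpoint path shown in the diff.
python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat nvfp4 \
    --export_fmt tensorrt_llm --export_path <quantized_ckpt_path>
```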
