Commit 7cbe5b9

Merge branch 'main' into jennifchen/cp_amax_sync
2 parents: 264adbb + b4d6ced

22 files changed: +297 additions, -288 deletions

.gitlab/tests.yml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ unit:
 ##### GPU Tests #####
 .multi-gpu-tests-default:
   extends: .tests-default
-  timeout: 60m
+  timeout: 90m
   image: nvcr.io/nvidia/pytorch:25.06-py3
   variables:
     GIT_DEPTH: 1000 # For correct version for tests/gpu/torch/quantization/plugins/test_megatron.py

CHANGELOG.rst

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,7 @@ Model Optimizer Changelog (Linux)
 
 - Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM or TensorRT docker image directly or refer to the `installation guide <https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html>`_ for more details.
 - Deprecated ``quantize_mode`` argument in ``examples/onnx_ptq/evaluate.py`` to support strongly typing. Use ``engine_precision`` instead.
-- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. For performance evaluation, please use ``trtllm-bench`` directly.
+- Deprecated TRT-LLM's TRT backend in ``examples/llm_ptq`` and ``examples/vlm_ptq``. Tasks ``build`` and ``benchmark`` support are removed and replaced with ``quant``. ``engine_dir`` is replaced with ``checkpoint_dir`` in ``examples/llm_ptq`` and ``examples/vlm_ptq``. For performance evaluation, please use ``trtllm-bench`` directly.
 - ``--export_fmt`` flag in ``examples/llm_ptq`` is removed. By default we export to the unified Hugging Face checkpoint format.
 - Deprecated ``examples/vlm_eval`` as it depends on the deprecated TRT-LLM's TRT backend.
 
@@ -17,6 +17,7 @@ Model Optimizer Changelog (Linux)
 - ``high_precision_dtype`` default to fp16 in ONNX quantization, i.e. quantized output model weights are now FP16 by default.
 - Upgrade TensorRT-LLM dependency to 1.1.0rc2.
 - Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in ``examples/vlm_ptq``.
+- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
 - Add Minitron pruning example for Megatron-LM framework. See ``examples/megatron-lm`` for more details.
 
 0.35 (2025-09-04)

docs/source/guides/3_pruning.rst

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Pruning
 
 .. tip::
 
-    Checkout `Llama 3.1 NeMo Minitron Pruning <https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation>`_ and
+    Checkout `Qwen 3 NeMo Minitron Pruning & Distillation <https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation>`_ and
     `ResNet20 on CIFAR-10 Notebook <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/pruning/cifar_resnet.ipynb>`_
     for an end-to-end example of pruning.

examples/llm_distill/README.md

Lines changed: 1 addition & 1 deletion
@@ -144,7 +144,7 @@ Loss balancers:
 
 Checkout the stand-alone distillation script in the [NVIDIA NeMo repository](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).
 
-You can also look at the tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework.
+You can also look at the NeMo tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/qwen/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Qwen 3 8B step-by-step in NeMo framework. Hugging Face models can also be converted to NeMo format and used subsequently as shown in the tutorial.
 
 ## Knowledge Distillation (KD) for HuggingFace Models

examples/llm_eval/README.md

Lines changed: 4 additions & 4 deletions
@@ -93,7 +93,7 @@ If `trust_remote_code` needs to be true, please append the command with the `--t
 ### TensorRT-LLM
 
 ```sh
-python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,engine_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <engine batch size>
+python lm_eval_tensorrt_llm.py --model trt-llm --model_args tokenizer=<HF model folder>,checkpoint_dir=<Quantized checkpoint dir> --tasks <comma separated tasks> --batch_size <max batch size>
 ```
 
 ## MMLU
@@ -137,10 +137,10 @@ python mmlu.py --model_name causal --model_path <HF model folder or model card>
 python mmlu.py --model_name causal --model_path <HF model folder or model card> --quant_cfg $MODELOPT_QUANT_CFG_TO_SEARCH --auto_quantize_bits $EFFECTIVE_BITS --batch_size 4
 ```
 
-### Evaluate the TensorRT-LLM engine
+### Evaluate with TensorRT-LLM
 
 ```bash
-python mmlu.py --model_name causal --model_path <HF model folder or model card> --engine_dir <Quantized checkpoint dir>
+python mmlu.py --model_name causal --model_path <HF model folder or model card> --checkpoint_dir <Quantized checkpoint dir>
 ```
 
 ## MT-Bench
@@ -160,7 +160,7 @@ bash run_fastchat.sh -h <HF model folder or model card>
 bash run_fastchat.sh -h <HF model folder or model card> --quant_cfg MODELOPT_QUANT_CFG
 ```
 
-### Evaluate the TensorRT-LLM engine
+### Evaluate with TensorRT-LLM
 
 ```bash
 bash run_fastchat.sh -h <HF model folder or model card> <Quantized checkpoint dir>

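The `--model_args` value in the lm-eval command above is a comma-separated list of `key=value` pairs that lm-evaluation-harness forwards as keyword arguments to the model adapter's constructor, which is why the key rename from `engine_dir` to `checkpoint_dir` has to match the updated `TRTLLM.__init__` signature in `lm_eval_tensorrt_llm.py` below. A minimal sketch of that mapping with hypothetical paths (`parse_model_args` is an illustrative stand-in, not a library API):

```python
# Illustrative only: lm-evaluation-harness performs this parsing internally.
def parse_model_args(model_args: str) -> dict[str, str]:
    """Turn 'tokenizer=...,checkpoint_dir=...' into constructor kwargs."""
    return dict(pair.split("=", 1) for pair in model_args.split(",") if pair)


kwargs = parse_model_args("tokenizer=/models/llama3-8b,checkpoint_dir=/ckpts/llama_fp8")
assert kwargs["checkpoint_dir"] == "/ckpts/llama_fp8"  # ends up as TRTLLM(checkpoint_dir=...)
```
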
examples/llm_eval/gen_model_answer.py

Lines changed: 19 additions & 23 deletions
@@ -118,7 +118,7 @@ def run_eval(
     max_gpu_memory,
     dtype,
     revision,
-    engine_dir,
+    checkpoint_dir,
     nim_model,
     args,
 ):
@@ -150,7 +150,7 @@ def run_eval(
             revision=revision,
             top_p=top_p,
             temperature=temperature,
-            engine_dir=engine_dir,
+            checkpoint_dir=checkpoint_dir,
             nim_model=nim_model,
         )
         for i in range(0, len(questions), chunk_size)
@@ -174,25 +174,22 @@ def get_model_answers(
     revision,
     top_p=None,
     temperature=None,
-    engine_dir=None,
+    checkpoint_dir=None,
     nim_model=None,
 ):
     # Model Optimizer modification
-    if engine_dir:
-        tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
-        if engine_dir:
-            # get model type
-            last_part = os.path.basename(engine_dir)
-            model_type = last_part.split("_")[0]
-            # Some models require to set pad_token and eos_token based on external config (e.g., qwen)
-            if model_type == "qwen":
-                tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
-                tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)
-
-            assert LLM is not None, "tensorrt_llm APIs could not be imported."
-            model = LLM(engine_dir, tokenizer=tokenizer)
-        else:
-            raise ValueError("engine_dir is required for TensorRT LLM inference.")
+    tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
+    if checkpoint_dir:
+        # get model type
+        last_part = os.path.basename(checkpoint_dir)
+        model_type = last_part.split("_")[0]
+        # Some models require to set pad_token and eos_token based on external config (e.g., qwen)
+        if model_type == "qwen":
+            tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
+            tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)
+
+        assert LLM is not None, "tensorrt_llm APIs could not be imported."
+        model = LLM(checkpoint_dir, tokenizer=tokenizer)
     elif not nim_model:
         model, _ = load_model(
             model_path,
@@ -205,7 +202,6 @@ def get_model_answers(
             cpu_offloading=False,
             debug=False,
         )
-        tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
         if args.quant_cfg:
             quantize_model(
                 model,
@@ -259,7 +255,7 @@ def get_model_answers(
 
                 # some models may error out when generating long outputs
                 try:
-                    if not engine_dir:
+                    if not checkpoint_dir:
                         output_ids = model.generate(
                             torch.as_tensor(input_ids).cuda(),
                             do_sample=do_sample,
@@ -427,9 +423,9 @@ def reorg_answer_file(answer_file):
         help="The model revision to load.",
     )
    parser.add_argument(
-        "--engine-dir",
+        "--checkpoint-dir",
        type=str,
-        help="The path to the TensorRT LLM engine directory.",
+        help="The path to the model checkpoint directory.",
    )
    parser.add_argument(
        "--nim-model",
@@ -502,7 +498,7 @@ def reorg_answer_file(answer_file):
        max_gpu_memory=args.max_gpu_memory,
        dtype=str_to_torch_dtype(args.dtype),
        revision=args.revision,
-        engine_dir=args.engine_dir,
+        checkpoint_dir=args.checkpoint_dir,
        nim_model=args.nim_model,
        args=args,
    )

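Note that both the old and new code infer the model type from the last path component of the directory, so the quantized TRT-LLM checkpoint directory is still expected to follow a `<model_type>_<suffix>` naming convention; the refactor only hoists the tokenizer loading out of the branch and keys the TRT-LLM path on `checkpoint_dir`. A small sketch of that convention, using a hypothetical directory name:

```python
import os

# Hypothetical quantized checkpoint directory named <model_type>_<suffix>
checkpoint_dir = "/ckpts/qwen_fp8_tp1"

model_type = os.path.basename(checkpoint_dir).split("_")[0]
assert model_type == "qwen"
# For qwen-style checkpoints the script then pins pad/eos to token id 151643
# before handing the tokenizer to tensorrt_llm's LLM(checkpoint_dir, tokenizer=...).
```
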
examples/llm_eval/lm_eval_tensorrt_llm.py

Lines changed: 4 additions & 4 deletions
@@ -42,7 +42,7 @@ class TRTLLM(TemplateAPI):
     def __init__(
         self,
         tokenizer: str,
-        engine_dir: str,
+        checkpoint_dir: str,
         batch_size: int = 1,
         **kwargs,
     ):
@@ -56,11 +56,11 @@ def __init__(
         if self.tokenizer.pad_token_id is None:
             self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
 
-        assert isinstance(engine_dir, str)
+        assert isinstance(checkpoint_dir, str)
 
-        self.llm = LLM(checkpoint_dir=engine_dir, tokenizer=self.tokenizer)
+        self.llm = LLM(checkpoint_dir=checkpoint_dir, tokenizer=self.tokenizer)
         self.max_length = self.llm.max_seq_len - 1
-        logger.info("Loaded TRT-LLM engine")
+        logger.info("Loaded TRT-LLM")
 
     def model_call(
         self,

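With the rename, constructing the adapter directly mirrors the CLI `--model_args` shown in the README above. A minimal sketch, assuming TensorRT-LLM is installed and that the tokenizer folder and quantized checkpoint directory below (both placeholders) exist:

```python
from lm_eval_tensorrt_llm import TRTLLM

# Placeholder paths: an HF tokenizer folder and a quantized TRT-LLM checkpoint directory.
llm = TRTLLM(
    tokenizer="/models/Meta-Llama-3-8B-Instruct",
    checkpoint_dir="/ckpts/llama_fp8",
    batch_size=8,
)
print(llm.max_length)  # set in __init__ from the loaded model's max_seq_len - 1
```
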
examples/llm_eval/mmlu.py

Lines changed: 5 additions & 3 deletions
@@ -252,9 +252,9 @@ def main(
     mto.enable_huggingface_checkpointing()
     model_path = kwargs["model_path"]
     tokenizer = get_tokenizer(model_path, trust_remote_code=kwargs.get("trust_remote_code", False))
-    if kwargs.get("engine_dir"):
+    if kwargs.get("checkpoint_dir"):
         # get model type
-        last_part = os.path.basename(kwargs["engine_dir"])
+        last_part = os.path.basename(kwargs["checkpoint_dir"])
         model_type = last_part.split("_")[0]
         # Some models require to set pad_token and eos_token based on external config (e.g., qwen)
         if model_type == "qwen":
@@ -264,7 +264,9 @@ def main(
         assert LLM is not None, "tensorrt_llm APIs could not be imported."
         medusa_choices = kwargs.get("medusa_choices")
         model = LLM(
-            checkpoint_dir=kwargs["engine_dir"], tokenizer=tokenizer, medusa_choices=medusa_choices
+            checkpoint_dir=kwargs["checkpoint_dir"],
+            tokenizer=tokenizer,
+            medusa_choices=medusa_choices,
         )
     else:
         model = select_model(

examples/llm_eval/run_fastchat.sh

Lines changed: 10 additions & 10 deletions
@@ -20,18 +20,18 @@
 # If you are using NIM, ensure that you export the NIM API key using:
 # export OPENAI_API_KEY=<NIM_API_KEY>
 #
-# Usage: bash run_fastchat.sh -h <HF model folder or model card> -e <engine_dir> -n <NIM model model card>
+# Usage: bash run_fastchat.sh -h <HF model folder or model card> -e <checkpoint_dir> -n <NIM model model card>
 # model_name: The HuggingFace handle or folder of the model to evaluate.
-# engine_dir: The directory where the TRT-LLM engine is stored.
+# checkpoint_dir: The directory where the checkpoint is stored.
 # nim_model_name: The handle of the NIM model to be used for evaluation.
 #
 # Example commands:
 #
 # Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model:
 # bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct
 #
-# Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model with TRT-LLM engine:
-# bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct -e /path/to/engine_dir
+# Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model with TRT-LLM:
+# bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct -e /path/to/checkpoint_dir
 #
 # Evaluate "meta-llama/Meta-Llama-3-8B-Instruct" HF model with NIM:
 # bash run_fastchat.sh -h meta-llama/Meta-Llama-3-8B-Instruct -n meta-llama/Meta-Llama-3-8B-Instruct
@@ -41,7 +41,7 @@ set -e
 set -x
 
 hf_model_name=""
-engine_dir=""
+checkpoint_dir=""
 nim_model_name=""
 answer_file=""
 quant_cfg=""
@@ -56,9 +56,9 @@ while [[ "$1" != "" ]]; do
             shift
             hf_model_name=$1
             ;;
-        -e | --engine_dir )
+        -e | --checkpoint_dir )
             shift
-            engine_dir=$1
+            checkpoint_dir=$1
             ;;
         -n | --nim_model_name )
             shift
@@ -96,8 +96,8 @@ if [ "$hf_model_name" == "" ]; then
     exit 1
 fi
 
-if [ "$engine_dir" != "" ]; then
-    engine_dir=" --engine-dir $engine_dir "
+if [ "$checkpoint_dir" != "" ]; then
+    checkpoint_dir=" --checkpoint-dir $checkpoint_dir "
 fi
 
 if [ "$nim_model_name" != "" ]; then
@@ -143,7 +143,7 @@ PYTHONPATH=FastChat:$PYTHONPATH python gen_model_answer.py \
     --model-id $hf_model_name \
     --temperature 0.0001 \
     --top-p 0.0001 \
-    $engine_dir \
+    $checkpoint_dir \
     $nim_model_name \
     $answer_file \
     $quant_args

examples/llm_ptq/example_utils.py

Lines changed: 0 additions & 10 deletions
@@ -36,16 +36,6 @@ def is_speculative(hf_config):
     )
 
 
-def get_mode_type_from_engine_dir(engine_dir_str):
-    # Split the path by '/' and get the last part
-    last_part = os.path.basename(engine_dir_str)
-
-    # Split the last part by '_' and get the first segment
-    model_type = last_part.split("_")[0]
-
-    return model_type
-
-
 def get_tokenizer(ckpt_path, trust_remote_code=False, **kwargs):
     print(f"Initializing tokenizer from {ckpt_path}")
5141
