
Commit 63d1b53

Merge branch 'main' into kaix/fsdp_fix

2 parents 49340c8 + bc54694, commit 63d1b53


42 files changed: +2922 / -157 lines

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
 modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
 modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
+modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners

.gitlab/tests.yml

Lines changed: 1 addition & 9 deletions
@@ -54,20 +54,12 @@ example-torch:
   timeout: 30m
   parallel:
     matrix:
-      - EXAMPLE: [llm_distill, llm_sparsity, speculative_decoding]
+      - EXAMPLE: [llm_distill, llm_qat, llm_sparsity, speculative_decoding]
   script:
     - pip install ".[hf,dev-test]"
     - find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
     - pytest -s tests/examples/$EXAMPLE

-# TODO: Fix llm_qat test hang in GitLab CI
-example-failing:
-  extends: example-torch
-  allow_failure: true
-  parallel:
-    matrix:
-      - EXAMPLE: [llm_qat]
-
 example-trtllm:
   extends: example-torch
   timeout: 60m

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Model Optimizer Changelog (Linux)
 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.

 0.37 (2025-09-xx)
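For context on the LoRA entry above, a minimal usage sketch of the new ``modelopt.torch.peft`` submodule. Only the ``update_model(model, LORA_CFG)`` entry point comes from the changelog; the config keys and the Megatron-Core model setup are illustrative placeholders, not the shipped defaults.

```python
# Minimal sketch of the new LoRA mode named in the changelog entry above.
# Only modelopt.torch.peft.update_model(model, LORA_CFG) is confirmed by the
# changelog; the config contents below are hypothetical placeholders.
import modelopt.torch.peft as mtp

# Hypothetical LoRA config: adapter rank/scale for the modules to adapt.
LORA_CFG = {
    "adapter_type": "lora",  # assumed key names, for illustration only
    "rank": 32,
    "scale": 1.0,
}


def add_lora_adapters(mcore_gpt_model):
    """Attach LoRA adapters in place; the base weights stay frozen."""
    return mtp.update_model(mcore_gpt_model, LORA_CFG)
```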

README.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 5 additions & 2 deletions
@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.


+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================

-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
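For reference, a minimal sketch of the unified HF export path that the new deprecation notice points to, assuming a quantized Hugging Face model is already in memory; the exact keyword arguments of ``export_hf_checkpoint`` should be verified against the ModelOpt export docs.

```python
# Minimal sketch of the unified HF export flow referenced in the notice above.
# Assumes `model` is a quantized Hugging Face causal LM already produced by
# modelopt.torch.quantization; the export_dir value is a placeholder.
from modelopt.torch.export import export_hf_checkpoint


def export_for_serving(model, export_dir="exported_model"):
    # Writes an HF-style checkpoint that TensorRT-LLM, vLLM, or SGLang can load.
    export_hf_checkpoint(model, export_dir=export_dir)
```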

docs/source/guides/7_nas.rst

Lines changed: 9 additions & 0 deletions
@@ -635,3 +635,12 @@ The difference between NAS and pruning is summarized below.
   increased training time.
 - May provide similar performance to NAS in particular applications, however, usually exhibits
   worse performance due to the limited search space and training time.
+
+
+[Advanced] Adding a new NAS/Prune Algorithm
+===========================================
+
+* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
+  for adding a new NAS algorithm.
+* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
+  for an actual example of adding the Minitron Pruning algorithm.

examples/llm_distill/main.py

Lines changed: 32 additions & 5 deletions
@@ -13,10 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import logging
 import os
 from dataclasses import dataclass

+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+
 import datasets
 import torch
 import torch.distributed
@@ -29,10 +30,7 @@
 import modelopt.torch.opt as mto
 from modelopt.torch.distill.plugins.huggingface import KDTrainer, LMLogitsLoss

-os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
-
-logger = get_logger(__name__)
-logging.basicConfig(level=logging.INFO)
+logger = get_logger(__name__, log_level="INFO")


 @dataclass
@@ -69,6 +67,29 @@ class KDSFTTrainer(SFTTrainer, KDTrainer):
     pass


+def _save_model_fsdp_compat(
+    self,
+    output_dir: str | None = None,
+    _internal_call: bool = False,
+    *args,
+    **kwargs,
+):
+    output_dir = output_dir or self.args.output_dir
+    model = self.accelerator.unwrap_model(self.model)
+    if not _internal_call and self.is_fsdp_enabled:
+        state_dict = self.accelerator.get_state_dict(self.model)
+        if self.accelerator.is_main_process:
+            model.save_pretrained(
+                output_dir,
+                is_main_process=self.accelerator.is_main_process,
+                save_function=self.accelerator.save,
+                state_dict=state_dict,
+            )
+            self.processing_class.save_pretrained(output_dir)
+    else:
+        super(SFTTrainer, self).save_model(output_dir, _internal_call, *args, **kwargs)
+
+
 def train():
     parser = transformers.HfArgumentParser((ModelArguments, TrainingArguments))
     model_args, training_args = parser.parse_args_into_dataclasses()
@@ -77,6 +98,9 @@ def train():
     # modelopt state will be saved automatically to "modelopt_state.pth"
     mto.enable_huggingface_checkpointing()

+    # HACK: Fix FSDP2-incompatible save_model() function for SFTTrainer
+    SFTTrainer.save_model = _save_model_fsdp_compat
+
     # Set total batch size across all ranks to equal 64
     total_batch_size = 64
     num_accum_steps = total_batch_size / (
@@ -91,19 +115,22 @@ def train():
         f"Using {int(num_accum_steps)} grad accumulation steps for effective batchsize of {total_batch_size}."
     )

+    # Dataset
     logger.info("Loading dataset...")
     dset = datasets.load_dataset("Open-Orca/OpenOrca", split="train")
     dset_splits = dset.train_test_split(train_size=25600, test_size=1700, seed=420)
     dset_train, dset_eval = dset_splits["train"], dset_splits["test"]
     logger.info("Dataset loaded.")

+    # Tokenizer
     logger.info("Loading tokenizer...")
     model_path = model_args.teacher_name_or_path or model_args.student_name_or_path
     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
     tokenizer.pad_token = tokenizer.eos_token
     tokenizer.padding_side = "right"
     logger.info("Tokenizer loaded.")

+    # Model
     if model_args.single_model:
         logger.info("Loading single model only...")
         model = transformers.AutoModelForCausalLM.from_pretrained(
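A note on the reordered allocator setting in the hunk above: PyTorch reads `PYTORCH_CUDA_ALLOC_CONF` when its CUDA caching allocator is initialized, so the variable must be set before the first CUDA allocation. A minimal standalone sketch of the same pattern, assuming nothing beyond what the diff already shows:

```python
# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so set it before any CUDA allocation; placing it before `import torch`
# (as the diff does) is the safest ordering.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (import intentionally placed after the env var)
```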
Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 pyarrow
+transformers<5.0
 trl>=0.23.0

examples/speculative_decoding/README.md

Lines changed: 54 additions & 15 deletions
@@ -43,14 +43,16 @@ pip install -U nvidia-modelopt[hf]
 pip install -r requirements.txt
 ```

-We use [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) dataset in this example. Download by:
+### Data Preparation
+
+We use the [Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater) dataset in this example. Prepare the data by:

 ```bash
-apt-get update && apt-get install -y git-lfs
-git lfs install --system
-git clone https://huggingface.co/datasets/nvidia/Daring-Anteater
+python prepare_input_conversations/add_daring_anteater.py
 ```

+See the [other-datasets](#other-datasets) section for other dataset options and instructions for user-provided data.
+
 ## Getting Started: Simplified Workflow

 ```bash
@@ -71,7 +73,7 @@ For small base models that fit in GPU memory, we can collocate them with draft m
 ```bash
 ./launch_train.sh --model $BASE_MODEL \
     --output_dir $OUTPUT_DIR \
-    --data Daring-Anteater/train.jsonl \
+    --data input_conversations/daring-anteater.jsonl \
     --num_gpu $NUM_GPU \
     --num_epochs $NUM_EPOCH \
     --eagle_config eagle_config.json
@@ -82,20 +84,35 @@ The saved modelopt checkpoint is similar in architecture to HF models. It can be

 ## Training Draft Model with Offline Base Model

-For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage depending on dataset size.
+For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of disk storage depending on dataset size.
+
+### Dumping Hidden States to Disk
+
+We support two backends for generating base model hidden states. For better efficiency, it is recommended to use TRT-LLM:
+
+```bash
+python collect_hidden_states/compute_hidden_states_trtllm.py \
+    --model $BASE_MODEL \
+    --input-file input_conversations/daring-anteater.jsonl \
+    --output-dir $HIDDEN_STATES_DIR
+```

-First, dump the base model's hidden states with the following command:
+**NOTE**: A TRT-LLM installation is required for the above command.
+
+Alternatively, you can generate the same hidden states with HF:

 ```bash
 python collect_hidden_states/compute_hidden_states_hf.py \
     --model $BASE_MODEL \
-    --input-file Daring-Anteater/train.jsonl \
+    --input-file input_conversations/daring-anteater.jsonl \
     --output-dir $HIDDEN_STATES_DIR
 ```

-See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) for a simple example using data parallelism (DP) to accelerate hidden state generation.
+**NOTE**: See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) and [`run_trtllm_compute_hiddens_dp.sh`](./collect_hidden_states/run_trtllm_compute_hiddens_dp.sh) for simple examples using data parallelism (DP) to accelerate hidden state generation.
+
+### Train Draft Model with Dumped Hidden States

-Then, train draft model with `--offline-data` argument:
+Once we finish dumping hidden states, launch offline training with an extra `--offline-data` argument:

 ```bash
 ./launch_train.sh --model $BASE_MODEL \
@@ -109,13 +126,13 @@ Then, train draft model with `--offline-data` argument:

 ## Model Validation

-After training draft model, we can evaluate the saved modelopt checkpoint on MT-bench by:
+For online training checkpoints, we can run in-framework evaluation on MT-bench:

 ```bash
-python ar_validate.py --model_path $OUTPUT_DIR
+python ar_validate.py --model_path $ONLINE_CKPT
 ```

-Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See sections below.
+**Note**: In-framework evaluation is supported only for online training. For offline training checkpoints, please export the model and evaluate it using serving frameworks.

 ## Export

@@ -168,6 +185,28 @@ See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/RE

 ## Advanced Usage

+### Other Datasets
+
+In addition to `daring-anteater`, we provide scripts for adding several other commonly used datasets in `prepare_input_conversations`:
+
+```text
+prepare_input_conversations/
+├── add_daring_anteater.py
+├── add_mtbench.py
+├── add_sharegpt.py
+├── add_ultrachat.py
+└── example_make_prompt_dataset.sh
+```
+
+To use your own datasets, please preprocess your data into a `.jsonl` file with each line in the format:
+
+```json
+{
+  "conversation_id": <unique id>,
+  "conversations": [{"role": <user or assistant>, "content": <content>}]
+}
+```
+
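For the custom-data path added just above, a minimal sketch of writing conversations into that `.jsonl` schema; the input list and output file name are placeholders, and only the `conversation_id` and `conversations` keys come from the format shown above.

```python
# Minimal sketch of writing a custom dataset into the .jsonl schema above.
# `raw_pairs` and the output path are placeholders; only the two keys
# "conversation_id" and "conversations" come from the README format.
import json

raw_pairs = [
    ("What is speculative decoding?", "It drafts tokens with a small model..."),
]

with open("my_dataset.jsonl", "w") as f:
    for idx, (question, answer) in enumerate(raw_pairs):
        record = {
            "conversation_id": idx,
            "conversations": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ],
        }
        f.write(json.dumps(record) + "\n")
```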
 ### Data Synthesis

 To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.
@@ -184,7 +223,7 @@ Note: Add `--quantization=modelopt` flag for quantized models.
 Then, we generate conversations with the base model using prompts from Daring-Anteater:

 ```bash
-python server_generate.py --data_path Daring-Anteater/train.jsonl --output_path synthetic/train.jsonl
+python server_generate.py --data_path input_conversations/daring-anteater.jsonl --output_path synthetic/train.jsonl
 ```

 To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
@@ -196,7 +235,7 @@ For large scale data generation, please see [SLURM prepare data](SLURM_prepare_d
 We can optionally use a smaller vocab size for the draft model for faster training and inference. For example, Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most frequently occurring vocab entries in our training set:

 ```bash
-python calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct --data Daring-Anteater/train.jsonl --draft_vocab_size 32000 --save_dir draft_vocab_cache
+python calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct --data input_conversations/daring-anteater.jsonl --draft_vocab_size 32000 --save_dir draft_vocab_cache
 ```

 This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
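To make the `d2t` mapping concrete, a small sketch of applying it at inference time; the file path and tensor contents are illustrative, and only the relation `target_token = draft_token + d2t[draft_token]` comes from the text above.

```python
# Sketch of applying the d2t mapping described above. The path and example
# token ids are illustrative; only the offset relation comes from the README.
import torch

d2t = torch.load("draft_vocab_cache/d2t.pt")      # one offset per draft token id
draft_tokens = torch.tensor([5, 17, 203])          # example draft token ids
target_tokens = draft_tokens + d2t[draft_tokens]   # map back into the full vocab
```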
