CHANGELOG.rst (1 addition, 0 deletions)

@@ -9,6 +9,7 @@ Model Optimizer Changelog (Linux)
 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
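The changelog entry only names the new entry point. A minimal sketch of how it might be called is shown below; only `modelopt.torch.peft.update_model` is taken from the changelog, while the `LORA_CFG` layout and the surrounding code are illustrative assumptions, not the library's documented defaults.

```python
# Hypothetical usage sketch for the new LoRA mode; only update_model() comes from
# the changelog, everything else (config layout, rank/scale values) is assumed.
import modelopt.torch.peft as mtp

# Assumed config shape: which modules to adapt and with what LoRA rank/scale.
LORA_CFG = {
    "adapter_type": "lora",
    "adapter_cfg": {"*": {"rank": 32, "scale": 1.0, "enable": True}},
}

def apply_lora(model):
    """Inject LoRA adapters into an already-built Megatron-Core model in place."""
    return mtp.update_model(model, LORA_CFG)  # call named in the changelog entry
```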
README.md (1 addition, 0 deletions)

@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
docs/source/deployment/1_tensorrt_llm.rst (5 additions, 2 deletions)

@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.

+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================

-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
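Since the deprecation notice in the hunks above points users at the unified HF export path, a brief sketch of that flow may help. It assumes `export_hf_checkpoint` from `modelopt.torch.export` and an already-quantized model; the keyword name and the example model ID are assumptions, so check the unified HF export docs for the exact signature.

```python
# Sketch: export a ModelOpt-quantized HF model with the unified export API
# instead of the deprecated export_tensorrt_llm_checkpoint. The export_dir
# keyword and model ID are assumptions for illustration.
from transformers import AutoModelForCausalLM
from modelopt.torch.export import export_hf_checkpoint

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# ... apply ModelOpt quantization to `model` here ...
export_hf_checkpoint(model, export_dir="exported_ckpt")  # consumable by TensorRT-LLM, vLLM, SGLang
```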
@@ -82,20 +84,35 @@ The saved modelopt checkpoint is similar in architecture to HF models. It can be

 ## Training Draft Model with Offline Base Model

-For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of storage depending on dataset size.
+For large models, you can export intermediate hidden states to disk and train only the draft model. This significantly reduces GPU memory requirements, but requires several to tens of terabytes of disk storage depending on dataset size.
+
+### Dumping Hidden States to Disk
+
+We support two backends for generating base model hidden states. For better efficiency, it is recommended to use TRT-LLM:

[...]

-See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) for a simple example using data parallelism (DP) to accelerate hidden state generation.
+**NOTE**: See [`run_hf_compute_hiddens_dp.sh`](./collect_hidden_states/run_hf_compute_hiddens_dp.sh) and [`run_trtllm_compute_hiddens_dp.sh`](./collect_hidden_states/run_trtllm_compute_hiddens_dp.sh) for a simple example using data parallelism (DP) to accelerate hidden state generation.
+
+### Train Draft Model with Dumped Hidden States

-Then, train draft model with `--offline-data` argument:
+Once we finish dumping hidden states, launch offline training with an extra `--offline-data` argument:

 ```bash
 ./launch_train.sh --model $BASE_MODEL \
@@ -109,13 +126,13 @@ Then, train draft model with `--offline-data` argument:

 ## Model Validation

-After training draft model, we can evaluate the saved modelopt checkpoint on MT-bench by:
+For online training checkpoints, we can run in-framework evaluation on MT-bench:

 ```bash
-python ar_validate.py --model_path $OUTPUT_DIR
+python ar_validate.py --model_path $ONLINE_CKPT
 ```

-Alternatively, we can export the checkpoint and run evaluation on serving frameworks. See sections below.
+**Note**: In-framework evaluation is supported only for online training. For offline training checkpoints, please export the model and evaluate it using serving frameworks.

 ## Export
@@ -168,6 +185,28 @@ See more details on deployment of quantized model to TRTLLM [here](../llm_ptq/RE

 ## Advanced Usage

+### Other Datasets
+
+In addition to `daring-anteater`, we provide scripts for adding several other commonly used datasets in `prepare_input_conversations`:
+
+```text
+prepare_input_conversations/
+├── add_daring_anteater.py
+├── add_mtbench.py
+├── add_sharegpt.py
+├── add_ultrachat.py
+└── example_make_prompt_dataset.sh
+```
+
+To use your own datasets, please preprocess your data into a `.jsonl` file with each line in the format:
+
+```json
+{
+    "conversation_id": <unique id>,
+    "conversations": [{"role": <user or assistant>, "content": <content>}]
+}
+```
+
 ### Data Synthesis

 To achieve higher acceptance rates during speculative decoding, it is beneficial to use conversations generated by the base model as training data. This ensures that the draft model's output distribution closely aligns with that of the base model.
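To make the `.jsonl` schema from the Other Datasets hunk above concrete, here is a small sketch that writes conversations in that format. The input list, IDs, and output file name are made up for illustration; only the record layout comes from the diff.

```python
# Sketch: convert an in-memory list of chats into the .jsonl layout expected by
# the training scripts (one {"conversation_id", "conversations"} object per line).
import json

chats = [  # illustrative data, not from the repository
    [{"role": "user", "content": "What is speculative decoding?"},
     {"role": "assistant", "content": "A technique that drafts tokens cheaply and verifies them with the base model."}],
]

with open("my_dataset.jsonl", "w") as f:
    for idx, turns in enumerate(chats):
        record = {"conversation_id": idx, "conversations": turns}
        f.write(json.dumps(record) + "\n")
```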
@@ -184,7 +223,7 @@ Note: Add `--quantization=modelopt` flag for quantized models.
 Then, we generate conversations with the base model using prompts from Daring-Anteater:
 To add a system prompt, use the `--system_prompt <system_prompt_text>` argument.
@@ -196,7 +235,7 @@ For large scale data generation, please see [SLURM prepare data](SLURM_prepare_d
 We can optionally use a smaller vocab size for the draft model for faster training and inference. E.g., Llama3.2-1B has a vocab size of 128256. In this example, we construct a draft vocab mapping of size 32k by finding the most commonly appearing vocabs in our training set:

[...]

 This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
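The draft-to-target mapping described above is just an index offset table. A short sketch of applying it at inference time follows; the `d2t.pt` file name and the offset formula come from the text, while the example token IDs and the assumption that the file holds a single integer tensor are illustrative.

```python
# Sketch: map draft-vocab token ids back to the base model's vocab using d2t.pt,
# following target_token = draft_token + d2t[draft_token] from the README.
import torch

d2t = torch.load("d2t.pt")                    # assumed: integer tensor of shape [draft_vocab_size]
draft_tokens = torch.tensor([17, 893, 4051])  # illustrative draft-model outputs
target_tokens = draft_tokens + d2t[draft_tokens]
print(target_tokens)                          # ids in the base model's full vocab
```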