Quantization Aware Training (QAT) helps to improve the model accuracy beyond post-training quantization (PTQ).

| Section | Description | Link | Docs |
| ------- | ----------- | ---- | ---- |
| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | \[[Link](#support-matrix)\] | |
| End to End QAT | Example scripts demonstrating quantization techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qat-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
| End to End QAD | Example scripts demonstrating quantization aware distillation techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qad-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
| NeMo QAT/QAD Simplified Flow | Example script demonstrating end-to-end QAT/QAD in NeMo | \[[Link](../nemo_run/qat/README.md)\] | |
| Evaluate Accuracy | Evaluating model accuracy after QAT/QAD (with fake quantization) | \[[Link](#testing-qat-model-with-llm-benchmarks-for-accuracy-evaluation)\] | |
| Deployment | Deploying the model after QAT/QAD | \[[Link](#deployment)\] | |
| QLoRA | Model training with reduced GPU memory | \[[Link](#end-to-end-qlora-with-real-quantization)\] | |
## Running the Flow on Slurm

If you need to run QAT/QAD on a Slurm cluster (for example, to use more than one node), follow the steps below.

To run the example on Slurm, edit the `SLURM_CONFIG` at the bottom of `nemo_qat_flow.py` with the appropriate credentials, container, cluster name (host), and container mounts. Make sure you are mounting the NeMo and Megatron-LM repositories on the Slurm cluster and that you've checked out the correct commits (see the prerequisites under Usage below).
To launch the Flow on a Slurm cluster, modify your Slurm credentials at the bottom of `nemo_qat_flow.py` and add the `--use-slurm` flag to the command. On a different server (e.g. your local server), launch the NeMo container as described in the [README](README.md), then run `python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir`, which will `ssh` into the Slurm cluster, `rsync` your files over, and launch the tasks. The log directory on the Slurm cluster should look like this after an experiment is run (assuming your experiment name is `qat_flow_ckpts`).
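For reference, a hedged sketch of a full launch command (the node/GPU flags are optional and are described under Supported models below; the values shown are placeholders):

```bash
# Hedged example: launch the Simplified Flow against the Slurm cluster from your local machine.
# --log-dir should point to a directory on the Slurm cluster's filesystem.
python qat/nemo_qat_flow.py \
  --use-slurm \
  --log-dir /slurm/log/dir \
  --train-nodes 1 --train-gpus 8 --ptq-gpus 4
```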
**NOTE:** `rsync` may not currently be available in the NeMo container and will be added as a dependency.
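If `rsync` is missing, it can usually be installed inside the container before launching the flow; a minimal sketch, assuming the NeMo container image is Debian/Ubuntu-based with `apt` available:

```bash
# Assumption: the container image is Debian/Ubuntu-based; install rsync before launching the flow.
apt-get update && apt-get install -y rsync
```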
```
qat_flow_ckpts qat_flow_ckpts_1755708286
```
If you `cd` into the experiment itself, e.g. `cd qat_flow_ckpts_1755708286`, you'll find one folder per stage of the Simplified Flow; each stage folder contains the logs for that stage as well as the sbatch command that was run. You can `cd` into a stage and `tail -f` its log file to follow the logs while the stage is running.
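For example (hedged; the experiment, stage, and log file names below are illustrative):

```bash
# Illustrative only: follow the training stage's log while it runs.
cd /slurm/log/dir/qat_flow_ckpts_1755708286/05_train
tail -f ./*.out   # the actual log file name in the stage directory may differ
```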
## Overview

This directory contains an end-to-end QAT Simplified Flow example using NeMo for model training. It supports both QAT with cross-entropy loss and QAD (quantization-aware distillation) with knowledge-distillation loss between the BF16 teacher and quantized student models.

After PTQ (post-training quantization), the quantized model may show degraded accuracy; the subsequent QAT/QAD training stage fine-tunes the quantized model to help recover it.
## Flow Stages
Currently the Simplified Flow runs the following steps in order:
```mermaid
graph TD;
00_openscience_data-->05_train;
01_import_model-->02_mmlu_bf16;
01_import_model-->03_ptq;
03_ptq-->04_mmlu_ptq;
03_ptq-->05_train;
05_train-->06_mmlu_sft;
05_train-->07_export_hf;
```
## Usage
### Prerequisites
You can run the example either locally or on a [Slurm cluster](ADVANCED.md).
To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.07 or higher. Clone the `TensorRT-Model-Optimizer` repository and the `NeMo` repository (checking out the specific NeMo commit below), then mount them into your Docker container:

- `git clone https://github.com/NVIDIA-NeMo/NeMo.git && cd NeMo && git checkout ddcb75f`
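The Model Optimizer repository can be cloned alongside it; a minimal sketch, assuming both repositories live under `/home/user/` so the paths line up with the example docker command below:

```bash
# Assumption: the clone destination matches the -v mounts in the docker command below.
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git /home/user/TensorRT-Model-Optimizer
```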
Example docker command:
```
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.07 bash
```
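From inside the container, the flow can then be launched locally; a hedged sketch (it assumes the working directory is the `examples/nemo_run` folder of the mounted repository and that `--log-dir` accepts a local path when `--use-slurm` is omitted):

```bash
# Assumed local invocation; adjust paths to your mounts.
cd /home/user/TensorRT-Model-Optimizer/examples/nemo_run
python qat/nemo_qat_flow.py --log-dir ./qat_logs
```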
## Supported models
Locally, this script currently supports models that can be trained on one node with 8 x 80GB GPUs. On Slurm, you can configure the number of nodes/GPUs for training and PTQ with the following flags: `--train-nodes`, `--train-gpus`, `--ptq-gpus`.
The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GPUs for training with the following model:

- **Model**: Qwen3-8B
- **Recipe**: qwen3_8b
### Custom chat template

By default the script will use the model/tokenizer's chat template, which may not contain the `{% generation %}` and `{% endgeneration %}` tags around the assistant tokens that are needed to generate the assistant loss mask (see [this PR](https://github.com/huggingface/transformers/pull/30650)). To provide a path to a custom chat template, use the `--chat-template <my_template.txt>` flag.
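A hedged usage sketch (the template path and log directory are placeholders):

```bash
# Assumption: a custom Jinja chat template with {% generation %} tags saved locally.
python qat/nemo_qat_flow.py --log-dir ./qat_logs --chat-template ./my_template.txt
```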
### Dataset limitations
The current QAT recipe has been tuned for the Qwen3-8B model to improve accuracy on the MMLU benchmark after PTQ degradation. QAT/QAD results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that the same dataset will recover the accuracy of the PTQ model. Feel free to try your own model and dataset combinations and test which combination works best.