
Commit 2edf3fa

update readme
1 parent 6a33bc5 commit 2edf3fa

File tree

3 files changed: +90, -66 lines changed


examples/llm_qat/README.md

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ Quantization Aware Training (QAT) helps to improve the model accuracy beyond pos
| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | \[[Link](#support-matrix)\] | |
| End to End QAT | Example scripts demonstrating quantization techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qat-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
| End to End QAD | Example scripts demonstrating quantization aware distillation techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qad-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
+| NeMo QAT/QAD Simplified Flow | Example script demonstrating end-to-end QAT/QAD in NeMo | \[[Link](../nemo_run/qat/README.md)\] | |
| Evaluate Accuracy | Evaluating model accuracy after QAT/QAD (with fake quantization) | \[[Link](#testing-qat-model-with-llm-benchmarks-for-accuracy-evaluation)\] | |
| Deployment | Deploying the model after QAT/QAD | \[[Link](#deployment)\] | |
| QLoRA | Model training with reduced GPU memory | \[[Link](#end-to-end-qlora-with-real-quantization)\] | |

examples/nemo_run/qat/ADVANCED.md

Lines changed: 56 additions & 0 deletions

# NeMo QAT/QAD Flow: Advanced Topics

This guide covers running QAT/QAD on a Slurm cluster, for example to train on more than one node.

To run the example on Slurm, edit the `SLURM_CONFIG` at the bottom of `nemo_qat_flow.py` with the appropriate credentials, container, cluster name (host), and container mounts. Make sure you are mounting the NeMo and Megatron-LM repositories in the Slurm cluster and that you've checked out the correct commits.

## Running the Flow on Slurm

To launch the Flow on a Slurm cluster, modify your Slurm credentials at the bottom of `nemo_qat_flow.py` and add the `--use-slurm` flag to the command. On a different server (e.g. your local server), launch the NeMo container as described in the [README](README.md), then run `python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir`, which will `ssh` into the Slurm cluster, `rsync` your files over, and launch the tasks.

**NOTE:** `rsync` may not currently be available in the NeMo container and will be added as a dependency.

After an experiment has run, the log directory on the Slurm cluster should look like this (assuming your experiment name is `qat_flow_ckpts`):

```
qat_flow_ckpts qat_flow_ckpts_1755708286
```
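
Putting the steps together, a minimal launch sequence from your local server could look like the sketch below. The mount paths, log directory, experiment name, and node/GPU counts are illustrative placeholders for the flags documented in the [README](README.md); adjust them to your cluster.

```
# Start the NeMo 25.07 container locally with the repositories mounted (paths are placeholders)
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo \
  -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt \
  --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.07 bash

# Inside the container, after editing SLURM_CONFIG in nemo_qat_flow.py:
# ssh into the cluster, rsync the code over, and submit each stage as a Slurm job
python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir --experiment qat_flow_ckpts \
  --train-nodes 2 --train-gpus 8 --ptq-gpus 4
```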

If you `cd` into the experiment itself, e.g. `cd qat_flow_ckpts_1755708286`, you'll find a directory structure like the following. Each folder is for a stage of the Simplified Flow, and in each stage you can see the logs for that stage as well as the sbatch command that was run. You can `cd` into each stage and `tail -f` the log file to see the logs while the stage is running.

```
├── 00_openscience_data
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664.out
├── 01_import_model
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.01_import_model_5345665_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.01_import_model_5345665.out
├── 02_mmlu_bf16
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666.out
├── 03_ptq
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.03_ptq_5345667_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.03_ptq_5345667.out
├── 04_mmlu_ptq
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668.out
├── 05_train
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.05_train_5345669_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.05_train_5345669.out
├── 06_mmlu_sft
│   ├── code
│   └── configs
├── 07_export_hf
│   ├── code
│   └── configs
```
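
For example, to follow the PTQ stage while it runs (the experiment directory and job ID below come from the example listing above; yours will differ):

```
# Tail the PTQ stage log from the Slurm log directory while the job is running
cd /slurm/log/dir/qat_flow_ckpts_1755708286/03_ptq
tail -f log-coreai_dlalgo_modelopt-modelopt.03_ptq_5345667_0.out
```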

examples/nemo_run/qat/README.md

Lines changed: 33 additions & 66 deletions

@@ -1,9 +1,19 @@
+<div align="center">
+
# NeMo QAT/QAD Simplified Flow Example

+[Slurm Examples](ADVANCED.md) |
+[Advanced Topics](ADVANCED.md) |
+[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)
+
+</div>
+
## Overview

This directory contains an end-to-end QAT Simplified Flow example using NeMo for model training. It supports both QAT with cross-entropy loss and QAD (quantization-aware distillation) with knowledge-distillation loss between the BF16 teacher and quantized student models.

+After PTQ (post-training quantization), the quantized model may lose some accuracy; the QAT/QAD stages below aim to recover it.
+
## Flow Stages

Currently the Simplified Flow runs the following steps in order:

@@ -17,40 +27,32 @@ Currently the Simplified Flow runs the following steps in order:

```mermaid
graph TD;
-Data-->SFT;
-Import-->Evaluate_BF16;
-Import-->PTQ;
-PTQ-->Evaluate_PTQ;
-PTQ --> SFT;
-SFT-->Evaluate_SFT;
-SFT-->Export_SFT;
+00_openscience_data-->05_train;
+01_import_model-->02_mmlu_bf16;
+01_import_model-->03_ptq;
+03_ptq-->04_mmlu_ptq;
+03_ptq-->05_train;
+05_train-->06_mmlu_sft;
+05_train-->07_export_hf;
```

-## Supported models
-
-Locally this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm you can configure the number of nodes/gpus for training and PTQ with the following flags: `--train-nodes`, `--train-gpus`, `--ptq-gpus`.
-
-The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GPUs for training with the following model:
-
-- **Model**: Qwen3-8B
-- **Recipe**: qwen3_8b

## Usage

### Prerequisites

-You can run the example either locally or on a Slurm cluster.
+You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

-To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.07 or higher using Docker on on a Slurm interactive node. Mount your cloned `modelopt` repository to the container by adding this mount flag to your Docker/Slurm command: `-v <modelopt-path>:/workspace/modelopt -v <modelopt-path>/modelopt:/usr/local/lib/python3.12/dist-packages/modelopt`.
-
-To run SFT properly you may also need to clone NeMo at the respective commits, and mount to `/opt/NeMo`:
+To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.07 or higher. Clone the `TensorRT-Model-Optimizer` repository and `NeMo` repository (checkout a specific commit for NeMo), then mount it onto your docker container.

+- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`
- `git clone https://github.com/NVIDIA-NeMo/NeMo.git && cd NeMo && git checkout ddcb75f`

-To run the example on slurm, edit the `SLURM_CONFIG` at the bottom of `nemo_qat_flow.py` with the appropriate credentials, container, cluster name (host), and container mounts. Make sure you are mounting the NeMo and Megatron-LM repositories above in the Slurm cluster and that you've checked out the correct commits.
+Example docker command:
+```
+docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.07 bash
+```

-### Dataset limitations
-The current QAT recipe has been tuned for the Qwen3-8B model to improve accuracy on the MMLU benchmark after PTQ degradation. QAT/QAD results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that the same dataset will recover the accuracy of the PTQ model. Feel free to try your own model and dataset combinations and test which combination works best.

### Running the Flow Locally


@@ -78,55 +80,20 @@ To perform QAD training, run:
python qat/nemo_qat_flow.py --distill --log-dir /my/log/dir --experiment qad_experiment
```

-### Running the Flow on Slurm

-To launch the Flow on a Slurm cluster, modify your Slurm credentials at the bottom of `nemo_qat_flow.py` and add the `--use-slurm` flag to the command. On a different server (e.g. your local server), launch the NeMo container above then run `python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir`, which will `ssh` into the Slurm cluster, `rsync` your files over, and launch the tasks. The log directory on the Slurm cluster should look like this after an experiment is run (assuming your experiment name is `qat_flow_ckpts`)
+## Supported models

-```
-qat_flow_ckpts qat_flow_ckpts_1755708286
-```
+Locally this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm you can configure the number of nodes/gpus for training and PTQ with the following flags: `--train-nodes`, `--train-gpus`, `--ptq-gpus`.

-If you `cd` into the experiment itself, e.g. `cd qat_flow_ckpts_1755708286`, you'll find a directory structure like the following. Each folder is for a stage of the Simplified Flow, and in each stage you can see the logs for that stage as well as the sbatch command that was run. You can `cd` into each stage and `tail -f` the log file to see the logs while the stage is running.
+The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GPUs for training with the following model:
+
+- **Model**: Qwen3-8B
+- **Recipe**: qwen3_8b

-```
-├── 00_openscience_data
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664.out
-├── 01_import_model
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.01_import_model_5345665_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.01_import_model_5345665.out
-├── 02_mmlu_bf16
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666.out
-├── 03_ptq
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.03_ptq_5345667_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.03_ptq_5345667.out
-├── 04_mmlu_ptq
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668.out
-├── 05_train
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.05_train_5345669_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.05_train_5345669.out
-├── 06_mmlu_sft
-│   ├── code
-│   └── configs
-├── 07_export_hf
-│   ├── code
-│   └── configs
-```

### Custom Chat Template

By default the script will use the model/tokenizer's chat template, which may not contain the `{% generation %}` and `{% endgeneration %}` tags around the assistant tokens which are needed to generate the assistant loss mask (see [this PR](https://github.com/huggingface/transformers/pull/30650)). To provide a path to a custom chat template, use the `--chat-template <my_template.txt>` flag.
+
+### Dataset limitations
+The current QAT recipe has been tuned for the Qwen3-8B model to improve accuracy on the MMLU benchmark after PTQ degradation. QAT/QAD results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that the same dataset will recover the accuracy of the PTQ model. Feel free to try your own model and dataset combinations and test which combination works best.
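
For illustration, the `--chat-template` flag described above can be combined with the other flow options; the template file, log directory, and experiment name below are hypothetical placeholders.

```
# Run the flow with a custom chat template whose assistant turns are wrapped in
# {% generation %} ... {% endgeneration %} tags (my_template.txt is a placeholder)
python qat/nemo_qat_flow.py --chat-template my_template.txt --log-dir /my/log/dir --experiment qat_experiment
```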

0 commit comments
