This directory contains an end-to-end QAT Simplified Flow example using NeMo for model training. It supports both QAT with cross-entropy loss and QAD (quantization-aware distillation) with knowledge-distillation loss between the BF16 teacher and quantized student models.
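
As a rough sketch of the two objectives (the exact losses used by the flow are not spelled out in this excerpt, so treat the formulas as illustrative): QAT minimizes cross-entropy against the labels with a quantized student $p_s$, while QAD instead distills the BF16 teacher $p_t$ into the student, e.g.

$$
\mathcal{L}_{\text{QAT}} = \mathrm{CE}\big(y,\; p_s(x)\big),
\qquad
\mathcal{L}_{\text{QAD}} = \mathrm{KL}\big(p_t(x) \,\big\|\, p_s(x)\big).
$$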

After PTQ (post-training quantization), the quantized model may show some accuracy degradation; QAT or QAD fine-tuning can help recover it.

## Flow Stages

```mermaid
graph TD;
%% ... (earlier stages omitted)
05_train-->07_export_hf;
```

## Usage
### Prerequisites

To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) and clone NeMo at the pinned commit:

`git clone https://github.com/NVIDIA-NeMo/NeMo.git && cd NeMo && git checkout ddcb75f`

Example docker command:

```
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.07 bash
```

### Running the Flow Locally
After launching the NeMo container with the specified mounts, follow these examples to run the flow locally.
Locally, this script currently supports models that can be trained on a single node with 8 x 80GB GPUs. On Slurm you can configure the number of nodes/GPUs for training and PTQ with the following flags: `--train-nodes`, `--train-gpus`, `--ptq-gpus`; a sketch follows below.
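
As a rough illustration of how these flags combine (the launcher script name `nemo_qat_flow.py` is an assumption for illustration; only the three flags above are confirmed by this README):

```
# Hypothetical invocation: size PTQ and training independently on Slurm.
# Only --train-nodes/--train-gpus/--ptq-gpus come from this README; the
# script name is assumed.
python nemo_qat_flow.py \
    --train-nodes 1 \
    --train-gpus 8 \
    --ptq-gpus 4
```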

The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GPUs for training.

**Model**: Qwen3-8B
**Recipe**: qwen3_8b

### Custom Chat Template
By default the script will use the model/tokenizer's chat template, which may not contain the `{% generation %}` and `{% endgeneration %}` tags around the assistant tokens; these tags are needed to generate the assistant loss mask (see [this PR](https://github.com/huggingface/transformers/pull/30650)). To provide a path to a custom chat template, use the `--chat-template <my_template.txt>` flag, as in the sketch below.
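
As a minimal sketch only (the template body and file name below are assumptions; only the `{% generation %}`/`{% endgeneration %}` tags and the `--chat-template` flag come from this README):

```
# Write a hypothetical minimal Jinja chat template that wraps assistant
# content in generation tags so the assistant loss mask can be derived,
# then pass it to the flow via --chat-template my_template.txt.
cat > my_template.txt <<'EOF'
{%- for message in messages -%}
{%- if message['role'] == 'assistant' -%}
{% generation %}{{ message['content'] }}{% endgeneration %}
{%- else -%}
{{ message['content'] }}
{%- endif -%}
{%- endfor -%}
EOF
```
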
### Dataset Limitations
The current QAT recipe has been tuned for the Qwen3-8B model to improve accuracy on the MMLU benchmark after PTQ degradation. QAT/QAD results are highly dependent on the specific model, dataset, and hyperparameters, and there is no guarantee that the same dataset will recover the accuracy lost during PTQ. Feel free to try your own model and dataset combinations and test which combination works best.