
Commit 2edf3fa

update readme
1 parent 6a33bc5 commit 2edf3fa

File tree

3 files changed: +90, -66 lines changed


examples/llm_qat/README.md

Lines changed: 1 addition & 0 deletions

@@ -11,6 +11,7 @@ Quantization Aware Training (QAT) helps to improve the model accuracy beyond pos
| Support Matrix | View the support matrix to see quantization compatibility and feature availability across different models | \[[Link](#support-matrix)\] | |
| End to End QAT | Example scripts demonstrating quantization techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qat-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
| End to End QAD | Example scripts demonstrating quantization aware distillation techniques for optimizing Hugging Face models | \[[Link](#end-to-end-qad-example)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\] |
+| NeMo QAT/QAD Simplified Flow | Example script demonstrating end-to-end QAT/QAD in NeMo | \[[Link](../nemo_run/qat/README.md)\] | |
| Evaluate Accuracy | Evaluating model accuracy after QAT/QAD (with fake quantization) | \[[Link](#testing-qat-model-with-llm-benchmarks-for-accuracy-evaluation)\] | |
| Deployment | Deploying the model after QAT/QAD | \[[Link](#deployment)\] | |
| QLoRA | Model training with reduced GPU memory | \[[Link](#end-to-end-qlora-with-real-quantization)\] | |

examples/nemo_run/qat/ADVANCED.md

Lines changed: 56 additions & 0 deletions

# NeMo QAT/QAD Flow: Advanced Topics

This guide covers running QAT/QAD on a Slurm cluster, for example to train on more than one node.

To run the example on Slurm, edit the `SLURM_CONFIG` at the bottom of `nemo_qat_flow.py` with the appropriate credentials, container, cluster name (host), and container mounts. Make sure you are mounting the NeMo and Megatron-LM repositories in the Slurm cluster and that you've checked out the correct commits.

## Running the Flow on Slurm

To launch the Flow on a Slurm cluster, modify your Slurm credentials at the bottom of `nemo_qat_flow.py` and add the `--use-slurm` flag to the command. On a different server (e.g. your local server), launch the NeMo container as described in the [README](README.md), then run `python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir`, which will `ssh` into the Slurm cluster, `rsync` your files over, and launch the tasks.

**NOTE:** `rsync` may not currently be available in the NeMo container and will be added as a dependency.

After an experiment has run, the log directory on the Slurm cluster should look like this (assuming your experiment name is `qat_flow_ckpts`):

```
qat_flow_ckpts qat_flow_ckpts_1755708286
```
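
Putting the steps together, a minimal launch sequence from your local server could look like the sketch below. The mount paths, log directory, experiment name, and node/GPU counts are illustrative placeholders for the flags documented in the [README](README.md); adjust them to your cluster.

```
# Start the NeMo 25.07 container locally with the repositories mounted (paths are placeholders)
docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo \
  -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt \
  --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.07 bash

# Inside the container, after editing SLURM_CONFIG in nemo_qat_flow.py:
# ssh into the cluster, rsync the code over, and submit each stage as a Slurm job
python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir --experiment qat_flow_ckpts \
  --train-nodes 2 --train-gpus 8 --ptq-gpus 4
```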

If you `cd` into the experiment itself, e.g. `cd qat_flow_ckpts_1755708286`, you'll find a directory structure like the following. Each folder is for a stage of the Simplified Flow, and in each stage you can see the logs for that stage as well as the sbatch command that was run. You can `cd` into each stage and `tail -f` the log file to see the logs while the stage is running.

```
├── 00_openscience_data
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664.out
├── 01_import_model
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.01_import_model_5345665_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.01_import_model_5345665.out
├── 02_mmlu_bf16
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666.out
├── 03_ptq
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.03_ptq_5345667_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.03_ptq_5345667.out
├── 04_mmlu_ptq
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668.out
├── 05_train
│   ├── code
│   ├── configs
│   ├── log-coreai_dlalgo_modelopt-modelopt.05_train_5345669_0.out
│   └── sbatch_coreai_dlalgo_modelopt-modelopt.05_train_5345669.out
├── 06_mmlu_sft
│   ├── code
│   └── configs
├── 07_export_hf
│   ├── code
│   └── configs
```
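
For example, to follow the PTQ stage while it runs (the experiment directory and job ID below come from the example listing above; yours will differ):

```
# Tail the PTQ stage log from the Slurm log directory while the job is running
cd /slurm/log/dir/qat_flow_ckpts_1755708286/03_ptq
tail -f log-coreai_dlalgo_modelopt-modelopt.03_ptq_5345667_0.out
```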

examples/nemo_run/qat/README.md

Lines changed: 33 additions & 66 deletions

@@ -1,9 +1,19 @@
+<div align="center">
+
# NeMo QAT/QAD Simplified Flow Example

+[Slurm Examples](ADVANCED.md) |
+[Advanced Topics](ADVANCED.md) |
+[NeMo Integration](https://github.com/NVIDIA-NeMo/NeMo/tree/main/nemo/collections/llm/modelopt)
+
+</div>
+
## Overview

This directory contains an end-to-end QAT Simplified Flow example using NeMo for model training. It supports both QAT with cross-entropy loss and QAD (quantization-aware distillation) with knowledge-distillation loss between the BF16 teacher and quantized student models.

+After PTQ (post-training quantization), the quantized model may lose some accuracy; the QAT/QAD stages below aim to recover it.
+
## Flow Stages

Currently the Simplified Flow runs the following steps in order:

@@ -17,40 +27,32 @@ Currently the Simplified Flow runs the following steps in order:

```mermaid
graph TD;
-Data-->SFT;
-Import-->Evaluate_BF16;
-Import-->PTQ;
-PTQ-->Evaluate_PTQ;
-PTQ --> SFT;
-SFT-->Evaluate_SFT;
-SFT-->Export_SFT;
+00_openscience_data-->05_train;
+01_import_model-->02_mmlu_bf16;
+01_import_model-->03_ptq;
+03_ptq-->04_mmlu_ptq;
+03_ptq-->05_train;
+05_train-->06_mmlu_sft;
+05_train-->07_export_hf;
```

-## Supported models
-
-Locally this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm you can configure the number of nodes/gpus for training and PTQ with the following flags: `--train-nodes`, `--train-gpus`, `--ptq-gpus`.
-
-The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GPUs for training with the following model:
-
-- **Model**: Qwen3-8B
-- **Recipe**: qwen3_8b

## Usage

### Prerequisites

-You can run the example either locally or on a Slurm cluster.
+You can run the example either locally or on a [Slurm cluster](ADVANCED.md).

-To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.07 or higher using Docker on on a Slurm interactive node. Mount your cloned `modelopt` repository to the container by adding this mount flag to your Docker/Slurm command: `-v <modelopt-path>:/workspace/modelopt -v <modelopt-path>/modelopt:/usr/local/lib/python3.12/dist-packages/modelopt`.
-
-To run SFT properly you may also need to clone NeMo at the respective commits, and mount to `/opt/NeMo`:
+To run the example locally, launch a [NeMo container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) with version 25.07 or higher. Clone the `TensorRT-Model-Optimizer` repository and `NeMo` repository (checkout a specific commit for NeMo), then mount it onto your docker container.

+- `git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git`
- `git clone https://github.com/NVIDIA-NeMo/NeMo.git && cd NeMo && git checkout ddcb75f`

-To run the example on slurm, edit the `SLURM_CONFIG` at the bottom of `nemo_qat_flow.py` with the appropriate credentials, container, cluster name (host), and container mounts. Make sure you are mounting the NeMo and Megatron-LM repositories above in the Slurm cluster and that you've checked out the correct commits.
+Example docker command:
+```
+docker run -v /home/user/:/home/user/ -v /home/user/NeMo:/opt/NeMo -v /home/user/TensorRT-Model-Optimizer/modelopt/:/usr/local/lib/python3.12/dist-packages/modelopt --gpus all -it --shm-size 20g --rm nvcr.io/nvidia/nemo:25.07 bash
+```

-### Dataset limitations
-The current QAT recipe has been tuned for the Qwen3-8B model to improve accuracy on the MMLU benchmark after PTQ degradation. QAT/QAD results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that the same dataset will recover the accuracy of the PTQ model. Feel free to try your own model and dataset combinations and test which combination works best.

### Running the Flow Locally


@@ -78,55 +80,20 @@ To perform QAD training, run:
python qat/nemo_qat_flow.py --distill --log-dir /my/log/dir --experiment qad_experiment
```

-### Running the Flow on Slurm

-To launch the Flow on a Slurm cluster, modify your Slurm credentials at the bottom of `nemo_qat_flow.py` and add the `--use-slurm` flag to the command. On a different server (e.g. your local server), launch the NeMo container above then run `python qat/nemo_qat_flow.py --use-slurm --log-dir /slurm/log/dir`, which will `ssh` into the Slurm cluster, `rsync` your files over, and launch the tasks. The log directory on the Slurm cluster should look like this after an experiment is run (assuming your experiment name is `qat_flow_ckpts`)
+## Supported models

-```
-qat_flow_ckpts qat_flow_ckpts_1755708286
-```
+Locally this script currently supports models that can be trained on 1 node with 8 x 80GB GPUs. On Slurm you can configure the number of nodes/gpus for training and PTQ with the following flags: `--train-nodes`, `--train-gpus`, `--ptq-gpus`.

-If you `cd` into the experiment itself, e.g. `cd qat_flow_ckpts_1755708286`, you'll find a directory structure like the following. Each folder is for a stage of the Simplified Flow, and in each stage you can see the logs for that stage as well as the sbatch command that was run. You can `cd` into each stage and `tail -f` the log file to see the logs while the stage is running.
+The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GPUs for training with the following model:
+
+- **Model**: Qwen3-8B
+- **Recipe**: qwen3_8b

-```
-├── 00_openscience_data
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.00_openscience_data_5345664.out
-├── 01_import_model
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.01_import_model_5345665_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.01_import_model_5345665.out
-├── 02_mmlu_bf16
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.02_mmlu_bf16_5345666.out
-├── 03_ptq
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.03_ptq_5345667_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.03_ptq_5345667.out
-├── 04_mmlu_ptq
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.04_mmlu_ptq_5345668.out
-├── 05_train
-│   ├── code
-│   ├── configs
-│   ├── log-coreai_dlalgo_modelopt-modelopt.05_train_5345669_0.out
-│   └── sbatch_coreai_dlalgo_modelopt-modelopt.05_train_5345669.out
-├── 06_mmlu_sft
-│   ├── code
-│   └── configs
-├── 07_export_hf
-│   ├── code
-│   └── configs
-```

### Custom Chat Template

By default the script will use the model/tokenizer's chat template, which may not contain the `{% generation %}` and `{% endgeneration %}` tags around the assistant tokens which are needed to generate the assistant loss mask (see [this PR](https://github.com/huggingface/transformers/pull/30650)). To provide a path to a custom chat template, use the `--chat-template <my_template.txt>` flag.
+
+### Dataset limitations
+The current QAT recipe has been tuned for the Qwen3-8B model to improve accuracy on the MMLU benchmark after PTQ degradation. QAT/QAD results are highly dependent on the specific model, dataset, and hyperparameters. There is no guarantee that the same dataset will recover the accuracy of the PTQ model. Feel free to try your own model and dataset combinations and test which combination works best.
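
For illustration, the `--chat-template` flag described above can be combined with the other flow options; the template file, log directory, and experiment name below are hypothetical placeholders.

```
# Run the flow with a custom chat template whose assistant turns are wrapped in
# {% generation %} ... {% endgeneration %} tags (my_template.txt is a placeholder)
python qat/nemo_qat_flow.py --chat-template my_template.txt --log-dir /my/log/dir --experiment qat_experiment
```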

0 commit comments
