
Commit e6f3ed7

add results to readme
Signed-off-by: Jennifer Chen <[email protected]>
1 parent 51d14be commit e6f3ed7

File tree

1 file changed: 14 additions, 0 deletions


examples/nemo_run/qat/README.md

Lines changed: 14 additions & 0 deletions
@@ -38,6 +38,16 @@ graph TD;
05_train-->07_export_hf;
```

## Results

QAT of Qwen3-8B NVFP4 recovers most of the MMLU accuracy lost after NVFP4 PTQ. We fine-tune the Qwen3-8B NVFP4 checkpoint for 200 steps with a learning rate of 1e-5 and a global batch size of 512; a recipe sketch of these settings follows the table.

| Model                    | MMLU 5% |
|--------------------------|---------|
| Qwen3-8B FP16            | 73.8    |
| Qwen3-8B NVFP4           | 70.3    |
| Qwen3-8B NVFP4 after QAT | 72.8    |

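These hyperparameters map onto the NeMo 2.0 recipe API roughly as follows. This is a minimal sketch under assumptions: the QAT flow script sets these values itself, and the `llm.qwen3_8b` handle is inferred from this example's `qwen3_8b` recipe name.

```python
# Sketch only: how the finetuning hyperparameters would be set via the
# NeMo 2.0 recipe API. The QAT flow script configures this for you; the
# `llm.qwen3_8b` handle is an assumption based on the qwen3_8b recipe name.
from nemo.collections import llm

recipe = llm.qwen3_8b.finetune_recipe(name="qat_qwen3_8b", num_nodes=1, num_gpus_per_node=8)
recipe.trainer.max_steps = 200        # fine-tune for 200 steps
recipe.optim.config.lr = 1e-5         # learning rate 1e-5
recipe.data.global_batch_size = 512   # global batch size 512
```
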
## Usage

### Prerequisites
@@ -92,6 +102,10 @@ The default configuration works on 1 node with 4 H100 GPUs for PTQ and 8 H100 GP
- **Model**: Qwen3-8B
- **Recipe**: qwen3_8b

### Common Errors
Depending on how much memory your GPUs have, you may encounter an out-of-memory (OOM) error. If that happens, add the `--tensor_parallelism` or `--pipeline_parallelism` flag (e.g. `--tensor_parallelism 2`); see the example below.
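
For example (a hypothetical invocation; the script name is illustrative, so substitute the launch command from the Usage section):

```bash
# Sketch only: shard the model across 2 GPUs with tensor parallelism to
# reduce per-GPU memory pressure. The script name is an assumption.
python nemo_qat_flow.py --tensor_parallelism 2
```
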
### Custom Chat Template

By default the script uses the model/tokenizer's chat template, which may not contain the `{% generation %}` and `{% endgeneration %}` tags around the assistant tokens that are needed to generate the assistant loss mask (see [this PR](https://github.com/huggingface/transformers/pull/30650)). To provide a path to a custom chat template, use the `--chat-template <my_template.txt>` flag.
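
As an illustration of what the tags do, the sketch below (assuming the Hugging Face `transformers` tokenizer API; the template fragment is illustrative, not the shipped Qwen3 template) shows how `{% generation %}` blocks let the tokenizer return an assistant token mask:

```python
# Sketch only: a chat template must wrap assistant tokens in
# {% generation %} ... {% endgeneration %} for the tokenizer to return an
# assistant loss mask. The template fragment below is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'assistant' %}"
    "<|im_start|>assistant\n"
    "{% generation %}{{ message['content'] }}{% endgeneration %}<|im_end|>\n"
    "{% else %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endif %}"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]
out = tokenizer.apply_chat_template(
    messages,
    chat_template=chat_template,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
# 1s mark the assistant's tokens; everything else is masked out of the loss.
print(out["assistant_masks"])
```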
