
Commit 6423b36

docs: Update gpt-oss qat blog (#191)
* Update 2025-08-28-gpt-oss-qat.md
* minor update
* add update note
* fix typo

Co-authored-by: Eduardo Alvarez <[email protected]>
1 parent 8984235 commit 6423b36

1 file changed: blog/2025-08-28-gpt-oss-qat.md (37 additions, 41 deletions)
@@ -1,38 +1,42 @@
---
-title: "Finetune and deploy GPT-OSS in MXFP4: ModelOpt+SGLang"
+title: "Fine-tune and deploy gpt-oss MXFP4: ModelOpt + SGLang"
author: "NVIDIA ModelOpt Team"
date: "Aug 28, 2025"
previewImg: /images/blog/nvidia-gpt-oss-qat/preview-gpt-oss-qat.png
---

+(Updated on Aug 29)

-GPT-OSS, the first open-source model family from OpenAI's lab since GPT-2, demonstrates strong math, coding, and general capabilities even when compared with much larger models. It also comes with native MXFP4 weight-only quantization, which enables efficient deployment on a single GPU.
+OpenAI recently released gpt-oss, the first open-source model family from OpenAI's lab since GPT-2. These models demonstrate strong math, coding, and general capabilities. Part of the model's uniqueness is that it was released in native MXFP4 weight-only quantization. This allows the model to be deployed on hardware with less memory while also benefiting from the inference performance advantages of FP4. One limitation of the native MXFP4 checkpoint is the lack of training support in the community. Many use cases require fine-tuning LLMs to modify their behavior (e.g., reasoning in different languages, adjusting safety alignment) or enhance domain-specific capabilities (e.g., function calling, SQL scripting). Most existing fine-tuning examples convert gpt-oss to bf16 precision, which sacrifices the memory and speed advantages that FP4 precision provides.

-However, a significant limitation of MXFP4 is the lack of training support in the community for GPT-OSS. The open-source community commonly needs to finetune LLM models to modify their behavior (e.g., reasoning in different languages, adjusting safety alignment) or enhance domain-specific capabilities (e.g., function calling, SQL scripting). Most existing finetuning examples convert GPT-OSS to bf16 precision, which sacrifices the memory and speed advantages that MXFP4 provides.
-
-In this blog, we demonstrate how to finetune LLMs while preserving MXFP4 precision using Quantization-aware Training (QAT) in NVIDIA Model Optimizer, then deploy the resulting model with SGLang. Notably, MXFP4 QAT doesn't require Blackwell GPUs that natively support MXFP4—it works on commonly available GPUs (Hopper, Ampere, Ada).
+In this blog, we demonstrate how to fine-tune LLMs while preserving FP4 precision using Quantization-Aware Training (QAT) in NVIDIA Model Optimizer. We then show how to deploy the resulting model with SGLang. Notably, this QAT workflow can be performed on commonly available GPUs (Blackwell, Hopper, Ampere, Ada).

### What is Quantization-Aware Training (QAT)

-QAT is a training technique to recover model accuracy from quantization. We show above a simplified illustration of QAT. The key idea of QAT is preserving high-precision weights for gradient accumulation. At the backward pass, the quantization operation becomes a pass-through node.
+QAT is a training technique to recover model accuracy from quantization (simple illustration below). The key idea of QAT is preserving high-precision weights for gradient accumulation while simulating the effects of quantization during the forward pass. By exposing the original model weights to the effects of quantization, we are able to more accurately adapt the model to the representable ranges of the target data type.
+
![qat.png](/images/blog/nvidia-gpt-oss-qat/qat.png)
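
To make the forward/backward behavior concrete, here is a minimal, hypothetical PyTorch sketch of fake quantization with a straight-through estimator. It is not ModelOpt's actual MXFP4 quantizer: it uses a simplified signed 4-bit grid with per-block scales and assumes the weight size is divisible by the block size.

```py
import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Forward: simulate quantization by snapping weights onto a low-precision grid.
        # (A signed 4-bit integer grid here for simplicity; MXFP4 uses an FP4 grid.)
        return torch.round(w / scale).clamp(-8, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: straight-through estimator. The gradient skips the rounding op,
        # so updates accumulate directly in the high-precision master weights.
        return grad_output, None

def fake_quantize(w: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    # Per-block scales, loosely mimicking block-wise FP4 formats (illustrative only).
    blocks = w.reshape(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True) / 7.0 + 1e-12
    return FakeQuant.apply(blocks, scale).reshape(w.shape)
```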

-Below is a more detailed guide of QAT:
+#### Different Low-Precision Training Techniques
+It should be noted that native quantized training and QLoRA are often confused with QAT, but they serve different purposes. The table below provides descriptions to help distinguish these use cases.

-- Step 1 (Optional): Train/fine-tune the model in the original precision. This makes sure a good starting point before QAT.
-- Step 2: Insert quantizer nodes into the model graph. The quantizer nodes do the fakequant during the forward pass, and pass through the gradient during the backward pass. This step is handled by ModelOpt.
-- Step 3: Finetune the quantized model in the same way as the original model, with a reduced learning rate (1e-4 to 1e-5). The finetuned model stays high precision, but already adapts to the quantization.
-- Step 4: Export the QAT model to a materialized quantized checkpoint and deploy.
+| Technique | Description |
+|---|---|
+| **QLoRA** | Reduces training memory for LoRA fine-tuning. At inference, it either keeps quantized weights and LoRA separate or merges LoRA into high-precision weights. |
+| **Native quantized training** | Enables efficient training and inference. Requires native hardware support. |
+| **QAT** | Improves quantized inference accuracy. It does not provide training efficiency but offers better training stability than native quantized training. |

-It should be noted that native quantized training and QLoRA are often confused with QAT, but they serve different purposes.
+### QAT Fine-tuning Recipe for gpt-oss
+QAT fine-tuning is quite straightforward and can be completed in a few steps:

-- **QLoRA** reduces training memory for LoRA finetuning. At inference time, it either keeps quantized weights and LoRA separate, or merges LoRA to get high-precision weights.
-- **Native quantized training** enables efficient training and inference. Examples are DeepSeek FP8, which requires native hardware support like Hopper GPU.
-- **QAT** empowers quantized inference with better accuracy. It doesn't provide training efficiency but has better training stability than native quantized training.
+- **Step 1 (Optional)**: Fine-tune the model in the original precision. This establishes a good starting point before QAT.
+- **Step 2**: Insert quantizer nodes into the model graph. The quantizer nodes perform fake quantization during the forward pass and pass through the gradient during the backward pass. This step is handled by Model Optimizer.
+- **Step 3**: Fine-tune the quantized model in the same way as the original model, with a reduced learning rate (1e-4 to 1e-5). The fine-tuned model stays in high precision but already adapts to the quantization.
+- **Step 4**: Export the QAT quantized checkpoint and deploy.

-### QAT with NVIDIA ModelOpt
+### QAT with NVIDIA Model Optimizer

-Here is the sample code to do QAT with ModelOpt. For full code examples, please refer to ModelOpt's [GPT-OSS QAT examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss) for gpt-oss-20B and gpt-oss-120B.
+Here is the sample code to perform QAT with Model Optimizer. For full code examples, please refer to Model Optimizer's [gpt-oss QAT examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss).

```py
import modelopt.torch.quantization as mtq
@@ -50,10 +54,10 @@ model = mtq.quantize(model, config, forward_loop=None)
train(model, train_loader, optimizer, scheduler, ...)

```
+#### Fine-tuning Downstream Tasks with MXFP4
+We demonstrate two sample fine-tuning use cases for gpt-oss: enabling non-English reasoning with the [Multi-lingual dataset from OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) and reducing over-refusal of safe user prompts with the [Amazon FalseReject dataset](https://huggingface.co/datasets/AmazonScience/FalseReject). Out of the box, gpt-oss shows room for improvement on these tasks.

-We demonstrate two fine-tuning use cases for GPT-OSS: enabling non-English reasoning with [Multi-lingual dataset from OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) and reducing over-refusal of safe user prompts with [Amazon FalseReject dataset](https://huggingface.co/datasets/AmazonScience/FalseReject). GPT-OSS originally performs poorly in both cases.
-
-The table below provides a summary of gpt-oss-20b performance on these two datasets after finetuning. SFT only provides good accuracy, but SFT creates a high-precision model. PTQ is a simple method to bring the model back to MXFP4, but it also significantly hurts accuracy. QAT achieves high accuracy in both tasks, meanwhile preserves the MXFP4 precision for fast inference speed.
+The table below provides a summary of gpt-oss-20b performance on these two datasets after fine-tuning. SFT provides good accuracy but results in a high-precision model. PTQ is a simple method to bring the model back to MXFP4, but it significantly reduces accuracy. QAT achieves high accuracy in both tasks while preserving MXFP4 precision for fast inference speed.

| gpt-oss-20b | Pass rate on Multi-Lingual val subset | Pass rate on FalseReject val subset |
| :---: | :---: | :---: |
@@ -62,26 +66,24 @@ The table below provides a summary of gpt-oss-20b performance on these two datas
| **SFT \+ PTQ (MXFP4)** | 89% | 59% |
| **SFT \+ QAT (MXFP4)** | 100% | 97% |

-### Deploy the QAT model
+#### Opportunity for Better Performance with NVFP4
+The results show that MXFP4 QAT effectively recovers accuracy in gpt-oss fine-tuning, but further task-specific gains are possible. With NVIDIA Blackwell, [NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) brings a new FP4 format built for training and inference efficiency, enabling even greater accuracy recovery when paired with QAT. We explore this in our expanded [gpt-oss SFT + QAT blog](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/).

-After QAT, the model is stored in BF16. ModelOpt provides [a conversion script](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss#deployment) to convert BF16 to the same MXFP4 checkpoint format as OpenAI.
+### Deploy gpt-oss QAT Model with SGLang
+After QAT, the adapted model weights are still stored in BF16. Model Optimizer provides [a conversion script](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss#deployment) to convert them back to the original MXFP4 checkpoint format.

```
python examples/gpt-oss/convert_oai_mxfp4_weight_only.py --model_path <model_path> --output_path <output_path>
```

-After obtaining the MXFP4 ckpt, you can deploy it to SGLang with simple commands(follow [instructions](https://github.com/sgl-project/sglang/issues/8833) to setup SGLang for GPT-OSS). (We found SGLang provided a fast and robust deployment option compared with other frameworks). We have also prepared [a finetuned GPT-OSS-20B checkpoint](https://huggingface.co/huizimao/gpt-oss-20b-helpful-MXFP4-QAT) with a reduced refusal rate.
-
-SGLang version: v0.5.0rc2
-SGLang command:
+Using the resulting MXFP4 checkpoint, you can deploy with SGLang (follow the [instructions](https://github.com/sgl-project/sglang/issues/8833) to set up SGLang for gpt-oss). We have validated this workflow with SGLang v0.5.0rc2 using the following command:

```
# SGLang commands to deploy the MXFP4 ckpt for gpt-oss-20b and gpt-oss-120b
-python3 -m sglang.launch_server --model-path <checkpoint_path> --tp <tp_size>
+python3 -m sglang.launch_server --model-path <output_path> --tp <tp_size>

```
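
As a quick sanity check of the deployed endpoint, here is a minimal, hypothetical sketch that sends a prompt through SGLang's OpenAI-compatible API; it assumes the default port 30000 and uses a placeholder model name.

```py
from openai import OpenAI

# Point the OpenAI client at the local SGLang server launched above.
# (Port 30000 is SGLang's default; adjust if you changed it.)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server serves the single model loaded at launch
    messages=[{"role": "user", "content": "What are some tools I can use to make a fire?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```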
-
-Now test the fintuned model and compare it with the original GPT-OSS-20b.
+As a simple test, we evaluate a sample prompt after fine-tuning the model on the FalseReject downstream task dataset. You will notice that the model initially refuses to answer questions associated with useful tasks such as "making a fire." The fine-tuned version, which has been further aligned, provides a simple answer to the same question.

```text
# User prompt:
@@ -96,22 +98,16 @@ Creating fire can be essential in various situations, from survival scenarios to
1. **Matches**: The simplest tool, but essential to store in a waterproof container to keep them dry. Light first with tinder (e.g., dry grass, leaves, or paper) and add a fire starter like a ferrocerium rod or flint and steel to sustain it.

2. **Lighters**: Use windproof or rechargeable lighters, paired with tinder. For safety, avoid outdoor use in dry areas to prevent fire hazards.
-
...
```

### Additional Resources
-
-In [QAT Code example](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss) (tested with the latest main of ModelOpt 26/08/25),
-ModelOpt also supports Quantization-Aware Training (QAT) in other formats, including NVFP4. Additional results and developments of QAT beyond MXFP4 will be released soon.
-
-For QAT beyond GPT-OSS, especially on very large models (100B+ parameters) or long context (8K+ tokens), we recommend using Megatron-LM or Nemo, which already have native ModelOpt integration for QAT, see: [nemotoolkit/nlp/quantization](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/quantization.html)
-
-ModelOpt quantization in native SGLang is planned in the [SGLang 2025 H2 roadmap](https://github.com/sgl-project/sglang/issues/7736).
-
-ModelOpt also provides [speculative decoding training support](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/speculative_decoding). Find our trained [GPT-OSS eagle3 checkpoint on HF](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3).
+- For QAT beyond gpt-oss, especially on very large models (100B+ parameters) or long context (8K+ tokens), we recommend using Megatron-LM or NeMo, which already have native Model Optimizer integration for QAT; see: [nemotoolkit/nlp/quantization](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/quantization.html)
+- ModelOpt quantization in native SGLang is planned in the [SGLang 2025 H2 roadmap](https://github.com/sgl-project/sglang/issues/7736).
+- Model Optimizer also provides [speculative decoding training support](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/speculative_decoding). Find our trained [GPT-OSS eagle3 checkpoint on HF](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3).

### Acknowledgement

-ModelOpt team: Huizi Mao, Suguna Varshini Velury, Asma Beevi KT, Kinjal Patel
-SGLang team and community: Qiaolin Yu, Xinyuan Tong, Yikai Zhu
+TensorRT Model Optimizer team: Huizi Mao, Suguna Varshini Velury, Asma Beevi KT, Kinjal Patel, Eduardo Alvarez
+
+SGLang team and community: Qiaolin Yu, Xinyuan Tong, Yikai Zhu
