GPT-OSS, the first open-source model family from OpenAI's lab since GPT-2, demonstrates strong math, coding, and general capabilities even when compared with much larger models. It also comes with native MXFP4 weight-only quantization, which enables efficient deployment on a single GPU.
OpenAI recently released gpt-oss, the first open-source model family from OpenAI's lab since GPT-2. These models demonstrate strong math, coding, and general capabilities. Part of the model's uniqueness is that it was released with native MXFP4 weight-only quantization, which allows it to be deployed on hardware with less memory while also benefiting from the inference performance advantages of FP4. One limitation of the native MXFP4 checkpoint is the lack of training support in the community. Many use cases require fine-tuning LLMs to modify their behavior (e.g., reasoning in different languages, adjusting safety alignment) or to enhance domain-specific capabilities (e.g., function calling, SQL scripting). Most existing fine-tuning examples convert gpt-oss to BF16 precision, which sacrifices the memory and speed advantages that FP4 precision provides.
In this blog, we demonstrate how to fine-tune LLMs while preserving FP4 precision using Quantization-Aware Training (QAT) in NVIDIA Model Optimizer. We then show how to deploy the resulting model with SGLang. Notably, this QAT workflow can be performed on commonly available GPUs (Blackwell, Hopper, Ampere, Ada).
### What is Quantization-Aware Training (QAT)?
QAT is a training technique to recover model accuracy lost to quantization (see the simple illustration below). The key idea of QAT is to preserve high-precision weights for gradient accumulation while simulating the effects of quantization during the forward pass. By exposing the original model weights to the effects of quantization, the model can more accurately adapt to the representable ranges of the target data type.
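To make the forward and backward behavior concrete, here is a minimal, self-contained sketch of a fake-quantize operation with a straight-through gradient in PyTorch. This is illustrative only: it uses a plain 4-bit integer grid with a single scale rather than the actual MXFP4 block format, and it is not how Model Optimizer implements its quantizers.

```py
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate quantization in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, weight, scale):
        # Snap weights to a small set of representable values (here: a 4-bit integer
        # grid), then map back to the original domain. The stored master weights
        # themselves are never overwritten.
        return torch.clamp(torch.round(weight / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the quantizer as identity for gradients,
        # so updates accumulate in the high-precision master weights.
        return grad_output, None

# Usage: only the forward computation sees quantized values.
weight = torch.randn(16, 16, requires_grad=True)
out = FakeQuant.apply(weight, torch.tensor(0.05))
out.sum().backward()
print(weight.grad.shape)  # gradients flow back to the high-precision weights
```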
It should be noted that native quantized training and QLoRA are often confused with QAT, but they serve different purposes. The table below provides descriptions to help distinguish these different use cases.
| Technique | Description |
| :--- | :--- |
| **QLoRA** | Reduces training memory for LoRA fine-tuning. At inference, it either keeps quantized weights and LoRA separate or merges LoRA into high-precision weights. |
| **Native quantized training** | Enables efficient training and inference. Requires native hardware support. |
| **QAT** | Improves quantized inference accuracy. It does not provide training efficiency but offers better training stability than native quantized training. |
### QAT Fine-tuning Recipe for gpt-oss
The QAT fine-tuning recipe is straightforward and can be completed in a few steps:
- **Step 1 (Optional)**: Fine-tune the model in the original precision. This establishes a good starting point before QAT.
- **Step 2**: Insert quantizer nodes into the model graph. The quantizer nodes perform fake quantization during the forward pass and pass the gradient straight through during the backward pass. This step is handled by Model Optimizer.
- **Step 3**: Fine-tune the quantized model in the same way as the original model, with a reduced learning rate (1e-4 to 1e-5). The fine-tuned model stays in high precision but adapts to the quantization during this step.
- **Step 4**: Export the QAT model to a quantized checkpoint and deploy.
### QAT with NVIDIA Model Optimizer
Here is the sample code to perform QAT with Model Optimizer. For full code examples, please refer to Model Optimizer's [gpt-oss QAT examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss).
```py
import modelopt.torch.quantization as mtq
# The quantization config selection is elided in this excerpt; pick an MXFP4
# weight-only config from ModelOpt (see the linked example), then insert quantizers:
model = mtq.quantize(model, config, forward_loop=None)
```
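After the quantizers are inserted (Step 2), Step 3 is ordinary fine-tuning at a reduced learning rate. The snippet below is a minimal sketch of what that can look like with a Hugging Face `Trainer`; the quantization config (`config`) and training dataset (`train_dataset`) are placeholders rather than the exact setup from the linked example.

```py
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Step 2: load the model and insert fake-quantizer nodes.
# `config` stands in for an MXFP4 weight-only quantization config from ModelOpt.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype=torch.bfloat16)
model = mtq.quantize(model, config, forward_loop=None)

# Step 3: fine-tune as usual, just with a reduced learning rate (1e-4 to 1e-5).
args = TrainingArguments(output_dir="gpt-oss-20b-qat", learning_rate=1e-5, bf16=True)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset: your SFT data
trainer.train()
```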
We demonstrate two sample fine-tuning use cases for gpt-oss: enabling non-English reasoning with the [Multi-lingual dataset from OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers) and reducing over-refusal of safe user prompts with the [Amazon FalseReject dataset](https://huggingface.co/datasets/AmazonScience/FalseReject). Out of the box, gpt-oss shows room for improvement on these tasks.
The table below summarizes gpt-oss-20b performance on these two datasets after fine-tuning. SFT alone provides good accuracy but results in a high-precision model. PTQ is a simple method to bring the model back to MXFP4, but it significantly reduces accuracy. QAT achieves high accuracy in both tasks while preserving MXFP4 precision for fast inference.
| gpt-oss-20b | Pass rate on Multi-Lingual val subset | Pass rate on FalseReject val subset |
| :---: | :---: | :---: |
| **SFT + PTQ (MXFP4)** | 89% | 59% |
| **SFT + QAT (MXFP4)** | 100% | 97% |
#### Opportunity for Better Performance with NVFP4
The results show that MXFP4 QAT effectively recovers accuracy in gpt-oss fine-tuning, but further task-specific gains are possible. With NVIDIA Blackwell, [NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/) brings a new FP4 format built for training and inference efficiency, enabling even greater accuracy recovery when paired with QAT. We explore this in our expanded [gpt-oss SFT + QAT blog](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/).
### Deploy gpt-oss QAT Model with SGLang
After QAT, the model weights are still stored in (adapted) BF16. Model Optimizer provides [a conversion script](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/gpt-oss#deployment) to convert them back to the original MXFP4 checkpoint format.
Using the resulting MXFP4 checkpoint, you can deploy with SGLang (follow the [instructions](https://github.com/sgl-project/sglang/issues/8833) to set up SGLang for gpt-oss). We found SGLang to be a fast and robust deployment option compared with other frameworks, and we have also prepared [a fine-tuned gpt-oss-20b checkpoint](https://huggingface.co/huizimao/gpt-oss-20b-helpful-MXFP4-QAT) with a reduced refusal rate. We validated this workflow with SGLang v0.5.0rc2 using the following commands:
```
# SGLang commands to deploy the MXFP4 ckpt for gpt-oss-20b and gpt-oss-120b
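# NOTE: illustrative commands only -- exact flags may vary with your SGLang version,
# and <path-to-mxfp4-checkpoint> is a placeholder for the converted checkpoint.
python3 -m sglang.launch_server --model-path <path-to-mxfp4-checkpoint> --tp 1   # gpt-oss-20b
python3 -m sglang.launch_server --model-path <path-to-mxfp4-checkpoint> --tp 4   # gpt-oss-120b
```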
As a simple test, we evaluate a sample prompt after fine-tuning the model on the FalseReject downstream task. Out of the box, the model refuses to answer questions associated with useful tasks such as "making a fire." The fine-tuned version, which has been further aligned, provides a straightforward answer to the same question.
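For instance, you can send a request to the running server through its OpenAI-compatible endpoint. This is a sketch only: it assumes SGLang's default port (30000), an illustrative prompt, and that the served model name matches the checkpoint; adjust these for your setup.

```py
from openai import OpenAI

# SGLang exposes an OpenAI-compatible API; 30000 is its default port.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder: use the model name reported by the server
    messages=[{"role": "user", "content": "How can I safely start a campfire?"}],
)
print(response.choices[0].message.content)
```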
```text
# User prompt:
...

Creating fire can be essential in various situations, from survival scenarios to ...
1. **Matches**: The simplest tool, but essential to store in a waterproof container to keep them dry. Light first with tinder (e.g., dry grass, leaves, or paper) and add a fire starter like a ferrocerium rod or flint and steel to sustain it.
2. **Lighters**: Use windproof or rechargeable lighters, paired with tinder. For safety, avoid outdoor use in dry areas to prevent fire hazards.
...
```
### Additional Resources
- For QAT beyond gpt-oss, especially on very large models (100B+ parameters) or long context (8K+ tokens), we recommend using Megatron-LM or NeMo, which already have native Model Optimizer integration for QAT; see [nemotoolkit/nlp/quantization](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/quantization.html).
- Model Optimizer quantization support in native SGLang is planned in the [SGLang 2025 H2 roadmap](https://github.com/sgl-project/sglang/issues/7736).
- Model Optimizer also provides [speculative decoding training support](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/speculative_decoding). Find our trained [GPT-OSS eagle3 checkpoint on HF](https://huggingface.co/nvidia/gpt-oss-120b-Eagle3).