---
title: 'Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL'
-author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, miles Team, slime Team"
-date: "November 24, 2025"
+author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, Miles Team"
+date: "November 25, 2025"
previewImg: /images/blog/fp8-rl/3_Megatron.png
---

> TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.

-SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:
+SGLang RL Team and the Miles community have conducted some interesting explorations around RL training stability and acceleration:

-[Aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) for **strictly zero KL divergence**
+[Aligning the SGLang and FSDP backends](https://github.com/radixark/miles/tree/main/examples/true_on_policy) for **strictly zero KL divergence**

[**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) with online SFT for the draft model

-Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.
+Building on this, we now share new progress that balances both stability and performance: **an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B is [fully supported in miles](https://github.com/radixark/miles/tree/main/examples/low_precision) and ready to use out of the box.

-This work is jointly completed by the **InfiXAI Team, Ant Group AQ Team, SGLang RL Team, and slime Team**. Special thanks to **DataCrunch** for compute sponsorship and to **NVIDIA** for technical support on Transformer Engine (TE).
+This work was jointly completed by the **InfiXAI Team, Ant Group AQ Team, SGLang RL Team, and Miles Team**. Special thanks to **DataCrunch** for compute sponsorship and to **NVIDIA** for technical support on Transformer Engine (TE).

## Hardware Foundations of FP8 Training

@@ -158,7 +158,7 @@ Besides algorithmic challenges, there is room for improvement in how Megatron-Co |

## **FP8 + RL: Attributing Abnormal KL Loss**

-The **InfiXAI Team** has already successfully run full FP8 training on **pre-training and fine-tuning tasks** (see [Pre-training and Fine-tuning](https://arxiv.org/html/2509.22536v4)). Building on this, we apply FP8 training to RL. Thanks to slime’s good support for Megatron FP8 training, we were able to run a series of FP8 RL experiments smoothly.
+The **InfiXAI Team** has already successfully run full FP8 training on **pre-training and fine-tuning tasks** (see [Pre-training and Fine-tuning](https://arxiv.org/html/2509.22536v4)). Building on this, we apply FP8 training to RL. Thanks to Miles' good support for Megatron FP8 training, we were able to run a series of FP8 RL experiments smoothly.

### **Abnormal Initial KL Loss**

@@ -339,7 +339,6 @@ Thank you for reading. We see several directions worth further exploration: |

1. InfiXAI Team: Congkai Xie, Mingfa Feng, Shuo Cai
2. Ant Group AQ Team: Yanan Gao, Zhiling Ye, Hansong Xiao
-3. SGLang RL Team: JiLi, Yefei Chen, Xi Chen
-4. miles Team: Chenyang Zhao
-5. slime Team: Zilin Zhu
-6. NVIDIA: Juan Yu, NeMo-RL Team
+3. SGLang RL Team: JiLi, Yefei Chen, Xi Chen, Zilin Zhu
+4. Miles Team: Chenyang Zhao
+5. NVIDIA: Juan Yu, NeMo-RL Team
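To make the TL;DR's "train–inference inconsistency" concrete, here is a minimal sketch of how such a mismatch is typically diagnosed: estimate the per-token KL between the logprobs returned by the rollout engine and the logprobs the trainer recomputes for the same sampled tokens. Everything below is illustrative only; the tensor shapes, the 0.03 noise scale standing in for FP8 quantization error, and the helper name `token_kl` are our own assumptions, not code from the post, miles, or SGLang.

```python
import torch
import torch.nn.functional as F

def token_kl(train_lp: torch.Tensor, rollout_lp: torch.Tensor) -> torch.Tensor:
    """k3-style per-token estimate of KL(rollout || train) on the sampled tokens:
    r - 1 - log r, with r = pi_train(token) / pi_rollout(token)."""
    log_ratio = train_lp - rollout_lp
    return log_ratio.exp() - 1.0 - log_ratio

# Toy stand-in for a real run: perturb the trainer's logits to mimic rollout-side
# quantization error, then compare logprobs of the tokens the "sampler" picked.
torch.manual_seed(0)
logits = torch.randn(2, 6, 128)                             # (batch, seq, vocab) trainer logits
rollout_logits = logits + 0.03 * torch.randn_like(logits)   # hypothetical FP8 rollout error

tokens = torch.distributions.Categorical(logits=rollout_logits).sample()   # (batch, seq)
train_lp = F.log_softmax(logits, -1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
rollout_lp = F.log_softmax(rollout_logits, -1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

print(f"mean initial KL estimate: {token_kl(train_lp, rollout_lp).mean():.6f}")
```

With BF16 training and FP8 rollout, this is the gap that shows up as an abnormal nonzero KL at the very first step, and the TL;DR notes it grows more severe with larger MoE models; a unified FP8 pipeline keeps the same estimate near zero.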