
Commit f3df3c8

update with miles (#261)
Co-authored-by: zhaochenyang20 <[email protected]>
1 parent 68ec7e0 commit f3df3c8

1 file changed: 10 additions & 11 deletions
@@ -1,21 +1,21 @@
 ---
 title: 'Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL'
-author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, miles Team, slime Team"
-date: "November 24, 2025"
+author: "InfiXAI Team, Ant Group AQ Team, SGLang RL Team, Miles Team"
+date: "November 25, 2025"
 previewImg: /images/blog/fp8-rl/3_Megatron.png
 ---
 
 > TL;DR: We have implemented fully FP8-based sampling and training in RL. Experiments show that for MoE models, the larger the model, the more severe the train–inference discrepancy becomes when using BF16 training with FP8 rollout. In contrast, using unified FP8 for both training and rollout effectively eliminates train–inference inconsistency caused by quantization error, improving both the speed and stability of RL training.
 
-SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:
+SGLang RL Team and the Miles community have conducted some interesting explorations around RL training stability and acceleration:
 
-[Aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) for **strictly zero KL divergence**
+[Aligning the SGLang and FSDP backends](https://github.com/radixark/miles/tree/main/examples/true_on_policy) for **strictly zero KL divergence**
 
 [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) with online SFT for the draft model
 
-Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.
+Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in miles](https://github.com/radixark/miles/tree/main/examples/low_precision) and is ready to use out of the box.
 
-This work is jointly completed by the **InfiXAI Team, Ant Group AQ Team, SGLang RL Team, and slime Team**. Special thanks to **DataCrunch** for compute sponsorship and to **NVIDIA** for technical support on Transformer Engine (TE).
+This work is jointly completed by the **InfiXAI Team, Ant Group AQ Team, SGLang RL Team, and Miles Team**. Special thanks to **DataCrunch** for compute sponsorship and to **NVIDIA** for technical support on Transformer Engine (TE).
 
 ## Hardware Foundations of FP8 Training
 
@@ -158,7 +158,7 @@ Besides algorithmic challenges, there is room for improvement in how Megatron-Co
 
 ## **FP8 + RL: Attributing Abnormal KL Loss**
 
-The **InfiXAI Team** has already successfully run full FP8 training on **pre-training and fine-tuning tasks** (see [Pre-training and Fine-tuning](https://arxiv.org/html/2509.22536v4)). Building on this, we apply FP8 training to RL. Thanks to slime’s good support for Megatron FP8 training, we were able to run a series of FP8 RL experiments smoothly.
+The **InfiXAI Team** has already successfully run full FP8 training on **pre-training and fine-tuning tasks** (see [Pre-training and Fine-tuning](https://arxiv.org/html/2509.22536v4)). Building on this, we apply FP8 training to RL. Thanks to Miles' good support for Megatron FP8 training, we were able to run a series of FP8 RL experiments smoothly.
 
 ### **Abnormal Initial KL Loss**
 
@@ -339,7 +339,6 @@ Thank you for reading. We see several directions worth further exploration:
 
 1. InfiXAI Team: Congkai Xie, Mingfa Feng, Shuo Cai
 2. Ant Group AQ Team: Yanan Gao, Zhiling Ye, Hansong Xiao
-3. SGLang RL Team: JiLi, Yefei Chen, Xi Chen
-4. miles Team: Chenyang Zhao
-5. slime Team: Zilin Zhu
-6. NVIDIA: Juan Yu, NeMo-RL Team
+3. SGLang RL Team: JiLi, Yefei Chen, Xi Chen, Zilin Zhu
+4. Miles Team: Chenyang Zhao
+5. NVIDIA: Juan Yu, NeMo-RL Team
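Reviewer note: the TL;DR in this diff attributes RL instability to a train–inference discrepancy introduced by quantization. As background, the sketch below shows one common way to quantify that discrepancy from per-token log-probabilities. It is not taken from this commit or from the miles/slime codebase; the function and tensor names (`train_infer_mismatch`, `train_logprobs`, `rollout_logprobs`, `mask`) are illustrative assumptions.

```python
# Hypothetical sketch (not from this commit): measure train-inference mismatch
# from per-token log-probs of the sampled tokens.
#   train_logprobs   - log p_train(token)   from the training engine (e.g. Megatron)
#   rollout_logprobs - log p_rollout(token) from the rollout engine (e.g. SGLang)
#   mask             - 1 for valid response tokens, 0 for padding
import torch

def train_infer_mismatch(train_logprobs: torch.Tensor,
                         rollout_logprobs: torch.Tensor,
                         mask: torch.Tensor) -> dict:
    mask = mask.float()
    denom = mask.sum().clamp(min=1.0)
    # log of the importance ratio p_train / p_rollout on the sampled tokens
    log_ratio = train_logprobs - rollout_logprobs
    # k3 estimator of KL(rollout || train): E[ratio - log(ratio) - 1] >= 0
    k3 = torch.exp(log_ratio) - log_ratio - 1.0
    return {
        "kl_k3": ((k3 * mask).sum() / denom).item(),
        "mean_abs_logprob_diff": ((log_ratio.abs() * mask).sum() / denom).item(),
    }
```

Under a truly unified numeric format for training and rollout, both metrics should stay near zero at step 0; a value that grows with model size would be consistent with the BF16-training/FP8-rollout mismatch the post describes.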
