
Commit 7ec7e63

Revise blog on FP8-based RL training and sampling (#257)
Updated the blog post to enhance clarity and structure, emphasizing key points about training stability and acceleration in RL.
1 parent e85c238

1 file changed: +2, -2 lines


blog/2025-11-19-fp8-rl.md

Lines changed: 2 additions & 2 deletions
@@ -9,9 +9,9 @@ previewImg: /images/blog/fp8-rl/3_Megatron.png
 
 SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:
 
-1. In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.
+1. [Aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) for **strictly zero KL divergence**
 
-2. In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.
+2. [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) with online SFT for the draft model
 
 Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.
 
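For context on what the FP8 pipeline means numerically, the sketch below shows per-tensor e4m3 quantization, the basic scaling step that FP8 training and sampling build on. It is a minimal illustration under stated assumptions, not slime's or SGLang's implementation; the function names and the use of `torch.float8_e4m3fn` (available in PyTorch 2.1+) are assumptions made for this example.

```python
# Minimal sketch: per-tensor FP8 (e4m3) quantization round-trip.
# NOT slime's/SGLang's code; function names and dtype choice are illustrative.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3


def quantize_fp8(x: torch.Tensor):
    """Scale a bf16/fp32 tensor into the e4m3 range and cast to FP8."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate high-precision tensor for accumulation."""
    return x_fp8.to(torch.float32) * scale


if __name__ == "__main__":
    w = torch.randn(1024, 1024)
    w_fp8, s = quantize_fp8(w)
    err = (dequantize_fp8(w_fp8, s) - w).abs().max().item()
    print(f"max abs round-trip error: {err:.4f}")
```

In an end-to-end pipeline, scaled casts of this kind are typically applied to weights and activations on both the training and sampling sides, with each FP8 tensor's scale carried alongside it so that matrix multiplications can accumulate in higher precision; the slime low-precision examples linked in the diff above provide the ready-to-use setup.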
