
Commit 7ec7e63

Revise blog on FP8-based RL training and sampling (#257)
Updated the blog post to enhance clarity and structure, emphasizing key points about training stability and acceleration in RL.
1 parent e85c238

1 file changed: +2, -2 lines


blog/2025-11-19-fp8-rl.md

Lines changed: 2 additions & 2 deletions
@@ -9,9 +9,9 @@ previewImg: /images/blog/fp8-rl/3_Megatron.png
 
 SGLang RL Team and the slime community have conducted some interesting explorations around RL training stability and acceleration:
 
-1. In terms of **training stability**, by [aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy), we achieve **strictly zero KL divergence** between the rollout and training processes on dense models, reaching perfect train–inference consistency.
+1. [Aligning the SGLang and FSDP backends](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) for **strictly zero KL divergence**
 
-2. In terms of **training acceleration**, we introduce [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) into the RL sampling pipeline, which can significantly speed up sampling under suitable configurations.
+2. [**Speculative Decoding**](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) with online SFT for the draft model
 
 Building on this, we now share a new progress that balances both stability and performance—**implementing an end-to-end FP8 pipeline for RL training and sampling**. FP8 RL training for Qwen3-4B and Qwen3-30B-A3B has been [fully supported in slime](https://github.com/THUDM/slime/tree/main/examples/low_precision) and is ready to use out of the box.
 
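For context on what the FP8 pipeline means numerically, the sketch below shows per-tensor e4m3 quantization, the basic scaling step that FP8 training and sampling build on. It is a minimal illustration under stated assumptions, not slime's or SGLang's implementation; the function names and the use of `torch.float8_e4m3fn` (available in PyTorch 2.1+) are assumptions made for this example.

```python
# Minimal sketch: per-tensor FP8 (e4m3) quantization round-trip.
# NOT slime's/SGLang's code; function names and dtype choice are illustrative.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3


def quantize_fp8(x: torch.Tensor):
    """Scale a bf16/fp32 tensor into the e4m3 range and cast to FP8."""
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate high-precision tensor for accumulation."""
    return x_fp8.to(torch.float32) * scale


if __name__ == "__main__":
    w = torch.randn(1024, 1024)
    w_fp8, s = quantize_fp8(w)
    err = (dequantize_fp8(w_fp8, s) - w).abs().max().item()
    print(f"max abs round-trip error: {err:.4f}")
```

In an end-to-end pipeline, scaled casts of this kind are typically applied to weights and activations on both the training and sampling sides, with each FP8 tensor's scale carried alongside it so that matrix multiplications can accumulate in higher precision; the slime low-precision examples linked in the diff above provide the ready-to-use setup.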
