
Commit 8fdf731

Update miles blog in speculative decoding (#253)
* upd miles blog
* Online SFT for MTP of Speculative Decoding
* upd

Co-authored-by: zhaochenyang20 <[email protected]>
1 parent 1ddc11c commit 8fdf731

File tree

1 file changed (+40, -38 lines)

blog/2025-11-19-miles.md

@@ -7,79 +7,81 @@ previewImg: /images/blog/miles/miles.jpg
> *A journey of a thousand miles is made one small step at a time.*

Today, we are releasing Miles, an enterprise-grade reinforcement learning framework tailored for large-scale MoE training and production workloads.
Miles is built on top of slime, the lightweight RL framework that has quietly powered many of today's post-training pipelines and large MoE runs (including GLM-4.6). While slime proved that lightweight design works, Miles takes the next step: delivering the reliability, scale, and control needed for real-world enterprise deployments.

GitHub: [radixark/miles](https://github.com/radixark/miles).
## Why Miles?

Every mile of progress begins with one well-placed step, and for us that step is slime. A lightweight, customizable RL framework, slime has grown popular across the community and has been battle-tested in large MoE training, where it was used to train GLM-4.6. slime follows a few elegant design principles:
1919

20-
### Native to be performant
20+
### Open-to-Use Performance
2121

22-
Native, structured support of SGLang and Megatron's full optimization stack. Keeping pace with the fast evolution of inference and training frameworks.
22+
We provide native, structured support for SGLang and Megatron's full optimization stack, keeping pace with the rapid evolution of inference and training frameworks.
2323

24-
### Clear, clean modularity
24+
### Modular Design
2525

26-
Its key components—Algorithm / Data / Rollout / Eval—are fully decoupled, letting users plug in new agent types, reward functions, or sampling strategies with minimal change of lines.
26+
Key components—Algorithm, Data, Rollout, and Eval—are fully decoupled. You can plug in new agent types, reward functions, or sampling strategies with minimal code changes.
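To illustrate the kind of decoupling this enables, here is a minimal sketch of a plug-in reward registry. The names (`REWARD_FNS`, `register_reward`, `get_reward`) are illustrative, not the actual slime or Miles API.

```python
# Hypothetical sketch of a decoupled reward-function plug-in point.
from typing import Callable, Dict

REWARD_FNS: Dict[str, Callable[[str, str], float]] = {}

def register_reward(name: str):
    """Decorator that registers a reward function under a string key."""
    def deco(fn: Callable[[str, str], float]):
        REWARD_FNS[name] = fn
        return fn
    return deco

@register_reward("exact_match")
def exact_match(response: str, reference: str) -> float:
    # 1.0 if the rollout response matches the reference answer exactly.
    return float(response.strip() == reference.strip())

def get_reward(name: str) -> Callable[[str, str], float]:
    """Look up a registered reward function by name."""
    return REWARD_FNS[name]
```

With a registry like this, swapping in a new reward only touches the reward module, leaving rollout and training code unchanged.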
2727

28-
### Model scientist-friendly
28+
### Built for Researchers
2929

30-
Every abstraction is readable and designed to be hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without touching low-level code. Inference-only and training-only debugging are provided for fast diagnosis of failing runs.
30+
Every abstraction is readable and hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without digging into low-level code. We also provide inference-only and training-only debugging modes for fast diagnosis.
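As a taste of what "modify importance sampling" means in practice, here is an illustrative per-token PPO-style clipped objective in plain Python. This is a generic textbook formulation, not slime's actual loss code.

```python
# Illustrative sketch of the importance-sampling logic a researcher might tweak:
# a PPO-style clipped surrogate objective for a single token.
import math

def ppo_token_loss(new_logprob: float, old_logprob: float,
                   advantage: float, clip_eps: float = 0.2) -> float:
    """Negative clipped surrogate objective for one token."""
    ratio = math.exp(new_logprob - old_logprob)            # importance ratio
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # PPO takes the pessimistic (minimum) of the two surrogates.
    return -min(ratio * advantage, clipped * advantage)
```

Because the ratio, the clip, and the min are all visible in a few lines, changing the clipping rule or the ratio definition is a local edit rather than a framework change.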
3131

32-
### Community-first
32+
### Community-Driven
3333

34-
slime evolved through real-world feedback from the LMSYS and SGLang communities. It embodies what open collaboration across research and engineering can achieve.
34+
slime evolved through real-world feedback from the LMSYS and SGLang communities, embodying what open collaboration across research and engineering can achieve.
3535

36-
## ⚙️ Momentum On the Way: What is Recently Implemented
36+
## What's New?
3737

38-
Miles builds on slime but focuses on new hardware (e.g. GB300), large-scale MoE RL, and production-grade stability. The following features have been recently added (we have also upstreamed most of them to slime):
38+
Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. Recent additions include (most of which we've also upstreamed to slime):
### True On-Policy

Beyond deterministic inference (bitwise identical results), we now support [true on-policy](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) via an infrastructure approach.

- The mismatch between training and inference is eliminated, yielding exactly zero KL divergence.
- This relies on Flash Attention 3, DeepGEMM, batch-invariant kernels from Thinking Machines Lab, and `torch.compile`. We also aligned numeric operation details between training and inference.
<img src="https://raw.githubusercontent.com/THUDM/slime/refs/heads/main/examples/true_on_policy/src/train_rollout_abs_diff.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 60%" />
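A minimal sketch of how one could check the train/rollout match: compare the token distributions emitted by the training and inference engines and compute the KL divergence, which should be exactly zero under true on-policy. The numbers below are toy values, not real engine outputs.

```python
# Toy check of train/rollout consistency via KL divergence.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# With bitwise-identical kernels, both engines emit the same distribution:
train_probs = [0.7, 0.2, 0.1]
rollout_probs = [0.7, 0.2, 0.1]
assert kl_divergence(train_probs, rollout_probs) == 0.0
```

Any numeric divergence between the two stacks would show up here as a strictly positive KL, which is what the plot above measures in absolute-difference form.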
### Memory Improvements

To maximize performance without hitting OOM errors, we've made several updates:

- Added error propagation to avoid crashes on benign OOMs.
- Implemented memory margins to fix NCCL-related OOMs.
- Fixed excessive memory usage in FSDP.
- Added support for move-based and partial offloading, plus host peak memory savings.
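The "memory margin + partial offloading" idea can be sketched as a small planner: keep offloading the largest tensors to host memory until projected GPU usage fits under the device limit minus a safety margin reserved for NCCL buffers. The function name and numbers are illustrative, not the Miles implementation.

```python
# Hypothetical offload planner: fit resident tensors under (limit - margin).
def plan_offload(tensor_bytes: dict[str, int], gpu_limit: int,
                 nccl_margin: int) -> set[str]:
    """Return the set of tensor names to move to host memory."""
    budget = gpu_limit - nccl_margin            # usable GPU budget
    resident = sum(tensor_bytes.values())       # bytes currently on GPU
    offloaded: set[str] = set()
    # Offload largest-first until projected usage fits under the budget.
    for name, size in sorted(tensor_bytes.items(), key=lambda kv: -kv[1]):
        if resident <= budget:
            break
        offloaded.add(name)
        resident -= size
    return offloaded
```

Reserving the margin up front is what prevents NCCL's own allocations from tipping an otherwise-fitting step into OOM.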
### Speculative Decoding with Online Draft Model Training

Freezing the draft model in RL prevents it from following the target model's policy, which reduces accept length and speedup. We now perform [online SFT on the draft model](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) throughout RL.

- Achieves 25%+ rollout speedup vs. a frozen MTP, especially in late training stages.
- Supports MTP with sequence packing + CP, loss masks with edge-case handling, LM head/embedding gradient isolation, and Megatron↔SGLang weight syncing.
<img src="https://raw.githubusercontent.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/refs/heads/main/rlhf/slime/spec/pic/overall-throughput.png" style="display:block; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 60%" />
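To make "accept length" concrete, here is a toy sketch of greedy speculative verification: the target model accepts draft tokens until the first disagreement, and the length of that agreed prefix is what online SFT of the draft model aims to keep high. Token IDs here are made up.

```python
# Toy accept-length computation for greedy speculative decoding verification.
def accept_length(draft_tokens: list[int], target_tokens: list[int]) -> int:
    """Number of leading draft tokens the target model agrees with."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

# A well-aligned draft model keeps the matched prefix long:
assert accept_length([5, 9, 3, 7], [5, 9, 3, 2]) == 3
```

When the target policy drifts during RL, a frozen draft model disagrees earlier, shrinking this prefix and with it the speedup; online SFT keeps the two in step.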
### Other Improvements

We've enhanced the FSDP training backend, allowed independent deployment of the rollout subsystem, and added more debug utilities (metrics, post-hoc analyzers, better profilers). We've also included a formal mathematics (Lean) example with [SFT/RL scripts](https://github.com/radixark/miles/tree/main/examples/formal_math/single_round).
## Roadmap

We are committed to supporting enterprise-grade RL training. Upcoming efforts include:

- Large-scale MoE RL examples on new hardware (e.g., GB300).
- Multi-modal training support.
- Rollout accelerations:
  - Compatibility with SGLang spec v2.
  - Advanced speculative decoding (e.g., EAGLE3, multi-spec layers).
- Better resource allocation for balanced training & serving in large-scale async training.
- Elasticity to GPU failures.
## Acknowledgments

Miles wouldn't exist without the slime authors and the broader SGLang RL community.

We invite researchers, startups, and enterprise teams to explore slime and Miles (pick the one that fits your needs) and join us in making reinforcement learning efficient and reliable. We're listening to the community and actively working on Miles to build a production-ready training environment.
