> *A journey of a thousand miles is made one small step at a time.*

Today, we are releasing Miles, an enterprise-grade reinforcement learning framework tailored for large-scale MoE training and production workloads.

Miles is built on top of slime, the lightweight RL framework that has quietly powered many of today’s post-training pipelines and large MoE runs (including GLM-4.6). While slime proved that lightweight design works, Miles takes the next step: delivering the reliability, scale, and control needed for real-world enterprise deployments.

Miles is available on GitHub [here](https://github.com/radixark/miles).

## Why Miles?
Every mile of progress begins with one well-placed step - for us, that step is slime. As a lightweight and customizable RL framework, slime has been gaining popularity across the community. It has also been battle-tested in large MoE training, where it was used to train GLM-4.6. slime follows a few elegant design principles:

### Out-of-the-Box Performance

We provide native, structured support for SGLang and Megatron's full optimization stack, keeping pace with the rapid evolution of inference and training frameworks.

### Modular Design

Key components—Algorithm, Data, Rollout, and Eval—are fully decoupled. You can plug in new agent types, reward functions, or sampling strategies with minimal code changes.
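
To make the plug-in point concrete, here is a minimal, hypothetical sketch of a reward-function registry in the spirit of this modular design. It is not the actual slime/Miles API; `register_reward`, `REWARD_FUNCTIONS`, and the toy `length_penalty` reward are illustrative names only.

```python
# Hypothetical sketch (not the actual slime/Miles API): a minimal reward-function
# registry illustrating the kind of plug-in point a decoupled design enables.
from typing import Callable, Dict

REWARD_FUNCTIONS: Dict[str, Callable[[str, str], float]] = {}

def register_reward(name: str):
    """Register a custom reward function of (prompt, response) -> float under a name."""
    def decorator(fn: Callable[[str, str], float]) -> Callable[[str, str], float]:
        REWARD_FUNCTIONS[name] = fn
        return fn
    return decorator

@register_reward("length_penalty")
def length_penalty(prompt: str, response: str) -> float:
    """Toy reward that prefers concise responses (illustration only)."""
    return max(0.0, 1.0 - len(response) / 2048)

print(REWARD_FUNCTIONS["length_penalty"]("What is 2 + 2?", "4"))  # close to 1.0
```

The point of the design is that a new reward, agent type, or sampling strategy is a single registration like the one above, rather than a change to the training loop itself.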

### Built for Researchers

Every abstraction is readable and hackable. Algorithm researchers can modify importance sampling, rollout logic, or loss dynamics without digging into low-level code. We also provide inference-only and training-only debugging modes for fast diagnosis.
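
As an example of the kind of logic a researcher might want to hack on, here is a generic, self-contained sketch of a clipped PPO-style policy loss exposing its importance-sampling ratio. It is not Miles' actual loss code; the function name, tensor shapes, and the `clip_eps` default are assumptions for illustration.

```python
# Generic, self-contained sketch (not Miles' actual loss code): a clipped PPO-style
# policy loss, exposing the importance-sampling ratio a researcher might modify.
import torch

def ppo_policy_loss(logprobs_new: torch.Tensor,   # log-probs under the current policy
                    logprobs_old: torch.Tensor,   # log-probs recorded at rollout time
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs_new - logprobs_old)          # importance-sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))       # pessimistic surrogate, negated

loss = ppo_policy_loss(torch.tensor([-1.0, -0.5]),
                       torch.tensor([-1.1, -0.6]),
                       torch.tensor([0.3, -0.2]))
print(loss)
```

Swapping the clipping rule or reweighting the ratio is the kind of change that should not require touching the training backend.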

### Community-Driven

slime evolved through real-world feedback from the LMSYS and SGLang communities, embodying what open collaboration across research and engineering can achieve.

## What's New?

Miles builds on slime but focuses on new hardware (e.g., GB300), large-scale MoE RL, and production-grade stability. Recent additions include (most of which we've also upstreamed to slime):
### True On-Policy

Beyond deterministic inference (bitwise identical results), we now support [true on-policy](https://github.com/THUDM/slime/tree/main/examples/true_on_policy) via an infrastructure approach.

- We've eliminated the mismatch between training and inference: the KL divergence between them is exactly zero.
- This uses Flash Attention 3, DeepGEMM, batch invariant kernels from Thinking Machines Lab, and `torch.compile`. We also aligned numeric operations between training and inference, as the check sketched below illustrates.
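
As a rough illustration of what the zero-mismatch guarantee means in practice, here is a hypothetical check comparing per-token log-probs from the rollout engine with those recomputed by the trainer. This is not Miles' internal verification code; the function and tensor layout are assumptions, and the KL value is a simple sampled-token estimate.

```python
# Hypothetical check (not Miles' internal code): the rollout engine and the trainer
# should produce bitwise-identical per-token log-probs under true on-policy training.
import torch

def check_true_on_policy(rollout_logprobs: torch.Tensor,
                         trainer_logprobs: torch.Tensor) -> float:
    assert torch.equal(rollout_logprobs, trainer_logprobs), "training/inference mismatch"
    # Simple sampled-token KL estimate; exactly 0.0 when the tensors match bitwise.
    return (trainer_logprobs - rollout_logprobs).mean().item()

dummy_logprobs = torch.log(torch.tensor([0.7, 0.2, 0.1]))
print(check_true_on_policy(dummy_logprobs, dummy_logprobs.clone()))  # -> 0.0
```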

### Memory Optimizations

To maximize performance without hitting OOM errors, we've made several updates (a generic sketch follows the list):
- Added error propagation to avoid crashes on benign OOMs.
- Implemented memory margins to fix NCCL-related OOMs.
- Fixed excessive memory usage in FSDP.
- Added support for move-based and partial offloading, plus host peak memory savings.
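
Below is a generic PyTorch sketch of the memory-margin and benign-OOM ideas, assuming a CUDA device. It is not Miles' actual memory manager; the 2 GiB margin and the helper names are placeholders.

```python
# Generic PyTorch sketch (not Miles' actual memory manager), assuming a CUDA device.
import torch

# Placeholder value: keep ~2 GiB free, e.g., as headroom for NCCL communication buffers.
MEMORY_MARGIN_BYTES = 2 * 1024**3

def has_headroom(margin: int = MEMORY_MARGIN_BYTES) -> bool:
    """Return True if free GPU memory exceeds the configured margin."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes > margin

def train_step_with_oom_guard(step_fn, batch):
    """Run one step; treat an OOM on a single micro-batch as benign and skip it."""
    if not has_headroom():
        torch.cuda.empty_cache()
    try:
        return step_fn(batch)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # free cached blocks instead of crashing the whole run
        return None

if torch.cuda.is_available():
    out = train_step_with_oom_guard(lambda b: b.sum(), torch.ones(4, device="cuda"))
    print(out)
```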

### Speculative Decoding with Online Draft Model Training

Freezing the draft model in RL prevents it from following the target model's policy, which reduces accept length and speedup. We now perform [online SFT on the draft model](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) throughout RL.

- Achieves 25%+ rollout speedup vs. a frozen MTP, especially in late training stages.
- Supports MTP with sequence packing + CP, loss masks with edge-case handling, LM head/embedding gradient isolation, and Megatron↔SGLang weight syncing. A toy sketch of the online draft-model SFT loop follows below.
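
Here is a toy sketch of the online draft-model SFT idea, using tiny linear layers as stand-ins for the target and draft models. It is not the actual MTP/EAGLE training code in Miles; the shapes, optimizer settings, and sampling loop are illustrative only.

```python
# Toy sketch of online draft-model SFT during RL (not Miles' actual MTP/EAGLE code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 100, 32
target_head = nn.Linear(HIDDEN, VOCAB)  # stand-in for the target model's LM head
draft_head = nn.Linear(HIDDEN, VOCAB)   # stand-in for the small draft (speculative) model
opt = torch.optim.AdamW(draft_head.parameters(), lr=1e-3)

for rl_step in range(10):
    # 1) Rollout: the target policy emits tokens (toy hidden states + sampled tokens).
    hidden = torch.randn(64, HIDDEN)
    with torch.no_grad():
        tokens = torch.distributions.Categorical(logits=target_head(hidden)).sample()

    # 2) Online SFT: fit the draft model to the freshly sampled tokens so that
    #    its accept length tracks the evolving target policy instead of degrading.
    loss = F.cross_entropy(draft_head(hidden), tokens)
    opt.zero_grad()
    loss.backward()
    opt.step()
```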

### Other Improvements

We've enhanced the FSDP training backend, allowed independent deployment of the rollout subsystem, and added more debug utilities (metrics, post-hoc analyzers, better profilers). We also included a formal mathematics (Lean) example with [SFT/RL scripts](https://github.com/radixark/miles/tree/main/examples/formal_math/single_round).

## Roadmap

We are committed to supporting enterprise-grade RL training. Upcoming efforts include:

- Large-scale MoE RL examples on new hardware (e.g., GB300).
- Multi-modal training.
- Rollout accelerations:
  - Compatibility with SGLang spec v2 for better performance.
  - Advanced speculative training support, such as EAGLE3 and multi-spec layers.
- Better resource allocation for balanced training & serving in large-scale async training.
- Elasticity to GPU failures.

## Acknowledgment

Miles wouldn't exist without the slime authors and the broader SGLang RL community.

We invite researchers, startups, and enterprise teams to explore slime and Miles—pick the one that fits your needs—and join us in making reinforcement learning efficient and reliable. We're listening to the community and actively working on Miles to build a production-ready training environment.