
Commit 4eaee1e

Merge pull request #1092 from lin0303-siyuan/feat/fsdp-doc
Feat: add usage docs for fsdp
2 parents b1daf2b + 12b35ee commit 4eaee1e

File tree

2 files changed: +122, -4 lines changed


docs/en/get_started/usage.md

Lines changed: 62 additions & 2 deletions
@@ -6,7 +6,7 @@
 When using slime, parameters are primarily passed for the following purposes:

 1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
-2. To load Megatron for the training portion.
+2. To load Megatron or FSDP for the training portion.
 3. To load SGLang for the inference portion.
 4. To configure the hyperparameters required for RL training.

@@ -309,4 +309,64 @@ In some customized Megatron implementations, special operations need to be performed
- `--custom-megatron-init-path`: Adds some initialization calls.
- `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
- `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.

## How to Use FSDP

slime also supports FSDP2 as a training backend; see the docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).

> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, with no manual specification needed. Megatron requires manually configuring parameters that describe the model architecture, or inferring it automatically via `--use-hf-config-for-megatron`. FSDP reads everything from `config.json`, which avoids the weight format conversion step entirely.
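
As a quick illustration of the point above, the architecture fields that Megatron takes as flags can all be read from `config.json` with the standard Hugging Face `transformers` API. A minimal sketch, assuming the Qwen3-4B checkpoint downloaded in the Quick Start below:

```bash
# Minimal sketch: read the architecture info FSDP relies on straight from
# config.json via the standard Hugging Face transformers AutoConfig API.
python -c "
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained('/root/Qwen3-4B')
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)
"
```
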
To use FSDP as the training backend, pass `--train-backend fsdp`.

### Parameters

The parameters FSDP supports are listed below in comparison with Megatron; more support is on the way.

| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
| --- | --- | --- | --- |
| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size`, etc.) or `--use-hf-config-for-megatron` | `--hf-checkpoint` (required) | **FSDP**: Directly uses the HuggingFace format; no weight conversion needed, architecture inferred via `AutoConfig` |
| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming soon | |
| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming soon | |
| **Expert Parallel** | `--expert-model-parallel-size` | Coming soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine, etc.) | `--lr-decay-style` | Same parameter |
| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
| **Optimizer Type** | `--optimizer` (adam/sgd, etc.) | `--optimizer` (default: adam) | Basically the same |
| **Distributed Optimizer** | `--use-distributed-optimizer` | Built into FSDP | FSDP uses a distributed optimizer by default |
| **Gradient Checkpointing** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to a boolean switch |
| **CPU Offload** | Implemented via the distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offloads parameters/gradients/optimizer states to CPU |
| **CPU Backend** | Implemented via the distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specifies the CPU backend; a hybrid backend is used when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Passed directly to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically the same |
| **Training Backend** | Default, or `--train-backend megatron` | `--train-backend fsdp` (required) | Used to switch backends |
| **Config** | | `--config` | **FSDP**: Sets additional parameters for the FSDP backend |
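
Putting the table together, a launch might look like the sketch below. Note that `train.py` and the numeric values here are placeholders for illustration, not a verified command; in practice, use the run script from the Quick Start below, which wires up the full set of RL arguments.

```bash
# Hypothetical launch sketch combining the FSDP flags from the table above.
# "train.py" and all values shown are placeholders, not a verified command.
python train.py \
  --train-backend fsdp \
  --hf-checkpoint /root/Qwen3-4B \
  --lr 1e-6 \
  --lr-decay-style cosine \
  --lr-warmup-iters 10 \
  --min-lr 1e-7 \
  --optimizer adam \
  --gradient-checkpointing \
  --attn-implementation flash_attention_2
```
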
### Quick Start

```bash
# If you need to use WANDB, set the environment variable WANDB_API_KEY in advance

# Download model weights (Qwen3-4B)
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Download training dataset (dapo-math-17k)
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Download evaluation dataset (aime-2024)
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

# Clone the code and install dependencies
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .

# FSDP requires no weight conversion; it natively supports the huggingface format
# Enable the reference model and train Qwen3-4B in colocated mode
source /root/slime/scripts/run-qwen3-4B-fsdp.sh
```
372+

docs/zh/get_started/usage.md

Lines changed: 60 additions & 2 deletions
@@ -5,7 +5,7 @@
 When using slime, parameters are passed mainly for the following purposes:

 1. Allocate a portion of the GPUs in the cluster for training and another portion for inference;
-2. Load megatron for the training portion;
+2. Load megatron or FSDP for the training portion;
 3. Load sglang for the inference portion;
 4. Configure the hyperparameters needed for RL training.

@@ -308,4 +308,62 @@ if __name__ == "__main__":
- `--custom-megatron-init-path`: adds some init calls;
- `--custom-megatron-before-log-prob-hook-path`: is called before computing the log prob;
- `--custom-megatron-before-train-step-hook-path`: is called before each training step. You could use this to mix in special training losses, for example.

## How to Use FSDP

slime also supports FSDP2 as a training backend; see the [docs](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/fsdp/readme.md).

> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, with no manual specification needed. Megatron requires manually configuring parameters that describe the model architecture, or inferring it automatically via `--use-hf-config-for-megatron`. FSDP reads everything from `config.json`, which avoids the weight format conversion step entirely.

Pass `--train-backend fsdp` on the command line to use FSDP as the training backend.

### Parameters

The parameters supported by the FSDP and Megatron backends are compared in the table below; more FSDP support is on the way.

| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
| --- | --- | --- | --- |
| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size`, etc.) or `--use-hf-config-for-megatron` | `--hf-checkpoint` (required) | **FSDP**: Directly uses the HuggingFace format; no weight conversion needed, architecture inferred via `AutoConfig` |
| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming soon | |
| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming soon | |
| **Expert Parallel** | `--expert-model-parallel-size` | Coming soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine, etc.) | `--lr-decay-style` | Same parameter |
| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
| **Optimizer Type** | `--optimizer` (adam/sgd, etc.) | `--optimizer` (default: adam) | Basically the same |
| **Distributed Optimizer** | `--use-distributed-optimizer` | Built into FSDP | FSDP uses a distributed optimizer by default |
| **Gradient Checkpointing** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to a boolean switch |
| **CPU Offload** | Implemented via the distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offloads parameters/gradients/optimizer states to CPU |
| **CPU Backend** | Implemented via the distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specifies the CPU backend; a hybrid backend is used when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Passed directly to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically the same |
| **Training Backend** | Default, or `--train-backend megatron` | `--train-backend fsdp` (required) | Used to switch backends |
| **Config** | | `--config` | **FSDP**: Sets additional parameters for the FSDP backend |
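
As one concrete illustration of the CPU offload rows above, an invocation might look like the following sketch. `train.py` is a placeholder entry point, and `gloo` is assumed here only because it is PyTorch's usual CPU collective backend; check the provided run scripts for the actual values.

```bash
# Hypothetical sketch of enabling CPU offload, not a verified command.
# gloo is an assumed value, chosen as PyTorch's usual CPU collective backend.
python train.py \
  --train-backend fsdp \
  --hf-checkpoint /root/Qwen3-4B \
  --fsdp-cpu-offload \
  --fsdp-cpu-backend gloo
```
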
### FSDP Quick Start

```bash
# If you need to use WANDB, set the environment variable WANDB_API_KEY in advance

# Download model weights (Qwen3-4B)
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Download training dataset (dapo-math-17k)
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Download evaluation dataset (aime-2024)
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

# Clone the code and install dependencies
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .

# FSDP requires no weight conversion; it natively supports the huggingface format
# Enable the reference model and train Qwen3-4B in colocated mode
source /root/slime/scripts/run-qwen3-4B-fsdp.sh
```