When using slime, parameters are primarily passed for the following purposes:

1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
2. To load Megatron or FSDP for the training portion.
3. To load SGLang for the inference portion.
4. To configure the hyperparameters required for RL training.

[...]

In some customized Megatron implementations, special operations need to be performed [...]

- `--custom-megatron-init-path`: Adds some initialization calls.
- `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
- `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.

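For illustration, here is a hedged sketch of wiring these hooks into a launch command. The entry point (`train.py`) and the `my_hooks.*` values are hypothetical, and the exact value format these flags expect should be checked against your slime version:

```bash
# Hypothetical sketch only: "train.py" stands in for your actual launch entry
# point, and the my_hooks.* values are placeholder paths, not real modules.
python train.py \
  --custom-megatron-init-path my_hooks.init \
  --custom-megatron-before-log-prob-hook-path my_hooks.before_log_prob \
  --custom-megatron-before-train-step-hook-path my_hooks.before_train_step
```
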
## How to Use FSDP

slime also supports FSDP2 as the training backend; see the docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).

> FSDP reads all model architecture information automatically via `AutoModelForCausalLM.from_pretrained()`, with no manual specification. Megatron requires the architecture to be configured manually through parameters, or inferred automatically via `--use-hf-config-for-megatron`. Because FSDP reads everything from `config.json`, it avoids the weight format conversion step entirely.

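To see what FSDP picks up for free, you can inspect the architecture fields that `AutoConfig` recovers from `config.json`; the local path here assumes the Qwen3-4B download from the quick start below:

```bash
# Print the architecture info FSDP reads from config.json (no weight
# conversion involved). The path assumes the quick-start download below.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('/root/Qwen3-4B'))"
```
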
To use FSDP as the training backend, pass `--train-backend fsdp`.

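Combined with the required checkpoint flag from the parameter table below, a minimal invocation might look like this sketch; the `train.py` entry point and the path are illustrative, not slime's actual launcher:

```bash
# Minimal sketch (illustrative entry point and path): switch the backend to
# FSDP and point it at a HuggingFace-format checkpoint.
python train.py \
  --train-backend fsdp \
  --hf-checkpoint /root/Qwen3-4B
```
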
### Parameters

The parameters FSDP supports are shown below in comparison to Megatron; support for more is on the way.

| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
| --- | --- | --- | --- |
| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size`, etc.), or `--use-hf-config-for-megatron` | `--hf-checkpoint` (required) | **FSDP**: Uses the HuggingFace format directly, no weight conversion needed; the architecture is inferred via `AutoConfig` |
| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming soon | |
| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming soon | |
| **Expert Parallel** | `--expert-model-parallel-size` | Coming soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine, etc.) | `--lr-decay-style` | Same parameter |
| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
| **Optimizer Type** | `--optimizer` (adam/sgd, etc.) | `--optimizer` (default: adam) | Basically the same |
| **Distributed Optimizer** | `--use-distributed-optimizer` | Built into FSDP | FSDP uses a distributed optimizer by default |
| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to a boolean switch |
| **CPU Offload** | Implemented via the distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offloads parameters/gradients/optimizer states to CPU |
| **CPU Backend** | Implemented via the distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specifies the CPU backend; a hybrid backend is used when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Passed directly to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 is inferred automatically) | Basically the same |
| **Training Backend** | Default, or `--train-backend megatron` | `--train-backend fsdp` (required) | Switches the backend |
| **Config** | | `--config` | **FSDP**: Sets additional parameters for the FSDP backend |
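
As a worked example of the table, here is a hedged sketch combining the shared schedule flags with the FSDP-specific switches. Every flag below appears in the table above, but the values and the `train.py` entry point are illustrative only:

```bash
# Illustrative values only; every flag below comes from the parameter table.
python train.py \
  --train-backend fsdp \
  --hf-checkpoint /root/Qwen3-4B \
  --lr 1e-6 \
  --lr-decay-style cosine \
  --lr-warmup-iters 10 \
  --min-lr 1e-7 \
  --optimizer adam \
  --gradient-checkpointing \
  --fsdp-cpu-offload \
  --attn-implementation flash_attention_2
```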

### Quick Start

```bash
# If you plan to use WANDB, set the WANDB_API_KEY environment variable in advance.

# Download model weights (Qwen3-4B)
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Download training dataset (dapo-math-17k)
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Download evaluation dataset (aime-2024)
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

# Clone the code and install dependencies
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .

# FSDP does not require weight conversion; it natively supports the HuggingFace format.
# Enable the reference model and train Qwen3-4B in colocate mode.
source /root/slime/scripts/run-qwen3-4B-fsdp.sh
```