When using slime, parameters are primarily passed for the following purposes:

1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
2. To load Megatron or FSDP for the training portion.
3. To load SGLang for the inference portion.
4. To configure the hyperparameters required for RL training.

[...]

In some customized Megatron implementations, special operations need to be performed [...]

- `--custom-megatron-init-path`: Adds some initialization calls.
- `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
- `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.

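For illustration, here is a hedged sketch of wiring these hooks into a launch command. The entry point (`train.py`) and the `my_hooks.*` values are hypothetical, and the exact value format these flags expect should be checked against your slime version:

```bash
# Hypothetical sketch only: "train.py" stands in for your actual launch entry
# point, and the my_hooks.* values are placeholder paths, not real modules.
python train.py \
  --custom-megatron-init-path my_hooks.init \
  --custom-megatron-before-log-prob-hook-path my_hooks.before_log_prob \
  --custom-megatron-before-train-step-hook-path my_hooks.before_train_step
```
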
## How to Use FSDP

slime also supports FSDP2 as the training backend; see the docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).

> FSDP reads all model architecture information automatically via `AutoModelForCausalLM.from_pretrained()`, with no manual specification. Megatron requires the architecture to be configured manually through parameters, or inferred automatically via `--use-hf-config-for-megatron`. Because FSDP reads everything from `config.json`, it avoids the weight format conversion step entirely.

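To see what FSDP picks up for free, you can inspect the architecture fields that `AutoConfig` recovers from `config.json`; the local path here assumes the Qwen3-4B download from the quick start below:

```bash
# Print the architecture info FSDP reads from config.json (no weight
# conversion involved). The path assumes the quick-start download below.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('/root/Qwen3-4B'))"
```
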
To use FSDP as the training backend, pass `--train-backend fsdp`.

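Combined with the required checkpoint flag from the parameter table below, a minimal invocation might look like this sketch; the `train.py` entry point and the path are illustrative, not slime's actual launcher:

```bash
# Minimal sketch (illustrative entry point and path): switch the backend to
# FSDP and point it at a HuggingFace-format checkpoint.
python train.py \
  --train-backend fsdp \
  --hf-checkpoint /root/Qwen3-4B
```
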
### Parameters

The parameters FSDP supports are shown below in comparison to Megatron; support for more is on the way.

| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
| --- | --- | --- | --- |
| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size`, etc.), or `--use-hf-config-for-megatron` | `--hf-checkpoint` (required) | **FSDP**: Uses the HuggingFace format directly, no weight conversion needed; the architecture is inferred via `AutoConfig` |
| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming soon | |
| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming soon | |
| **Expert Parallel** | `--expert-model-parallel-size` | Coming soon | |
| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine, etc.) | `--lr-decay-style` | Same parameter |
| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
| **Optimizer Type** | `--optimizer` (adam/sgd, etc.) | `--optimizer` (default: adam) | Basically the same |
| **Distributed Optimizer** | `--use-distributed-optimizer` | Built into FSDP | FSDP uses a distributed optimizer by default |
| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to a boolean switch |
| **CPU Offload** | Implemented via the distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offloads parameters/gradients/optimizer states to CPU |
| **CPU Backend** | Implemented via the distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specifies the CPU backend; a hybrid backend is used when CPU offload is enabled |
| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Passed directly to HuggingFace |
| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 is inferred automatically) | Basically the same |
| **Training Backend** | Default, or `--train-backend megatron` | `--train-backend fsdp` (required) | Switches the backend |
| **Config** | | `--config` | **FSDP**: Sets additional parameters for the FSDP backend |
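
As a worked example of the table, here is a hedged sketch combining the shared schedule flags with the FSDP-specific switches. Every flag below appears in the table above, but the values and the `train.py` entry point are illustrative only:

```bash
# Illustrative values only; every flag below comes from the parameter table.
python train.py \
  --train-backend fsdp \
  --hf-checkpoint /root/Qwen3-4B \
  --lr 1e-6 \
  --lr-decay-style cosine \
  --lr-warmup-iters 10 \
  --min-lr 1e-7 \
  --optimizer adam \
  --gradient-checkpointing \
  --fsdp-cpu-offload \
  --attn-implementation flash_attention_2
```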

### Quick Start

```bash
# If you plan to use WANDB, set the WANDB_API_KEY environment variable in advance.

# Download model weights (Qwen3-4B)
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

# Download training dataset (dapo-math-17k)
hf download --repo-type dataset zhuzilin/dapo-math-17k \
  --local-dir /root/dapo-math-17k

# Download evaluation dataset (aime-2024)
hf download --repo-type dataset zhuzilin/aime-2024 \
  --local-dir /root/aime-2024

# Clone the code and install dependencies
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .

# FSDP does not require weight conversion; it natively supports the HuggingFace format.
# Enable the reference model and train Qwen3-4B in colocate mode.
source /root/slime/scripts/run-qwen3-4B-fsdp.sh
```