Commit ddb062b

Merge branch 'main' into ruit/dpo_lora
2 parents 5294b63 + 9b03ba1 commit ddb062b

8 files changed: +1935 -358 lines changed


docs/guides/ft-launcher-guide.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Fault Tolerance Launcher Guide

`ft_launcher` is provided by `nvidia-resiliency-ext` (included in the NeMo RL dependencies) and enables automatic fault tolerance and recovery for distributed training runs.

## Key Arguments

| Argument | Description | Example |
|----------|-------------|---------|
| `--ft-cfg-path` | Path to the FT YAML config file | `examples/configs/ft_config.yaml` |
| `--ft-rank-heartbeat-timeout` | Heartbeat timeout in seconds | `450` |
| `--ft-initial-rank-heartbeat-timeout` | Initial heartbeat timeout in seconds (longer, to allow for setup) | `1200` |
| `--max-restarts` | Maximum number of restart attempts | `5` |

## Basic Usage

```bash
uv run ft_launcher \
    --ft-cfg-path examples/configs/ft_config.yaml \
    --ft-rank-heartbeat-timeout 450 \
    --ft-initial-rank-heartbeat-timeout 1200 \
    --max-restarts 5 \
    examples/run_grpo_math.py \
    --config <your_config.yaml>
```
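
When the launch command is generated by a wrapper script, it can help to assemble the arguments programmatically and check the timeout relationship before launching. The following is a minimal sketch, not part of NeMo RL or `nvidia-resiliency-ext`: `build_ft_launcher_cmd` and the `my_config.yaml` placeholder are hypothetical, and only the flags shown in the command above are used.

```python
import shlex
from typing import List


def build_ft_launcher_cmd(
    train_script: str,
    train_config: str,
    ft_cfg_path: str = "examples/configs/ft_config.yaml",
    heartbeat_timeout_s: int = 450,
    initial_heartbeat_timeout_s: int = 1200,
    max_restarts: int = 5,
) -> List[str]:
    """Assemble the argv for an ft_launcher run (hypothetical helper)."""
    # The initial timeout has to cover model loading and setup, so it
    # should be longer than the steady-state heartbeat timeout.
    if initial_heartbeat_timeout_s <= heartbeat_timeout_s:
        raise ValueError(
            "initial heartbeat timeout should exceed the steady-state timeout"
        )
    if max_restarts < 0:
        raise ValueError("max_restarts must be non-negative")
    return [
        "uv", "run", "ft_launcher",
        "--ft-cfg-path", ft_cfg_path,
        "--ft-rank-heartbeat-timeout", str(heartbeat_timeout_s),
        "--ft-initial-rank-heartbeat-timeout", str(initial_heartbeat_timeout_s),
        "--max-restarts", str(max_restarts),
        train_script,
        "--config", train_config,
    ]


# "my_config.yaml" is a placeholder for your own training config.
cmd = build_ft_launcher_cmd("examples/run_grpo_math.py", "my_config.yaml")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the run; returning a list rather than a single string avoids shell-quoting problems with paths that contain spaces.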

## FT Config File (examples/configs/ft_config.yaml)

```yaml
fault_tolerance:
  initial_rank_heartbeat_timeout: 360
  restart_policy: any-failed
```

## Important Notes

1. **Checkpointing**: Checkpointing must be enabled for recovery to work:
   ```bash
   ++checkpointing.enabled=true
   ++checkpointing.checkpoint_dir=/path/to/checkpoints
   ++checkpointing.save_period=50
   ```

2. **Timeouts**: Set `--ft-initial-rank-heartbeat-timeout` higher than `--ft-rank-heartbeat-timeout` to allow time for model loading and setup.

3. **Restart Policy**: The `any-failed` restart policy restarts the entire job if any rank fails. Look for these log messages to identify when a restart occurs:

   ```
   [ERROR] [ft_launcher...] failed (exitcode: 1) local_rank: 0 (pid: ...) of binary: ...
   [INFO] [ft_launcher...] [default] Worker group FAILED. 3/5 attempts left; will restart worker group
   [INFO] [ft_launcher...] Stopping workers... Timeout = 30 sec.
   [INFO] [ft_launcher...] The node '...' attempts to join the next round of the rendezvous '...'.
   [INFO] [ft_launcher...] The node '...' has joined round N of the rendezvous '...' as rank 0 in a world of size 1.
   ```

   Key indicators:
   - `Worker group FAILED. X/Y attempts left` - a restart is happening; X is the number of attempts remaining out of Y
   - `will restart worker group` - confirms that a restart is in progress
   - `has joined round N` - the round number increases with each restart
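
The indicators above can also be checked by scanning a saved log file. The following is a rough sketch that assumes the messages appear exactly as in the excerpt above; the wording may differ between `nvidia-resiliency-ext` versions. `RestartSummary` and `summarize_restarts` are hypothetical names, and the sample lines are illustrative, with the round number filled in.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

# Patterns taken from the log excerpt in this guide (assumed stable).
FAILED_RE = re.compile(r"Worker group FAILED\. (\d+)/(\d+) attempts left")
JOINED_RE = re.compile(r"has joined round (\d+) of the rendezvous")


@dataclass
class RestartSummary:
    failures: int = 0                     # number of worker-group failures seen
    attempts_left: Optional[int] = None   # X from the most recent "X/Y attempts left"
    max_attempts: Optional[int] = None    # Y from the most recent "X/Y attempts left"
    rounds_joined: List[int] = field(default_factory=list)


def summarize_restarts(lines: Iterable[str]) -> RestartSummary:
    """Collect restart indicators from ft_launcher log lines."""
    summary = RestartSummary()
    for line in lines:
        failed = FAILED_RE.search(line)
        if failed:
            summary.failures += 1
            summary.attempts_left = int(failed.group(1))
            summary.max_attempts = int(failed.group(2))
            continue
        joined = JOINED_RE.search(line)
        if joined:
            summary.rounds_joined.append(int(joined.group(1)))
    return summary


# Illustrative lines modeled on the excerpt above.
sample = [
    "[ERROR] [ft_launcher...] failed (exitcode: 1) local_rank: 0 (pid: ...) of binary: ...",
    "[INFO] [ft_launcher...] [default] Worker group FAILED. 3/5 attempts left; will restart worker group",
    "[INFO] [ft_launcher...] The node '...' has joined round 2 of the rendezvous '...' as rank 0 in a world of size 1.",
]
print(summarize_restarts(sample))
```

For a real run, pass an open log file instead of `sample`. A failure count that keeps rising while `attempts_left` approaches zero suggests the job is about to exhaust `--max-restarts`.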

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -219,6 +219,7 @@ guides/deepseek.md
 model-quirks.md
 guides/async-grpo.md
 guides/dtensor-tp-accuracy.md
+guides/ft-launcher-guide.md
 ```

 ```{toctree}

examples/configs/ft_config.yaml

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
fault_tolerance:
  initial_rank_heartbeat_timeout: 360
  restart_policy: any-failed
