|
| 1 | +# Fault Tolerance Launcher Guide |
| 2 | + |
| 3 | +The `ft_launcher` is provided by `nvidia-resiliency-ext` (included in NeMo RL dependencies) and enables automatic fault tolerance and recovery for distributed training runs. |
| 4 | + |
| 5 | +## Key Arguments |
| 6 | + |
| 7 | +| Argument | Description | Example | |
| 8 | +|----------|-------------|---------| |
| 9 | +| `--ft-cfg-path` | Path to FT YAML config file | `examples/configs/ft_config.yaml` | |
| 10 | +| `--ft-rank-heartbeat-timeout` | Timeout in seconds for subsequent heartbeats from a rank | `450` |
| 11 | +| `--ft-initial-rank-heartbeat-timeout` | Timeout in seconds for the first heartbeat from a rank (longer, to cover setup) | `1200` |
| 12 | +| `--max-restarts` | Maximum number of restart attempts | `5` | |
| 13 | + |
| 14 | +## Basic Usage |
| 15 | + |
| 16 | +```bash |
| 17 | +uv run ft_launcher \ |
| 18 | + --ft-cfg-path examples/configs/ft_config.yaml \ |
| 19 | + --ft-rank-heartbeat-timeout 450 \ |
| 20 | + --ft-initial-rank-heartbeat-timeout 1200 \ |
| 21 | + --max-restarts 5 \ |
| 22 | + examples/run_grpo_math.py \ |
| 23 | + --config <your_config.yaml> |
| 24 | +``` |
| 25 | + |
| 26 | +## FT Config File (`examples/configs/ft_config.yaml`) |
| 27 | + |
| 28 | +```yaml |
| 29 | +fault_tolerance: |
| 30 | + initial_rank_heartbeat_timeout: 360 |
| 31 | + restart_policy: any-failed |
| 32 | +``` |
| 33 | + |
| 34 | +## Important Notes |
| 35 | + |
| 36 | +1. **Checkpointing**: Enable checkpointing for recovery to work: |
| 37 | + ```bash |
| 38 | + ++checkpointing.enabled=true |
| 39 | + ++checkpointing.checkpoint_dir=/path/to/checkpoints |
| 40 | + ++checkpointing.save_period=50 |
| 41 | + ``` |
| 42 | + |
| 43 | +2. **Timeouts**: Set `--ft-initial-rank-heartbeat-timeout` higher than `--ft-rank-heartbeat-timeout` so that model loading and setup do not trigger a timeout before the first heartbeat arrives. |
| 44 | + |
| 45 | +3. **Restart Policy**: The `any-failed` restart policy restarts the entire job when any rank fails. The following log messages indicate that a restart occurred: |
| 46 | + |
| 47 | + ``` |
| 48 | + [ERROR] [ft_launcher...] failed (exitcode: 1) local_rank: 0 (pid: ...) of binary: ... |
| 49 | + [INFO] [ft_launcher...] [default] Worker group FAILED. 3/5 attempts left; will restart worker group |
| 50 | + [INFO] [ft_launcher...] Stopping workers... Timeout = 30 sec. |
| 51 | + [INFO] [ft_launcher...] The node '...' attempts to join the next round of the rendezvous '...'. |
| 52 | + [INFO] [ft_launcher...] The node '...' has joined round N of the rendezvous '...' as rank 0 in a world of size 1. |
| 53 | + ``` |
| 54 | + |
| 55 | + Key indicators: |
| 56 | + - `Worker group FAILED. X/Y attempts left`: a restart was triggered; `X` of `Y` attempts remain |
| 57 | + - `will restart worker group`: confirms that the restart is in progress |
| 58 | + - `has joined round N`: the round number `N` increases with each restart |
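
The key indicators above can be checked automatically. The following is a minimal Python sketch, not part of `ft_launcher` or NeMo RL: the function name `summarize_restarts` is hypothetical, and the patterns are copied from the sample log lines in this guide, so they may need adjusting if the exact wording differs between `nvidia-resiliency-ext` versions.

```python
import re

# Patterns taken from the sample log lines in this guide (assumption: the
# real launcher output uses the same wording, with a number in place of N).
ATTEMPTS_RE = re.compile(r"Worker group FAILED\. (\d+)/(\d+) attempts left")
ROUND_RE = re.compile(r"has joined round (\d+) of the rendezvous")


def summarize_restarts(log_lines):
    """Summarize restart activity found in an iterable of log lines.

    Hypothetical helper: returns the number of restarts seen, the most
    recent (remaining, total) attempts pair, and the latest rendezvous round.
    """
    restarts = 0
    attempts_left = None
    last_round = None
    for line in log_lines:
        match = ATTEMPTS_RE.search(line)
        if match:
            # Count a restart only when the launcher says it will restart.
            if "will restart worker group" in line:
                restarts += 1
            attempts_left = (int(match.group(1)), int(match.group(2)))
            continue
        match = ROUND_RE.search(line)
        if match:
            last_round = int(match.group(1))
    return {
        "restarts": restarts,
        "attempts_left": attempts_left,
        "last_round": last_round,
    }
```

To use it, pass any iterable of lines, for example an open file handle for a captured launcher log. A rising `last_round` together with a shrinking first value in `attempts_left` suggests the job is restarting repeatedly and may exhaust `--max-restarts`.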