This directory contains advanced examples demonstrating various fault tolerance features and training approaches in TorchFT beyond the basic train_ddp.py example in the README.
Each directory contains a README with more detailed instructions, as well as extensive documentation on the feature being showcased and how to interpret the outputs.
- DDP with proactive failure recovery: Demonstrates DDP with proactive failure recovery mode
- DiLoCo: Demonstrates Distributed Low-Communication (DiLoCo) training
- LocalSGD: Demonstrates Local SGD with periodic synchronization
- Live Checkpoint Recovery: Demonstrates live checkpoint recovery
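
To make "Local SGD with periodic synchronization" concrete, here is a toy sketch in plain Python. It is not TorchFT's implementation and uses no TorchFT APIs. The function name `local_sgd`, its parameters, and the quadratic loss are all assumptions chosen for illustration. Each simulated replica takes gradient steps on its own objective, and the replicas average their weights every `sync_every` steps.

```python
def local_sgd(targets, steps, sync_every, lr=0.1):
    """Toy Local SGD: replica i minimizes (w - targets[i]) ** 2.

    Hypothetical helper for illustration only; not part of TorchFT.
    """
    weights = [0.0 for _ in targets]
    for step in range(1, steps + 1):
        # Local update on each replica: d/dw (w - t)^2 = 2 * (w - t).
        weights = [w - lr * 2.0 * (w - t) for w, t in zip(weights, targets)]
        # Periodic synchronization: replace every replica's weight
        # with the average across replicas.
        if step % sync_every == 0:
            mean = sum(weights) / len(weights)
            weights = [mean] * len(weights)
    return weights


# Two replicas with different local optima (1.0 and 3.0).
final = local_sgd(targets=[1.0, 3.0], steps=40, sync_every=4)
```

Between synchronization points the replicas drift toward their own optima, and the averaging step pulls them back together. In this toy setup the shared weight approaches the mean of the targets.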
Start the lighthouse server by running:
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
Then cd into the example directory:
cd examples/[example_directory]
and launch the example with TorchX:
export QUICK_RUN=1
torchx run
The QUICK_RUN environment variable runs the examples for far fewer steps and uses a synthetic dataset rather than a downloaded one. It is useful for testing the examples quickly.
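
As a rough sketch of how a training script could react to QUICK_RUN, the snippet below reads the variable and picks a shorter run with synthetic data. The function name `load_run_config`, the dictionary keys, and the step counts are hypothetical and not taken from the example scripts. Check each example's source for the actual behavior.

```python
import os


def load_run_config(env=None):
    """Choose run settings based on QUICK_RUN (hypothetical helper)."""
    env = os.environ if env is None else env
    # Treat QUICK_RUN=1 as "quick mode"; anything else is a full run.
    quick = env.get("QUICK_RUN", "0") == "1"
    return {
        # Step counts are illustrative placeholders, not the real values.
        "num_steps": 10 if quick else 1000,
        "dataset": "synthetic" if quick else "downloaded",
    }


config = load_run_config({"QUICK_RUN": "1"})
```

Passing the environment as an argument keeps the helper easy to test without modifying the real process environment.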
See the .torchxconfig file in each example directory for configuration details, and torchx.py and the TorchX documentation to understand how DDP is run.