diff --git a/README.md b/README.md
index eb6aa76a9..b8119a308 100644
--- a/README.md
+++ b/README.md
@@ -129,6 +129,14 @@ If your gpu count per node is not 8, adjust:
 
 in the SBATCH command section.
 
+
+## Debugging
+### Troubleshooting Jobs that Time Out
+If a job times out, you'll need to debug it to identify the root cause. To help with this, we've enabled Flight Recorder, a tool that continuously collects diagnostic information about your jobs.
+When a job times out, Flight Recorder automatically writes dump files on every rank containing debugging data. You can find these dump files in the `job.dump_folder` directory.
+To learn how to analyze these dumps and diagnose issues, follow our [step-by-step tutorial](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
+
+
 ## License
 
 This code is made available under [BSD 3 license](./LICENSE). However you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, data, etc.
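
As a quick first check before working through the tutorial, the Debugging section added in this diff could be paired with a small script that lists what was written under `job.dump_folder`. The sketch below is not part of the change. It assumes the dump files are Python pickles, and it assumes nothing about their file names or internal layout. The helper name `summarize_dumps` is hypothetical.

```python
import pickle
from pathlib import Path


def summarize_dumps(dump_dir):
    """Return {relative path: short summary} for every file under dump_dir.

    Assumption: dump files are pickles. Files that fail to unpickle are
    reported as unreadable instead of aborting the scan.
    """
    root = Path(dump_dir)
    summaries = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        key = str(path.relative_to(root))
        try:
            with path.open("rb") as f:
                data = pickle.load(f)  # only unpickle files you trust
        except Exception as exc:
            summaries[key] = f"unreadable ({type(exc).__name__})"
            continue
        if isinstance(data, dict):
            # Show the top-level keys so the structure is visible at a glance.
            keys = ", ".join(sorted(str(k) for k in data))
            summaries[key] = f"dict with keys: {keys}"
        else:
            summaries[key] = type(data).__name__
    return summaries
```

For example, `summarize_dumps("./outputs")` would scan a hypothetical dump folder named `./outputs`; substitute whatever `job.dump_folder` is set to in your configuration. Seeing which ranks produced a dump, and which did not, is often a useful starting point before the per-rank analysis described in the tutorial.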