Commit 40a1026

[FR][doc] Update README with reference to Flight Recorder (#599)
Summary: Update readme with reference to the flight recorder tutorial to help users diagnose stuck jobs. Test Plan: none.
1 parent ce5a73e commit 40a1026

1 file changed: +8 -0 lines changed

README.md

Lines changed: 8 additions & 0 deletions
@@ -129,6 +129,14 @@ If your gpu count per node is not 8, adjust:
 
 in the SBATCH command section.
 
+
+## Debugging
+### Troubleshooting Jobs that Timeout
+If you encounter jobs that timeout, you'll need to debug them to identify the root cause. To help with this process, we've enabled Flight Recorder, a tool that continuously collects diagnostic information about your jobs.
+When a job times out, Flight Recorder automatically generates dump files on every rank containing valuable debugging data. You can find these dump files in the `job.dump_folder` directory.
+To learn how to analyze and diagnose issues using these logs, follow our step-by-step tutorial [link](https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html).
+
+
 ## License
 
 This code is made available under [BSD 3 license](./LICENSE). However you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, data, etc.
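
The Debugging section added in this change says that Flight Recorder writes dump files on every rank into the `job.dump_folder` directory. As a first sanity check before following the tutorial, it can help to confirm that files were actually written. The snippet below is a minimal sketch under stated assumptions: the helper name `list_dump_files` and the `./outputs` path are hypothetical, and nothing is assumed about how the dump files are named, so it simply lists every file under the folder with its size.

```python
from pathlib import Path


def list_dump_files(dump_folder):
    """Return sorted (relative path, size in bytes) pairs for files under dump_folder.

    Hypothetical helper: it does not know the Flight Recorder file naming
    scheme, so it reports every regular file it finds, recursively.
    """
    root = Path(dump_folder)
    if not root.is_dir():
        # Folder missing: nothing was dumped here (or the path is wrong).
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    return sorted((str(p.relative_to(root)), p.stat().st_size) for p in files)


if __name__ == "__main__":
    # "./outputs" is a placeholder; substitute the value of job.dump_folder.
    for name, size in list_dump_files("./outputs"):
        print(f"{size:>12}  {name}")
```

An empty result after a timed-out job suggests the folder path is wrong or the dump did not complete; the linked tutorial covers how to analyze the files once they are located.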
