* When running under the Flux scheduler inside an allocation, the --queue
  argument is not valid. Added a warning and a guard to drop it when
  appropriate.
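The guard might look like the following (a minimal sketch, not the launcher's actual code: FLUX_URI is the environment variable Flux sets inside an allocation, and the sketch assumes --queue always takes a separate value argument):

```python
import os
import warnings


def drop_queue_arg(args: list[str]) -> list[str]:
    """Drop --queue (and its value) when running inside a Flux allocation.

    Sketch only; the real launcher's argument handling may differ.
    """
    if "FLUX_URI" not in os.environ or "--queue" not in args:
        return args  # not in a Flux allocation, or nothing to drop
    warnings.warn("--queue is not valid inside a Flux allocation; dropping it")
    i = args.index("--queue")
    return args[:i] + args[i + 2:]  # remove the flag and its value
```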
* Added guards to avoid initializing torch distributed for single-rank
  launches. Also added code to detect whether a launch command is running
  inside an existing allocation and, if so, to report how many ranks are
  available within it. Updated the torchrun-hpc tests so that, when run
  inside an allocation, they are skipped if there are insufficient
  resources.
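The allocation detection could be sketched along these lines (an illustrative helper, not the launcher's implementation; SLURM_NTASKS, FLUX_JOB_SIZE, and LSB_DJOB_NUMPROC are the variables Slurm, Flux, and LSF set inside a job, and the coverage here is deliberately minimal):

```python
import os
from typing import Optional


def ranks_in_allocation() -> Optional[int]:
    """Return the number of ranks available in an existing allocation,
    or None when not running inside one. Sketch only."""
    for var in ("SLURM_NTASKS",       # Slurm: tasks in the job/step
                "FLUX_JOB_SIZE",      # Flux: total tasks in the job
                "LSB_DJOB_NUMPROC"):  # LSF: slots assigned to the job
        if var in os.environ:
            return int(os.environ[var])
    return None  # no scheduler allocation detected
```

A test harness can then compare this value against the ranks a test requires and skip when the allocation is too small.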
* Added support for LSF systems.
* Fixed which environment variable is used for the number of nodes in a
  Slurm allocation.
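For reference, inside a Slurm allocation the node count is exposed as SLURM_JOB_NUM_NODES, with SLURM_NNODES as an older alias; the changelog does not say which variable the fix settled on, so the sketch below simply prefers the canonical name:

```python
import os


def slurm_node_count(default: int = 1) -> int:
    """Number of nodes in the current Slurm allocation (sketch only).

    Prefers SLURM_JOB_NUM_NODES, falls back to the older SLURM_NNODES
    alias, then to a default for runs outside Slurm.
    """
    for var in ("SLURM_JOB_NUM_NODES", "SLURM_NNODES"):
        if var in os.environ:
            return int(os.environ[var])
    return default
```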
* Fixed a bug when the command_line parameter isn't set.
* Renamed the torchrun_hpc_stub.py file to torchrun_hpc_trampoline.py and
  moved it to the torch directory. The file is no longer renamed at launch
  time, which avoids triggering spurious pytest collection.
* Renamed the run-from-dir flag to run-from-launch-dir.
* Added the no-launch-dir flag to avoid creating a timestamped directory
  for launching the script. Instead, the launch script and log files are
  created in the current working directory.
* Added a command line argument, --save-hostlist, to enable writing the
  hostlist to a file. By default the launch script no longer does this.
* Enabled verbose mode to also save the hostlist.
* Apply suggestions from code review
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
* Changed the trampoline to always check whether distributed PyTorch is
  initialized and, if so, destroy it at the end.
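That teardown check can be sketched as follows (a hypothetical helper using the public torch.distributed API; it degrades gracefully when PyTorch is not installed, and single-rank launches that never initialized a process group are left untouched):

```python
def teardown_distributed() -> bool:
    """Destroy the torch.distributed process group if one was initialized.

    Returns True only when a group was actually torn down. Sketch only;
    the trampoline's real cleanup may do more.
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return False  # PyTorch not available; nothing to clean up
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False  # no process group was ever initialized
```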
---------
Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>