Releases: LBANN/HPC-launcher
v1.0.4
What's Changed
- Improved performance on ROCm 7 + Slingshot systems by finding and using the AWS_OFI_NCCL plugin. This is now the default plugin, deprecating the AWS_OFI_RCCL plugin (which is still used for ROCm 6.x).
- Added environment variables to force use of the libfabric interface provided by the AWS_OFI_*CCL plugins, which is required for high-performance communication. This can be overridden by explicitly setting the NCCL_NET environment variable.
- Added performance-tuning flags for PyTorch on ROCm systems to use channels-last memory ordering, which aligns with best practices for the ROCm / MIOpen libraries.
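As an illustration of the override mentioned above, NCCL_NET can be set explicitly before launching; this is a hedged sketch, and the value "Socket" is only an example, not a recommendation:

```shell
# Override the launcher's plugin-provided transport selection by setting
# NCCL_NET explicitly in the environment before launching.
# The value "Socket" is illustrative only.
export NCCL_NET=Socket
echo "NCCL_NET=${NCCL_NET}"
```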
Full Changelog: v1.0.3...v1.0.4
v1.0.3
v1.0.3 release:
- Setup scripts to auto-release to PyPI
- Update systems and bug fix autodetect (#57)
- Fixed a bug in the autodetect GPU logic where the finally block attached to the try block overwrote its output.
- Added code to auto-detect the ROCm version being used and constrain the installed version of amdsmi accordingly.
- Added definitions for more LLNL systems.
- Added installation instructions.
- Added the device_id initialization to the init_process_group call in the torchrun-hpc trampoline.
v1.0.2
v1.0.1
What's Changed
- Fixed license to use SPDX format, by @bvanessen in #53
Full Changelog: v1.0.0...v1.0.1
v1.0.0
The HPC launcher repository contains a set of helpful scripts and Python bindings for launching PyTorch (torchrun), LBANN 2.0 (PyTorch-core), or generic scripts on multiple leadership-class HPC systems. There are optimized routines for the Flux, Slurm, and LSF launchers, as well as optimized environments for systems at known compute centers. Currently supported systems are at:
- LLNL Livermore Computing (LC)
There are two main entry points into HPC-Launcher from the CLI: launch and torchrun-hpc. torchrun-hpc is intended as a replacement for torchrun, while launch is a generic interface for launching parallel jobs.