Releases: LBANN/HPC-launcher

v1.0.4

22 Jan 01:57
0f65e74

What's Changed

  • Improved performance on ROCm 7 + Slingshot systems by finding and using the AWS_OFI_NCCL plugin. This is now the default plugin, and the AWS_OFI_RCCL plugin is deprecated (it is still used for ROCm 6.x).
  • Added environment variables that force use of the libfabric interface provided by the AWS_OFI_*CCL plugins, which is required for high-performance communication. This can be overridden by explicitly setting the NCCL_NET environment variable.
  • Added performance tuning flags for PyTorch on ROCm systems to use channels-last ordering, in line with best practices for the ROCm / MIOpen libraries.
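
The override behavior described for NCCL_NET can be illustrated with a small sketch. This is not the launcher's actual code: the helper name `apply_defaults` and the placeholder value are assumptions made for illustration, since these notes do not state which value the launcher sets. The point is only that a launcher-provided default is applied when the user has not already set the variable.

```python
import os
from typing import Dict, MutableMapping, Optional

# Placeholder only: the real value set by the launcher is not given in these notes.
ASSUMED_DEFAULTS: Dict[str, str] = {"NCCL_NET": "<plugin-network-name>"}


def apply_defaults(
    env: Optional[MutableMapping[str, str]] = None,
    defaults: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Set each default only if the variable is not already present.

    Returns the subset of defaults that was actually applied, so a caller
    can log what was set versus what the user overrode.
    """
    if env is None:
        env = os.environ
    if defaults is None:
        defaults = ASSUMED_DEFAULTS
    applied: Dict[str, str] = {}
    for name, value in defaults.items():
        if name not in env:  # an explicit user setting always wins
            env[name] = value
            applied[name] = value
    return applied
```

With this pattern, exporting NCCL_NET before launching leaves the user's value untouched, while an unset variable receives the default.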

Full Changelog: v1.0.3...v1.0.4

v1.0.3

08 Jan 21:39
4c97b24

v1.0.3 release:

Set up scripts to auto-release to PyPI

Update systems and bug fix autodetect (#57)

  • Fixed a bug in the GPU autodetect logic where the finally block
    attached to the try block overwrote the output. Added code to
    auto-detect the ROCm version in use and constrain the installed
    amdsmi version accordingly. Added definitions for more LLNL systems.

  • Added installation instructions.

Added the device_id initialization to the init_process_group call in
the torchrun-hpc trampoline.
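
As a rough illustration of what passing device_id to init_process_group looks like, here is a minimal sketch. It is not the trampoline's actual code. It assumes PyTorch 2.3 or newer (where `init_process_group` accepts `device_id`), that the launcher exports `LOCAL_RANK` as torchrun does, and that GPUs are addressed through the `cuda` device type (which PyTorch also uses on ROCm). The helper names are hypothetical.

```python
import os
from typing import Mapping, Optional


def local_device_index(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the per-node rank from LOCAL_RANK, defaulting to 0 if unset."""
    if env is None:
        env = os.environ
    return int(env.get("LOCAL_RANK", "0"))


def init_with_device_id(backend: str = "nccl"):
    """Bind this process to its local GPU and initialize the process group.

    torch is imported lazily so the helper above stays usable without it.
    Assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are already set
    by the launcher.
    """
    import torch
    import torch.distributed as dist

    device = torch.device("cuda", local_device_index())
    torch.cuda.set_device(device)
    # device_id tells the backend which device this rank will use.
    dist.init_process_group(backend=backend, device_id=device)
    return device
```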

v1.0.2

31 Aug 11:59
b44575d

What's Changed

Full Changelog: v1.0.1...v1.0.2

v1.0.1

31 Aug 11:50
5ca955c

What's Changed

Full Changelog: v1.0.0...v1.0.1

v1.0.0

27 Aug 17:56
b479c41

The HPC-Launcher repository contains a set of helpful scripts and Python bindings for launching PyTorch (torchrun), LBANN 2.0 (PyTorch-core), or generic scripts on multiple leadership-class HPC systems. It includes optimized routines for the FLUX, SLURM, and LSF launchers, as well as optimized environments for systems at known compute centers. Systems are currently supported at:

  • LLNL Livermore Computing (LC)

HPC-Launcher has two main entry points from the CLI: launch and torchrun-hpc. torchrun-hpc is intended as a replacement for torchrun, while launch is a generic interface for launching parallel jobs.