Releases: LBANN/HPC-launcher
v1.0.4
What's Changed
- Improved performance on ROCm 7 + Slingshot systems by finding and using the AWS_OFI_NCCL plugin. This is now the default plugin, deprecating the AWS_OFI_RCCL plugin (which is still used for ROCm 6.x).
- Added environment variables to force use of the libfabric interface provided by the AWS_OFI_*CCL plugins, which is required for high-performance communication. This can be overridden by explicitly setting the NCCL_NET environment variable.
- Added performance-tuning flags for PyTorch on ROCm systems to use channels-last memory ordering, which aligns with best practices for the ROCm / MIOpen libraries.
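As an illustration of the override mentioned above, NCCL_NET can be set explicitly before launching; this is a hedged sketch, and the value "Socket" is only an example, not a recommendation:

```shell
# Override the launcher's plugin-provided transport selection by setting
# NCCL_NET explicitly in the environment before launching.
# The value "Socket" is illustrative only.
export NCCL_NET=Socket
echo "NCCL_NET=${NCCL_NET}"
```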
Full Changelog: v1.0.3...v1.0.4
v1.0.3
v1.0.3 release:
- Setup scripts to auto-release to PyPI
- Update systems and bug fix autodetect (#57)
- Fixed a bug in the autodetect GPU logic where the finally block attached to the try block overwrote its output.
- Added code to auto-detect the ROCm version being used and constrain the installed version of amdsmi accordingly.
- Added definitions for more LLNL systems.
- Added installation instructions.
- Added the device_id initialization to the init_process_group call in the torchrun-hpc trampoline.
v1.0.2
v1.0.1
What's Changed
- Fixed license to use SPDX format, by @bvanessen in #53
Full Changelog: v1.0.0...v1.0.1
v1.0.0
The HPC launcher repository contains a set of helpful scripts and Python bindings for launching PyTorch (torchrun), LBANN 2.0 (PyTorch-core), or generic scripts on multiple leadership-class HPC systems. There are optimized routines for the Flux, Slurm, and LSF launchers, as well as optimized environments for systems at known compute centers. Currently supported systems are at:
- LLNL Livermore Computing (LC)
There are two main entry points into HPC-Launcher from the CLI: launch and torchrun-hpc. torchrun-hpc is intended as a replacement for torchrun, while launch is a generic interface for launching parallel jobs.