Releases: stackhpc/ansible-slurm-appliance

v2.5

09 Sep 23:15
bc0c66c

What's Changed

  • Refactor Pulp repo definitions and add more Pulp documentation by @wtripp180901 in #760
  • Fix incorrect use of "partition" in OpenTofu node group variable definitions by @sjpb in #771
  • Bump Pulp snapshots for RL 9.6 by @priteau in #772
  • Add support for setting server group in scheduler hints by @sjpb in #773
  • Bump CUDA to 13.0.1 and NVIDIA driver to 580.82.07 by @priteau in #776
  • Make CaaS specific role: persist_openhpc_secrets idempotent by @bertiethorpe in #774

Full Changelog: v2.4.1...v2.5

Images

Two new images are available:

RockyLinux 8: openhpc-RL8-250820-0800-767addd8
RockyLinux 9: openhpc-RL9-250908-2047-d90ebd0e

v2.4.1

04 Sep 09:11
cbf990a

What's Changed

  • Fix inventory parsing of cookiecutter env by @MoteHue in #768

Full Changelog: v2.4...v2.4.1

Images

No new images at this release; see https://github.com/stackhpc/ansible-slurm-appliance/releases/tag/v2.4.

v2.4

02 Sep 13:09
21ef880

What's Changed

Full Changelog: v2.3...v2.4

Images

No new images are provided at this release.

v2.3

28 Aug 09:11
1335a07

Key Changes

Caution

This release will cause merge conflicts with existing site environments. See the description of PR #751 for steps to resolve them.

What's Changed

All PRs:

  • Add LICENSE file by @priteau in #725
  • Bump DOCA to 2.9.3 by @priteau in #728
  • Support clusters with no outbound internet by @sjpb in #717
  • Add s-nail package by @priteau in #713
  • CI: Use GitHub token for Packer workflows by @priteau in #729
  • Fix detection of CUDA package version by @priteau in #731
  • Update production.md docs by @sjpb in #730
  • Allow using a specific FIP for a build VM by @sjpb in #734
  • Fix convenience variables for proxy by @sjpb in #735
  • Add support for topology-aware scheduling by @wtripp180901 in #737
  • Bump openstack cli to get working FIP commands by @sjpb in #740
  • Support passing freeipa server cert on client enrolment by @sjpb in #739
  • Deprecate ofed role by @sjpb in #741
  • Allow adding additional dnf repos to rewrite by @sjpb in #744
  • Add metadata service to no_proxy defaults by @sjpb in #745
  • Make rebuilding slurm optional for cuda builds by @sjpb in #746
  • Address various issues with production docs by @sjpb in #747
  • Update appliance to Rocky 9.6 + Update Lustre to 2.15.7 by @wtripp180901 in #699
  • Move HPL source download to fatimage build by @bertiethorpe in #743
  • Bump openhpc role by @sjpb in #690
  • Move cookiecutter Tofu to new site environment by @wtripp180901 in #751
  • Upload CI images to Leafcloud S3 for image syncing by @bertiethorpe in #752
  • Don't set topology plugin if topology group not enabled by @sjpb in #754
  • Fixup additional_nodes inventory group by @sjpb in #749
  • Allow specifying security groups for individual login groups by @wtripp180901 in #758
  • Allow setting volume type for node group extra_volumes parameter by @sjpb in #755
  • Add Rstudio, VSCode, Matlab to OOD application catalogue by @bertiethorpe in #738
  • Allow setting config_drive by @wtripp180901 in #756
  • Add option for additional user data by @wtripp180901 in #757
  • Configure process tracking and accounting to use cgroups by @bertiethorpe in #762
  • Bump CUDA to 13.0 and NVIDIA driver to 580 by @priteau in #764
  • Fix parse error in pingmatrix output by @oneswig in #761

Full Changelog: v2.2...v2.3

Images

Two new images are available:

RockyLinux 8: openhpc-RL8-250808-1727-faa44755
RockyLinux 9: openhpc-RL9-250808-1727-faa44755

v2.2

01 Jul 10:13
4627526

What's Changed

  • Support additional nodegroups by @sjpb in #704
  • Make dnf repos available during site post-hook by @sjpb in #720
  • Restore ability to leave dnf repos enabled by @sjpb in #721
  • Bump OFED and DOCA to latest 24.10 LTS release by @priteau in #712
  • Fix cuda install when building without lustre first by @sjpb in #724

Full Changelog: v2.1...v2.2

Images

No new images are provided at this release.

v2.1

25 Jun 11:23
933dcf4

What's Changed

  • Pass templated fqdn to ansible by @sjpb in #702
  • Fix NHC downing nodes due to /efi being mounted twice by @sjpb in #719

Full Changelog: v2.0...v2.1

Images

There are no new images at this release.

v2.0

24 Jun 16:01
5f2f931

Key Changes

  • ⚠️ BACKWARDS-INCOMPATIBLE CHANGE ⚠️ To allow adding support for Multi-Instance GPUs in #656, the variable openhpc_slurm_partitions has been replaced by openhpc_nodegroups and openhpc_partitions. See #666 for full details and examples.
  • The tuned profile hpc-compute is now usable for nodes with hugepages enabled. See #672.
  • Support for LBNL's Node Health Checks added in #654. See ansible/roles/nhc/README.md.
  • Support for configuring Multi-Instance GPUs (MIG) added in #656 - see docs/mig.md for full details.
  • Changes to the stackhpc.openhpc role for the above also mean that parameters can be removed from slurm.conf. See stackhpc/ansible-role-openhpc#184.
  • NVIDIA (open) drivers were upgraded to version 575.57.08 with CUDA 12.9.1 in #703.
  • Lustre bug LU-18085 affecting Rocky Linux 9 was fixed in #688.
  • The appliance will now raise an error if Ansible Galaxy installs do not match requirements.yml, in #700.
  • OS packages were updated for both Rocky Linux 8 and 9 in #707, with the latter now having the last updates for Rocky Linux 9.5.

What's Changed

All PRs, oldest first:

  • Fix typo in comment by @priteau in #675
  • Update appliance for stackhpc.openhpc nodegroup/partition changes by @sjpb in #666
  • Bump CUDA to 12.9 and NVIDIA driver to 575 by @priteau in #687
  • Fix environment creation from skeleton by @priteau in #682
  • Make home volume creation optional by @sjpb in #673
  • Add a simple index in the docs README. by @MoteHue in #669
  • Make packer var image_disk_format functional by @sjpb in #694
  • Remove description of un-implemented dummy interface/default route by @sjpb in #689
  • Allow specifying instance root volume type by @sjpb in #693
  • Add fix for Lustre bug LU-18085 by @sjpb in #688
  • Fix docs for cacerts role to mention cacerts_cert_dir by @technowhizz in #696
  • Update operations.md typo by @technowhizz in #697
  • Change host definition for cacert play to allow builder by @technowhizz in #698
  • Add validation for OpenTofu compute and login variables by @sjpb in #674
  • Remove incorrect note re ondemand for demo deployment by @sjpb in #695
  • Fix nvidia build at open driver version 575.57.08 with cuda 12.9.1 by @jovial in #703
  • Fix tuned hpc-compute with hugepages and verify applied profile by @sjpb in #672
  • Support fixed IP addresses for nodes by @priteau in #643
  • Ensure Ansible Galaxy installs are up to date by @sjpb in #700
  • Bump all Pulp snapshots to latest versions in RL 8.x, RL 9.5 by @priteau in #707
  • Add support for Node Health Checks by @sjpb in #654
  • Add support for configuring Multi-Instance GPUs (MIG) by @jovial in #656
  • Bump fatimage after MIG PR656 merge by @sjpb in #716

Full Changelog: v1.161...test

Images

Two new images are available:

  • RL8: openhpc-RL8-250624-0854-75099868
  • RL9: openhpc-RL9-250624-0854-75099868

v1.161

15 May 16:17
980f108

What's Changed

Bumps Slurm versions to fix CVE-2025-43904:

  • Upgrade to OpenHPC/Slurm versions RL9=3.1.1/24.11.5 RL8=2.9.1/23.11.11 by @sjpb in #668
  • Perform Slurm database upgrade if necessary by @sjpb in #670
  • Automate image release by @sjpb in #671

Caution

This is a Slurm major version update for RockyLinux 9 (= OpenHPC v3) clusters.

These clusters will perform a Slurm database upgrade on slurmdbd startup. They will back up the entire state volume via a volume snapshot before performing the upgrade. See #670 and linked dependency PRs for full information.

Full Changelog: v1.160...v1.161

Images

Two new images are available:

  • RockyLinux 8: openhpc-RL8-250514-1502-5a923b2c
  • RockyLinux 9: openhpc-RL9-250514-1502-5a923b2c

v1.161-rc1

14 May 10:08
5a7608b
Pre-release

What's Changed

Bumps Slurm versions to fix CVE-2025-43904:

  • Upgrade to OpenHPC/Slurm versions RL9=3.1.1/24.11.5 RL8=2.9.1/23.11.11 by @sjpb in #668

Caution

This is a Slurm major version update for RockyLinux 9 (= OpenHPC v3) clusters.

These clusters will perform a Slurm database upgrade on slurmdbd startup. The startup timeout for that service has been increased to 45 minutes to allow for that. However, it is recommended that this database (in /var/lib/state/mysql on the control node) is backed up before starting slurmdbd, for example by snapshotting the $CLUSTER_NAME-state volume after the reimage (so the service is stopped) but before running the site.yml playbook.
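
As a rough illustration of the manual backup step suggested above, the following Python sketch builds the OpenStack CLI call that would snapshot the state volume. The function name and the snapshot name are hypothetical, and the use of `--force` assumes the volume is still attached to the control node when the snapshot is taken; neither detail comes from the release notes, so check them against your cloud before relying on this.

```python
import shlex


def state_snapshot_command(cluster_name: str, snapshot_name: str) -> list[str]:
    """Build an `openstack volume snapshot create` call for the state volume.

    The volume name follows the $CLUSTER_NAME-state convention from the
    release notes. `snapshot_name` is a hypothetical, user-chosen label.
    """
    return [
        "openstack", "volume", "snapshot", "create",
        "--volume", f"{cluster_name}-state",
        # Assumption: the volume is still attached (in-use), so the
        # snapshot has to be forced. Drop this flag if it is detached.
        "--force",
        snapshot_name,
    ]


if __name__ == "__main__":
    # Print the command for review rather than running it directly.
    cmd = state_snapshot_command("mycluster", "mycluster-state-pre-upgrade")
    print(shlex.join(cmd))
```

The printed command would be run after the control node has been reimaged but before running the site.yml playbook, so that slurmdbd is not writing to the database while the snapshot is taken.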

Full Changelog: v1.160...v1.161

Images

Two new images are available:

  • RockyLinux 8: openhpc-RL8-250513-1045-ca44f898
  • RockyLinux 9: openhpc-RL9-250513-1046-ca44f898

v1.160

09 May 13:10
01b5aa8

What's Changed

  • Allow enabling package installs for caas clusters via extravars by @sjpb in #667

Full Changelog: v1.159...v1.160

There are no new images for this release; see v1.159.