@gmarciani gmarciani commented Aug 7, 2025

This PR takes over #2998.
On top of it, I resolved conflicts and removed unused Berksfiles.

Description of changes

  • Upgrade Slurm to version 24.11.6 (from 24.05.8).
  • Upgrade EFA installer to 1.42.0 (from 1.41.0).
    • Efa-driver: efa-2.15.3-1
    • Efa-config: efa-config-1.18-1
    • Efa-profile: efa-profile-1.7-1
    • Libfabric-aws: libfabric-aws-2.1.0-3
    • Rdma-core: rdma-core-57.0-1
    • Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
  • Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
  • Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSes except AL2.
  • Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSes except AL2.
  • Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSes except AL2.
  • Upgrade Python to 3.12.11 (from 3.12.8) for all OSes except AL2.
  • Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).

Among the above upgrades, DCGM is a major version upgrade (from version 3 to version 4). DCGM 4 introduces the following packaging change:

Installation assets are no longer shipped in a single monolithic package. Instead, installation assets have been split among several packages, allowing clients to opt-out of the installation of assets not applicable to their use case.

  Component packages are as follows:

      datacenter-gpu-manager-4-core

              Provides nv-hostengine binary and other CUDA-agnostic installation assets available through the DCGM open source product

      datacenter-gpu-manager-4-cuda11

              Provides the CUDA11-specific binaries available through the DCGM open source product

      datacenter-gpu-manager-4-cuda12

              Provides the CUDA12-specific binaries available through the DCGM open source product

      datacenter-gpu-manager-4-proprietary

              Provides CUDA-agnostic installation assets not distributed as part of the DCGM open source product

      datacenter-gpu-manager-4-proprietary-cuda11

              Provides CUDA11 binaries not distributed as part of the DCGM open source product

      datacenter-gpu-manager-4-proprietary-cuda12

              Provides CUDA12 binaries not distributed as part of the DCGM open source product

      datacenter-gpu-manager-4-development

              Provides files necessary for the development of downstream software dependent on the DCGM library

For the ParallelCluster GPU health check use case, datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-cuda12 are the minimal set of packages we need to install. I verified this by running the GPU health check manually on a GPU instance: omitting either package causes errors.

https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html
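
To illustrate the package selection described above, here is a minimal sketch that maps a CUDA version to the smallest DCGM 4 package set. The helper `minimal_dcgm_packages` and the constants are hypothetical names introduced only for this example and are not part of the cookbook. Only the CUDA 12 combination was verified in this PR; the CUDA 11 entry is an assumption based on the package list in the DCGM changelog.

```python
# Hypothetical helper for illustration only; not the cookbook's implementation.
# Package names are taken from the DCGM 4 changelog quoted in this PR.
DCGM_CORE_PACKAGE = "datacenter-gpu-manager-4-core"

# Assumption: the CUDA-specific package is chosen by CUDA major version.
# Only the CUDA 12 pairing was verified with the GPU health check.
DCGM_CUDA_PACKAGES = {
    11: "datacenter-gpu-manager-4-cuda11",
    12: "datacenter-gpu-manager-4-cuda12",
}


def minimal_dcgm_packages(cuda_version: str) -> list[str]:
    """Return the core package plus the CUDA-specific package for cuda_version."""
    major = int(cuda_version.split(".")[0])
    if major not in DCGM_CUDA_PACKAGES:
        raise ValueError(f"No DCGM 4 package known for CUDA major version {major}")
    return [DCGM_CORE_PACKAGE, DCGM_CUDA_PACKAGES[major]]


if __name__ == "__main__":
    # CUDA Toolkit version installed by this PR.
    print(minimal_dcgm_packages("12.8.1"))
```

Keeping the core package separate from the CUDA-specific one mirrors the DCGM 4 packaging split, so the proprietary and development packages stay out of the image unless a use case needs them.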

Tests

  • Build image has been tested on all OSes except Rocky. We will test Rocky after the PR is merged.

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Hanwen <[email protected]>

@gmarciani gmarciani requested review from a team as code owners August 7, 2025 14:42
@gmarciani gmarciani enabled auto-merge (rebase) August 7, 2025 14:43
@gmarciani gmarciani mentioned this pull request Aug 7, 2025

@hgreebe hgreebe left a comment

Why are we adding Berksfile.lock files? I thought we removed Berkshelf: #2989

@gmarciani
Contributor Author

Why are we adding Berksfile.lock files? I thought we removed Berkshelf: #2989

Good point, I think they are leftovers. Validating...

@gmarciani
Contributor Author

Green light to remove Berksfile.lock: build image succeeded on AL2 even after removing it.

@gmarciani gmarciani merged commit aecff90 into aws:develop Aug 8, 2025
28 of 30 checks passed
@gmarciani gmarciani deleted the wip/mgiacomo/3140upgrade-dependencies-0807-1 branch August 8, 2025 14:04

3 participants