Upgrade dependencies #3000
- Upgrade Slurm to version 24.11.6 (from 24.05.8).
- Upgrade EFA installer to 1.42.0 (from 1.41.0).
  - Efa-driver: efa-2.15.3-1
  - Efa-config: efa-config-1.18-1
  - Efa-profile: efa-profile-1.7-1
  - Libfabric-aws: libfabric-aws-2.1.0-3
  - Rdma-core: rdma-core-57.0-1
  - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
- Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
- Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
- Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
- Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
- Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
- Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).

Among the above upgrades, DCGM is a major version upgrade (from version 3 to version 4). This is a new change in DCGM 4:

```
Installation assets are no longer shipped in a single monolithic package. Instead, installation assets have been split among several packages, allowing clients to opt-out of the installation of assets not applicable to their use case.

Component packages are as follows:

datacenter-gpu-manager-4-core
    Provides nv-hostengine binary and other CUDA-agnostic installation assets available through the DCGM open source product
datacenter-gpu-manager-4-cuda11
    Provides the CUDA11-specific binaries available through the DCGM open source product
datacenter-gpu-manager-4-cuda12
    Provides the CUDA12-specific binaries available through the DCGM open source product
datacenter-gpu-manager-4-proprietary
    Provides CUDA-agnostic installation assets not distributed as part of the DCGM open source product
datacenter-gpu-manager-4-proprietary-cuda11
    Provides CUDA11 binaries not distributed as part of the DCGM open source product
datacenter-gpu-manager-4-proprietary-cuda12
    Provides CUDA12 binaries not distributed as part of the DCGM open source product
datacenter-gpu-manager-4-development
    Provides files necessary for the development of downstream software dependent on the DCGM library
```

https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html

Signed-off-by: Hanwen <[email protected]>
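As a quick sanity check on the claim that DCGM is the major version upgrade in this list, the version pairs can be compared by their leading component. This is only an illustrative sketch: the `UPGRADES` table and the helper names are hypothetical, and "major" is assumed to mean "the first dot-separated number changed". That assumption is naive for Slurm, whose year.month scheme means a 24.05 to 24.11 move is a significant release even though the leading number stays the same.

```python
# Version pairs copied from the upgrade list above: name -> (old, new).
UPGRADES = {
    "Slurm": ("24.05.8", "24.11.6"),
    "EFA installer": ("1.41.0", "1.42.0"),
    "Cinc Client": ("18.2.7", "18.4.12"),
    "NVIDIA driver": ("570.86.15", "570.172.08"),
    "CUDA Toolkit": ("12.8.0", "12.8.1"),
    "DCGM": ("3.3.6", "4.2.3"),
    "Python": ("3.12.8", "3.12.11"),
    "Intel MPI Library": ("2021.13.1", "2021.16.0"),
}


def leading_component(version: str) -> int:
    """Return the first dot-separated number of a version string."""
    return int(version.split(".")[0])


def major_bumps(upgrades: dict) -> list:
    """Names whose leading version component differs between old and new.

    Assumption: a changed leading component is what counts as "major".
    """
    return [
        name
        for name, (old, new) in upgrades.items()
        if leading_component(old) != leading_component(new)
    ]


print(major_bumps(UPGRADES))  # ['DCGM']
```

Under that assumption, DCGM is the only entry flagged, which matches the note in the commit message.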
hgreebe
left a comment
Why are we adding Berksfile.lock files? I thought we removed Berkshelf in #2989.
Good point, I think they are left over; validating...
Green light to remove Berksfile.lock: the image build succeeded on AL2 even after removing Berksfile.lock.
This PR has been taken over from #2998.
On top of it, I resolved conflicts and removed unused Berksfiles.
Description of changes
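Among the upgrades above, DCGM is a major version upgrade (from version 3 to version 4). A new change in DCGM 4 is that installation assets are no longer shipped in a single monolithic package; they are split among several component packages.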
For the ParallelCluster GPU health check use case, I verified that datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-cuda12 are the minimal set of packages we need to install. I verified this by running the GPU health check manually on a GPU instance. Omitting either package causes errors.
https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html
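The package selection described above could be expressed as a small lookup from the CUDA major version to the DCGM 4 packages to install. This is a hedged sketch, not the cookbook's actual logic: `minimal_dcgm_packages` is a hypothetical helper, the package names come from the DCGM 4 changelog, and only the CUDA 12 combination was verified in this PR. The CUDA 11 mapping is inferred from the package descriptions and is untested here.

```python
# Package names as listed in the DCGM 4 changelog.
DCGM_CORE = "datacenter-gpu-manager-4-core"
DCGM_CUDA_PACKAGES = {
    11: "datacenter-gpu-manager-4-cuda11",  # assumption: not verified in this PR
    12: "datacenter-gpu-manager-4-cuda12",  # verified with the GPU health check
}


def minimal_dcgm_packages(cuda_version: str) -> list[str]:
    """Return the core package plus the package matching the CUDA major version.

    Hypothetical helper for illustration; raises ValueError when no
    CUDA-specific DCGM 4 package is known for the given version.
    """
    cuda_major = int(cuda_version.split(".")[0])
    if cuda_major not in DCGM_CUDA_PACKAGES:
        raise ValueError(f"No DCGM 4 package known for CUDA {cuda_version}")
    return [DCGM_CORE, DCGM_CUDA_PACKAGES[cuda_major]]


print(minimal_dcgm_packages("12.8.1"))
```

Keeping the core package separate from the CUDA-specific one mirrors how DCGM 4 splits its assets, so the proprietary and development packages are left out unless a use case needs them.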
Tests
References
Checklist
develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.