Commit 946fdc7

hanwen-cluster authored and gmarciani committed
Upgrade dependencies
- Upgrade Slurm to version 24.11.6 (from 24.05.8).
- Upgrade EFA installer to 1.42.0 (from 1.41.0).
  - Efa-driver: efa-2.15.3-1
  - Efa-config: efa-config-1.18-1
  - Efa-profile: efa-profile-1.7-1
  - Libfabric-aws: libfabric-aws-2.1.0-3
  - Rdma-core: rdma-core-57.0-1
  - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
- Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
- Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
- Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
- Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
- Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
- Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).

Among the above upgrades, DCGM is a major version upgrade (from version 3 to version 4).

This is a new change in DCGM 4:
```
Installation assets are no longer shipped in a single monolithic package. Instead, installation assets have been split among several packages, allowing clients to opt-out of the installation of assets not applicable to their use case.

Component packages are as follows:

    datacenter-gpu-manager-4-core
        Provides nv-hostengine binary and other CUDA-agnostic installation assets available through the DCGM open source product
    datacenter-gpu-manager-4-cuda11
        Provides the CUDA11-specific binaries available through the DCGM open source product
    datacenter-gpu-manager-4-cuda12
        Provides the CUDA12-specific binaries available through the DCGM open source product
    datacenter-gpu-manager-4-proprietary
        Provides CUDA-agnostic installation assets not distributed as part of the DCGM open source product
    datacenter-gpu-manager-4-proprietary-cuda11
        Provides CUDA11 binaries not distributed as part of the DCGM open source product
    datacenter-gpu-manager-4-proprietary-cuda12
        Provides CUDA12 binaries not distributed as part of the DCGM open source product
    datacenter-gpu-manager-4-development
        Provides files necessary for the development of downstream software dependent on the DCGM library
```
https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html

Signed-off-by: Hanwen <[email protected]>
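Because DCGM 4 is split into component packages, an installer has to choose a set of package names instead of a single one. The following is a minimal sketch of that selection, assuming the package names quoted above; the helper `dcgm4_packages` and its keyword arguments are hypothetical and are not part of this commit.

```ruby
# Hypothetical helper (not from the cookbook): builds the list of DCGM 4
# component packages for a given CUDA major version, using the package
# names quoted in the DCGM changelog excerpt above.
DCGM4_PREFIX = 'datacenter-gpu-manager-4'

def dcgm4_packages(cuda_major:, proprietary: false, development: false)
  # The excerpt lists only CUDA 11 and CUDA 12 specific packages.
  raise ArgumentError, "unsupported CUDA major: #{cuda_major}" unless [11, 12].include?(cuda_major)

  packages = ["#{DCGM4_PREFIX}-core", "#{DCGM4_PREFIX}-cuda#{cuda_major}"]
  if proprietary
    packages << "#{DCGM4_PREFIX}-proprietary"
    packages << "#{DCGM4_PREFIX}-proprietary-cuda#{cuda_major}"
  end
  packages << "#{DCGM4_PREFIX}-development" if development
  packages
end

puts dcgm4_packages(cuda_major: 12)
```

Which of these packages the cookbook actually installs is not shown in this part of the diff, so the defaults above are only an illustration.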
1 parent 0993d8c commit 946fdc7

22 files changed: +322 -71 lines changed

CHANGELOG.md

Lines changed: 14 additions & 1 deletion
@@ -11,7 +11,20 @@ This file is used to list changes made in each version of the AWS ParallelCluste

 **CHANGES**
 - Ubuntu 20.04 is no longer supported.
-- Upgrade Slurm to version 24.11.5.
+- Upgrade Slurm to version 24.11.6 (from 24.05.8).
+- Upgrade EFA installer to 1.42.0 (from 1.41.0).
+  - Efa-driver: efa-2.15.3-1
+  - Efa-config: efa-config-1.18-1
+  - Efa-profile: efa-profile-1.7-1
+  - Libfabric-aws: libfabric-aws-2.1.0-3
+  - Rdma-core: rdma-core-57.0-1
+  - Open MPI: openmpi40-aws-4.1.7-2 and openmpi50-aws-5.0.6-11
+- Upgrade Cinc Client to version to 18.4.12 from 18.2.7.
+- Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
+- Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
+- Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
+- Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
+- Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).
 - Addressed cluster id mismatch known issue by deleting the file `/var/spool/slurm.state/clustername` before configuring Slurm accounting.
 - Upgrade DCV to version 2024.0-19030.
 - Remove `berkshelf`. All cookbooks are local and do not need `berkshelf` dependency management.

cookbooks/aws-parallelcluster-awsbatch/test/controls/awsbatch_virtualenv_spec.rb

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 pyenv_dir = "#{base_dir}/pyenv"

 control 'tag:install_awsbatch_virtualenv_created' do
-  python_version = os_properties.alinux2? ? '3.9.20' : '3.12.8'
+  python_version = os_properties.alinux2? ? '3.9.20' : '3.12.11'
   title "awsbatch virtualenv should be created on #{python_version}"
   only_if { !os_properties.redhat? }
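The control above picks the expected Python version per OS with a ternary on `os_properties.alinux2?`. A minimal standalone sketch of that selection follows; the `OsProperties` struct here is a hypothetical stand-in for the InSpec helper used by the test suite, and only the two version strings come from the diff.

```ruby
# Hypothetical stand-in for the os_properties helper used in the controls.
OsProperties = Struct.new(:platform, :platform_version) do
  def alinux2?
    platform == 'amazon' && platform_version == '2'
  end
end

# Mirrors the ternary in the control: AL2 stays on 3.9.20, every other OS
# is expected to use the upgraded 3.12.11.
def expected_python_version(os_properties)
  os_properties.alinux2? ? '3.9.20' : '3.12.11'
end

puts expected_python_version(OsProperties.new('amazon', '2'))
puts expected_python_version(OsProperties.new('ubuntu', '22.04'))
```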

cookbooks/aws-parallelcluster-computefleet/kitchen.computefleet-config.yml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ suites:
     attributes:
       cluster:
         custom_node_package: https://github.com/aws/aws-parallelcluster-node/archive/develop.tar.gz
-        python-version: 3.12.8
+        python-version: 3.12.11
         node_virtualenv_path: /opt/parallelcluster/pyenv/versions/node_virtualenv
   - name: fleet_status
     run_list:
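The `python-version` value in the kitchen file is written without quotes. As a quick check, assuming Ruby's bundled YAML parser (Psych) and a hypothetical, simplified fragment of the suite definition, a three-component version such as `3.12.11` loads as a String, whereas a two-component value such as `3.12` would load as a Float.

```ruby
require 'yaml'

# Simplified, hypothetical fragment; the suite name and nesting are
# assumptions, only the python-version value comes from the diff.
fragment = <<~YAML
  suites:
    - name: example_suite
      attributes:
        cluster:
          python-version: 3.12.11
YAML

config = YAML.safe_load(fragment)
version = config['suites'].first['attributes']['cluster']['python-version']

puts version
puts version.class
puts YAML.safe_load('3.12').class
```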
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+DEPENDENCIES
+  aws-parallelcluster-awsbatch
+    path: ../aws-parallelcluster-awsbatch
+  aws-parallelcluster-computefleet
+    path: ../aws-parallelcluster-computefleet
+  aws-parallelcluster-entrypoints
+    path: .
+    metadata: true
+  aws-parallelcluster-environment
+    path: ../aws-parallelcluster-environment
+  aws-parallelcluster-platform
+    path: ../aws-parallelcluster-platform
+  aws-parallelcluster-shared
+    path: ../aws-parallelcluster-shared
+  aws-parallelcluster-slurm
+    path: ../aws-parallelcluster-slurm
+  aws-parallelcluster-tests
+    path: ../aws-parallelcluster-tests
+  iptables
+    path: ../third-party/iptables-8.0.0
+  line
+    path: ../third-party/line-4.5.21
+  nfs
+    path: ../third-party/nfs-5.1.5
+  openssh
+    path: ../third-party/openssh-2.11.14
+  yum
+    path: ../third-party/yum-7.4.20
+  yum-epel
+    path: ../third-party/yum-epel-5.0.8
+
+GRAPH
+  aws-parallelcluster-awsbatch (3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    iptables (~> 8.0.0)
+    line (~> 4.5.21)
+    nfs (~> 5.1.5)
+    openssh (~> 2.11.14)
+    yum (~> 7.4.20)
+    yum-epel (~> 5.0.8)
+  aws-parallelcluster-computefleet (3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+  aws-parallelcluster-entrypoints (3.13.0)
+    aws-parallelcluster-awsbatch (~> 3.13.0)
+    aws-parallelcluster-computefleet (~> 3.13.0)
+    aws-parallelcluster-environment (~> 3.13.0)
+    aws-parallelcluster-platform (~> 3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    aws-parallelcluster-slurm (~> 3.13.0)
+  aws-parallelcluster-environment (3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    line (~> 4.5.21)
+    nfs (~> 5.1.5)
+  aws-parallelcluster-platform (3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    line (~> 4.5.21)
+  aws-parallelcluster-shared (3.13.0)
+    yum (~> 7.4.20)
+    yum-epel (~> 5.0.8)
+  aws-parallelcluster-slurm (3.13.0)
+    aws-parallelcluster-computefleet (~> 3.13.0)
+    aws-parallelcluster-environment (~> 3.13.0)
+    aws-parallelcluster-platform (~> 3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    iptables (~> 8.0.0)
+    line (~> 4.5.21)
+    nfs (~> 5.1.5)
+    openssh (~> 2.11.14)
+    yum (~> 7.4.20)
+    yum-epel (~> 5.0.8)
+  aws-parallelcluster-tests (3.13.0)
+    aws-parallelcluster-computefleet (~> 3.13.0)
+    aws-parallelcluster-environment (~> 3.13.0)
+    aws-parallelcluster-platform (~> 3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    aws-parallelcluster-slurm (~> 3.13.0)
+  iptables (8.0.0)
+  line (4.5.21)
+  nfs (5.1.5)
+    line (>= 0.0.0)
+  openssh (2.11.14)
+    iptables (>= 7.0)
+  yum (7.4.20)
+  yum-epel (5.0.8)
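The GRAPH section above uses pessimistic constraints such as `~> 3.13.0` alongside plain lower bounds such as `>= 7.0`. The sketch below illustrates what those operators accept. It uses RubyGems' `Gem::Requirement` only because it implements the same operator semantics; Berkshelf resolves constraints with its own solver, and the helper name `satisfies?` is hypothetical.

```ruby
require 'rubygems'

# Hypothetical helper: checks a version string against a constraint string
# using RubyGems' requirement semantics.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# "~> 3.13.0" accepts any 3.13.x release but not 3.14.0.
puts satisfies?('~> 3.13.0', '3.13.0')
puts satisfies?('~> 3.13.0', '3.13.5')
puts satisfies?('~> 3.13.0', '3.14.0')

# openssh requires iptables ">= 7.0", which the locked 8.0.0 meets.
puts satisfies?('>= 7.0', '8.0.0')
```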
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+DEPENDENCIES
+  aws-parallelcluster-environment
+    path: .
+    metadata: true
+  aws-parallelcluster-shared
+    path: ../aws-parallelcluster-shared
+  line
+    path: ../third-party/line-4.5.21
+  nfs
+    path: ../third-party/nfs-5.1.5
+  yum
+    path: ../third-party/yum-7.4.20
+  yum-epel
+    path: ../third-party/yum-epel-5.0.8
+
+GRAPH
+  aws-parallelcluster-environment (3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    line (~> 4.5.21)
+    nfs (~> 5.1.5)
+  aws-parallelcluster-shared (3.13.0)
+    yum (~> 7.4.20)
+    yum-epel (~> 5.0.8)
+  line (4.5.21)
+  nfs (5.1.5)
+    line (>= 0.0.0)
+  yum (7.4.20)
+  yum-epel (5.0.8)

cookbooks/aws-parallelcluster-environment/attributes/environment.rb

Lines changed: 2 additions & 2 deletions
@@ -70,8 +70,8 @@

 default['cluster']['head_node_private_ip'] = nil

-default['cluster']['efa']['version'] = '1.41.0'
-default['cluster']['efa']['sha256'] = '3506354cdfbe31ff552fe75f5d0d9bb7efd29cf79bd99457347d29c751c38f9f'
+default['cluster']['efa']['version'] = '1.42.0'
+default['cluster']['efa']['sha256'] = '4114fe612905ee05083ae5cb391a00a012510f3abfecc642d86c9a5ae4be9008'

 default['cluster']['efs']['version'] = '2.3.1'
 default['cluster']['efs']['sha256'] = 'ced12f82e76f9740476b63f30c49bd76cc00b6375e12a9f5f7ba852635c49e15'
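The `sha256` attribute is bumped together with the version because the pinned digest has to match the downloaded installer archive. A minimal sketch of such a check in plain Ruby follows; the helper `checksum_matches?` is hypothetical, and the temporary file is a stand-in payload, not the real EFA installer.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper (not from the cookbook): compares the SHA-256 digest
# of a local file with a pinned hex string.
def checksum_matches?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256.downcase
end

Tempfile.create('installer-example') do |file|
  # Stand-in payload; a real check would point at the downloaded archive.
  file.write('example payload')
  file.flush

  pinned = Digest::SHA256.hexdigest('example payload')
  puts checksum_matches?(file.path, pinned)

  # The digest pinned for EFA 1.42.0 does not match the stand-in payload.
  efa_sha256 = '4114fe612905ee05083ae5cb391a00a012510f3abfecc642d86c9a5ae4be9008'
  puts checksum_matches?(file.path, efa_sha256)
end
```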

cookbooks/aws-parallelcluster-environment/spec/unit/resources/efa_spec.rb

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@

 # parallelcluster default source dir defined in attributes
 source_dir = '/opt/parallelcluster/sources'
-efa_version = '1.41.0'
-efa_checksum = '3506354cdfbe31ff552fe75f5d0d9bb7efd29cf79bd99457347d29c751c38f9f'
+efa_version = '1.42.0'
+efa_checksum = '4114fe612905ee05083ae5cb391a00a012510f3abfecc642d86c9a5ae4be9008'

 class ConvergeEfa
   def self.setup(chef_run, efa_version: nil, efa_checksum: nil)

cookbooks/aws-parallelcluster-environment/test/controls/cfn_bootstrap_spec.rb

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 pyenv_dir = "#{base_dir}/pyenv"

 control 'tag:install_cfnbootstrap_virtualenv_created' do
-  cfn_python_version = os_properties.alinux2? ? '3.9.20' : '3.12.8'
+  cfn_python_version = os_properties.alinux2? ? '3.9.20' : '3.12.11'
   title "cfnbootstrap virtualenv should be created on #{cfn_python_version}"
   only_if { !os_properties.redhat_on_docker? }

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+DEPENDENCIES
+  aws-parallelcluster-platform
+    path: .
+    metadata: true
+  aws-parallelcluster-shared
+    path: ../aws-parallelcluster-shared
+  line
+    path: ../third-party/line-4.5.21
+  yum
+    path: ../third-party/yum-7.4.20
+  yum-epel
+    path: ../third-party/yum-epel-5.0.8
+
+GRAPH
+  aws-parallelcluster-platform (3.13.0)
+    aws-parallelcluster-shared (~> 3.13.0)
+    line (~> 4.5.21)
+  aws-parallelcluster-shared (3.13.0)
+    yum (~> 7.4.20)
+    yum-epel (~> 5.0.8)
+  line (4.5.21)
+  yum (7.4.20)
+  yum-epel (5.0.8)

cookbooks/aws-parallelcluster-platform/attributes/platform.rb

Lines changed: 2 additions & 1 deletion
@@ -17,9 +17,10 @@
 # NVidia
 default['cluster']['nvidia']['enabled'] = 'no'
 default['cluster']['nvidia']['driver_version'] = '570.172.08'
-default['cluster']['nvidia']['dcgm_version'] = '3.3.6'
+default['cluster']['nvidia']['dcgm_version'] = '4.2.3-2'
 if platform?('amazon') && node['platform_version'] == "2"
   default['cluster']['nvidia']['driver_version'] = '550.127.08'
+  default['cluster']['nvidia']['dcgm_version'] = '3.3.6-1'
 end

 # nvidia-imex
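The hunk above sets the new DCGM default first and then overrides it inside the Amazon Linux 2 branch, so AL2 keeps DCGM 3 while every other OS moves to DCGM 4. A minimal sketch of that default-then-override pattern, using a plain Hash in place of Chef node attributes (the function `nvidia_defaults` is hypothetical):

```ruby
# Plain-Ruby stand-in for the attribute file: a Hash replaces Chef's
# default[...] attributes, and the platform is passed in explicitly.
def nvidia_defaults(platform:, platform_version:)
  attrs = {
    'driver_version' => '570.172.08',
    'dcgm_version' => '4.2.3-2',
  }

  # Amazon Linux 2 keeps the older driver and stays on DCGM 3.
  if platform == 'amazon' && platform_version == '2'
    attrs['driver_version'] = '550.127.08'
    attrs['dcgm_version'] = '3.3.6-1'
  end

  attrs
end

puts nvidia_defaults(platform: 'amazon', platform_version: '2')['dcgm_version']
puts nvidia_defaults(platform: 'amazon', platform_version: '2023')['dcgm_version']
```

Without the added override line, AL2 would have inherited the `4.2.3-2` default, which is why the diff adds it next to the existing driver override.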
