Commit 670438d

Merge branch 'develop' into develop
2 parents 52ffd15 + 7d3ed57 commit 670438d

29 files changed: +622 -57 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -8,6 +8,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 
 **ENHANCEMENTS**
 - Add support for P6e-GB200 instances. ParallelCluster sets up Slurm topology plugin to handle P6e-GB200 UltraServers. See limitations section for important additional setup requirements.
+- Add support for P6-B200 instances for all OSs except AL2.
 - Add `build-image` support for Amazon Linux 2023 AMIs based on kernel 6.12 (in addition to 6.1).
 
 **LIMITATIONS**
@@ -29,7 +30,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Upgrade Cinc Client to version 18.4.12 (from 18.2.7).
 - Upgrade NVIDIA driver to version 570.172.08 (from 570.86.15) for all OSs except AL2.
 - Upgrade CUDA Toolkit to version 12.8.1 (from 12.8.0) for all OSs except AL2.
-- Upgrade DCGM to version 4.2.3 (from 3.3.6) for all OSs except AL2.
+- Upgrade DCGM to version 4.4.1 (from 3.3.6) for all OSs except AL2.
 - Upgrade Python to 3.12.11 (from 3.12.8) for all OSs except AL2.
 - Upgrade Python to 3.9.23 (from 3.9.20) for AL2.
 - Upgrade Intel MPI Library to 2021.16.0 (from 2021.13.1).

cookbooks/aws-parallelcluster-awsbatch/metadata.rb

Lines changed: 2 additions & 2 deletions
@@ -7,12 +7,12 @@
 issues_url 'https://github.com/aws/aws-parallelcluster/issues'
 source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
 chef_version '>= 18'
-version '3.14.0'
+version '3.15.0'
 
 depends 'iptables', '~> 8.0.0'
 depends 'nfs', '~> 5.1.5'
 depends 'line', '~> 4.5.21'
 depends 'openssh', '~> 2.11.14'
 depends 'yum', '~> 7.4.20'
 depends 'yum-epel', '~> 5.0.8'
-depends 'aws-parallelcluster-shared', '~> 3.14.0'
+depends 'aws-parallelcluster-shared', '~> 3.15.0'

cookbooks/aws-parallelcluster-computefleet/metadata.rb

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,6 @@
77
issues_url 'https://github.com/aws/aws-parallelcluster-cookbook/issues'
88
source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
99
chef_version '>= 18'
10-
version '3.14.0'
10+
version '3.15.0'
1111

12-
depends 'aws-parallelcluster-shared', '~> 3.14.0'
12+
depends 'aws-parallelcluster-shared', '~> 3.15.0'

cookbooks/aws-parallelcluster-entrypoints/metadata.rb

Lines changed: 7 additions & 7 deletions
@@ -7,11 +7,11 @@
 issues_url 'https://github.com/aws/aws-parallelcluster-cookbook/issues'
 source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
 chef_version '>= 18'
-version '3.14.0'
+version '3.15.0'
 
-depends 'aws-parallelcluster-shared', '~> 3.14.0'
-depends 'aws-parallelcluster-platform', '~> 3.14.0'
-depends 'aws-parallelcluster-environment', '~> 3.14.0'
-depends 'aws-parallelcluster-computefleet', '~> 3.14.0'
-depends 'aws-parallelcluster-slurm', '~> 3.14.0'
-depends 'aws-parallelcluster-awsbatch', '~> 3.14.0'
+depends 'aws-parallelcluster-shared', '~> 3.15.0'
+depends 'aws-parallelcluster-platform', '~> 3.15.0'
+depends 'aws-parallelcluster-environment', '~> 3.15.0'
+depends 'aws-parallelcluster-computefleet', '~> 3.15.0'
+depends 'aws-parallelcluster-slurm', '~> 3.15.0'
+depends 'aws-parallelcluster-awsbatch', '~> 3.15.0'
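
Note on the `~> 3.15.0` constraints bumped above: `~>` is the pessimistic version operator, so `'~> 3.15.0'` should accept any 3.15.x release but not 3.16.0. The sketch below illustrates this with RubyGems' `Gem::Requirement`. Chef resolves cookbook constraints with its own classes, so treat this as an approximation of the semantics rather than the code path Chef uses. The helper name `satisfies?` is hypothetical.

```ruby
require 'rubygems'

# Hypothetical helper: true when `version` satisfies `constraint`,
# evaluated with RubyGems' pessimistic-operator rules.
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

constraint = '~> 3.15.0'
%w[3.14.0 3.15.0 3.15.7 3.16.0].each do |candidate|
  puts "#{candidate}: #{satisfies?(constraint, candidate)}"
end
```

Because every cookbook pins its siblings this way, the `version` line and all `depends 'aws-parallelcluster-*'` lines have to move together in a release bump, which is what this commit does across the metadata files.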

cookbooks/aws-parallelcluster-environment/metadata.rb

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,9 +7,9 @@
77
issues_url 'https://github.com/aws/aws-parallelcluster-cookbook/issues'
88
source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
99
chef_version '>= 18'
10-
version '3.14.0'
10+
version '3.15.0'
1111

1212
depends 'line', '~> 4.5.21'
1313
depends 'nfs', '~> 5.1.5'
1414

15-
depends 'aws-parallelcluster-shared', '~> 3.14.0'
15+
depends 'aws-parallelcluster-shared', '~> 3.15.0'

cookbooks/aws-parallelcluster-platform/attributes/platform.rb

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@
1717
# NVidia
1818
default['cluster']['nvidia']['enabled'] = 'no'
1919
default['cluster']['nvidia']['driver_version'] = '570.172.08'
20-
default['cluster']['nvidia']['dcgm_version'] = '4.2.3-2'
20+
default['cluster']['nvidia']['dcgm_version'] = '4.4.1-1'
2121
if platform?('amazon') && node['platform_version'] == "2"
2222
default['cluster']['nvidia']['driver_version'] = '550.127.08'
2323
default['cluster']['nvidia']['dcgm_version'] = '3.3.6-1'
@@ -27,6 +27,9 @@
2727
default['cluster']['nvidia']['imex']['shared_dir'] = "#{node['cluster']['shared_dir']}/nvidia-imex"
2828
default['cluster']['nvidia']['imex']['force_configuration'] = false
2929

30+
# NVIDIA NVLSM
31+
default['cluster']['nvidia']['nvlsm']['enabled'] = true
32+
3033
# DCV
3134
default['cluster']['dcv']['authenticator']['user'] = "dcvextauth"
3235
default['cluster']['dcv']['authenticator']['user_id'] = node['cluster']['reserved_base_uid'] + 3
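
Note on the version defaults in the first hunk: the attributes set a general default and then override it for Amazon Linux 2, which is why the changelog says "for all OSs except AL2". A minimal plain-Ruby sketch of that selection follows, using only the version strings visible in this diff. The function name `nvidia_versions` and its arguments are hypothetical stand-ins for Chef's `platform?` helper and `node['platform_version']`; the real attribute file may set further AL2 overrides that the hunk does not show.

```ruby
# Hypothetical helper mirroring the default/override pattern in the attribute file.
# Values are copied from the diff; nothing else is assumed about the cookbook.
def nvidia_versions(platform, platform_version)
  versions = { driver_version: '570.172.08', dcgm_version: '4.4.1-1' }
  if platform == 'amazon' && platform_version == '2'
    # AL2 stays on the older driver and DCGM packages.
    versions = { driver_version: '550.127.08', dcgm_version: '3.3.6-1' }
  end
  versions
end

puts nvidia_versions('amazon', '2').inspect
puts nvidia_versions('ubuntu', '22.04').inspect
```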
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+ib_umad

cookbooks/aws-parallelcluster-platform/libraries/nvidia.rb

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,11 +27,12 @@ def get_nvswitch_count(device_id)
2727
end
2828

2929
def get_device_ids
30-
# A100 (P4), H100(P5), B200(P6) and GB200()p6e) systems have NVSwitches
30+
# A100 (P4), H100(P5), B200(P6) and GB200(p6e) systems have NVSwitches
3131
# NVSwitch device id is 10de:1af1 for P4 instance
3232
# NVSwitch device id is 10de:22a3 for P5 instance
33+
# NVSwitch device id is 10de:2901 for P6 instance
3334
# NVSwitch device id is 10de:2941 for P6e instance
34-
{ 'a100' => '10de:1af1', 'h100' => '10de:22a3', 'gb200' => '10de:2941' }
35+
{ 'a100' => '10de:1af1', 'h100' => '10de:22a3', 'b200' => '10de:2901', 'gb200' => '10de:2941' }
3536
end
3637

3738
def is_gb200_node?
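
Note on the device-ID map: adding the `'b200'` entry is what lets a P6 NVSwitch be recognised by its PCI vendor:device ID. The sketch below shows one way such a map could be consumed, assuming the caller already has a list of PCI ID strings. It does not reproduce the cookbook's `get_nvswitch_count`, whose body is not shown in this diff. The names `NVSWITCH_DEVICE_IDS` and `detect_nvswitch` and the sample ID list are hypothetical.

```ruby
# Map copied from the updated get_device_ids in the diff above.
NVSWITCH_DEVICE_IDS = {
  'a100' => '10de:1af1',
  'h100' => '10de:22a3',
  'b200' => '10de:2901',
  'gb200' => '10de:2941',
}.freeze

# Hypothetical helper: given PCI "vendor:device" strings found on a node,
# return [gpu_family, nvswitch_count] for the first matching family, or nil.
def detect_nvswitch(pci_ids)
  NVSWITCH_DEVICE_IDS.each do |family, device_id|
    count = pci_ids.count(device_id)
    return [family, count] if count.positive?
  end
  nil
end

# Made-up input for illustration only, not captured from a real instance.
sample_ids = %w[10de:2901 10de:2901 ffff:0000]
p detect_nvswitch(sample_ids)
```

Before this commit the map had no `'b200'` key, so a lookup like this would have returned `nil` for a node exposing only `10de:2901` devices.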

cookbooks/aws-parallelcluster-platform/metadata.rb

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -7,8 +7,8 @@
77
issues_url 'https://github.com/aws/aws-parallelcluster-cookbook/issues'
88
source_url 'https://github.com/aws/aws-parallelcluster-cookbook'
99
chef_version '>= 18'
10-
version '3.14.0'
10+
version '3.15.0'
1111

1212
depends 'line', '~> 4.5.21'
1313

14-
depends 'aws-parallelcluster-shared', '~> 3.14.0'
14+
depends 'aws-parallelcluster-shared', '~> 3.15.0'

cookbooks/aws-parallelcluster-platform/recipes/install/nvidia_install.rb

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,8 @@
2121

2222
gdrcopy 'Install Nvidia gdrcopy'
2323

24+
nvidia_nvlsm 'Install Nvidia NVLink Subnet Manager'
25+
2426
fabric_manager 'Install Nvidia Fabric Manager'
2527

2628
nvidia_dcgm 'install Nvidia datacenter-gpu-manager'
