Commit 9a5d856

hanwen-cluster authored and demartinofra committed
Update EFA installer to 1.14.0
Starting from EFA 1.14.0, GDR support is enabled by default. Therefore, this commit also removes the logic that reinstalled EFA when GDR was enabled.

In the kitchen test, the GDR check now runs only on officially built AMIs. This does not reduce test coverage, because the check previously had `efa_gdr_enabled?` as part of its execution condition. `efa_gdr_enabled?` reads `node['cfncluster']['enable_efa_gdr']` from dna.json, but .kitchen.yml passed in a parameter with the wrong name, `efa_gdr_enabled`, so the check was never executed.

Signed-off-by: Hanwen <[email protected]>
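The naming mismatch described in the commit message can be illustrated with a minimal plain-Ruby sketch. It is not the cookbook code: `node` is an ordinary Hash standing in for Chef node attributes, `graphic_instance` is a hypothetical keyword argument replacing the cookbook's `graphic_instance?` helper, and the `ComputeFleet` node type is assumed for the test instance.

```ruby
# Plain-Ruby stand-in for the removed helper. `node` is a Hash, not a Chef
# node object, and `graphic_instance` replaces the graphic_instance? helper
# (both are assumptions made for illustration).
def efa_gdr_enabled?(node, graphic_instance: true)
  config_value = node['cfncluster']['enable_efa_gdr']
  enabling_value = node['cfncluster']['cfn_node_type'] == 'ComputeFleet' ? 'compute' : 'master'
  (config_value == enabling_value || config_value == 'cluster') && graphic_instance
end

# Attributes as the old .kitchen.yml supplied them: the misnamed key is set,
# so the key the helper reads keeps its cookbook default of "no".
misnamed = {
  'cfncluster' => {
    'cfn_node_type' => 'ComputeFleet',
    'enable_efa_gdr' => 'no',
    'efa_gdr_enabled' => 'compute'
  }
}

# Attributes using the key the helper actually read.
correct = {
  'cfncluster' => {
    'cfn_node_type' => 'ComputeFleet',
    'enable_efa_gdr' => 'compute'
  }
}

puts efa_gdr_enabled?(misnamed) # false
puts efa_gdr_enabled?(correct)  # true
```

With the misnamed key the helper always returned false, which is why the guarded GDR test never ran.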
1 parent 2333fe9 commit 9a5d856

7 files changed: 10 additions, 27 deletions

.kitchen.yml

Lines changed: 0 additions & 1 deletion
@@ -105,7 +105,6 @@ suites:
 dcv_port: '8443'
 enable_intel_hpc_platform: 'true'
 enable_efa: 'compute'
-efa_gdr_enabled: 'compute'
 nvidia:
   enabled: <%= ENV['NVIDIA_ENABLED'] %>

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ This file is used to list changes made in each version of the AWS ParallelCluste
 - Fix failure when building AMI, due to SGE sources not available at arc.liv.ac.uk
 - Fix cluster update when using proxy setup.
 - Update ca-certificates package during AMI build time and prevent Chef from using outdated/distrusted CA certificates.
+- Upgrade EFA installer to version 1.14.0. Thereafter, EFA enables GDR support by default on supported instance type(s). ParallelCluster does not reinstall EFA during node start. Previously, EFA was reinstalled if `enable_efa_gdr` had been turned on in the configuration file.

 2.11.2
 -----

attributes/default.rb

Lines changed: 1 addition & 2 deletions
@@ -153,9 +153,8 @@
 )

 # EFA
-default['cfncluster']['efa']['installer_version'] = '1.13.0'
+default['cfncluster']['efa']['installer_version'] = '1.14.0'
 default['cfncluster']['efa']['installer_url'] = "https://efa-installer.amazonaws.com/aws-efa-installer-#{node['cfncluster']['efa']['installer_version']}.tar.gz"
-default['cfncluster']['enable_efa_gdr'] = "no"
 default['cfncluster']['efa']['unsupported_aarch64_oses'] = %w[centos7 centos8]

 # NICE DCV
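Because `installer_url` interpolates `installer_version`, bumping the version string is enough to point the cookbook at the new tarball. A minimal sketch of that interpolation, using a plain Hash named `efa` as a hypothetical stand-in for the Chef attribute tree:

```ruby
# `efa` is an ordinary Hash used for illustration, not Chef's attribute object.
efa = { 'installer_version' => '1.14.0' }

# Same interpolation pattern as the attribute file: the URL is derived from
# the version at the moment this line is evaluated.
efa['installer_url'] =
  "https://efa-installer.amazonaws.com/aws-efa-installer-#{efa['installer_version']}.tar.gz"

puts efa['installer_url']
# https://efa-installer.amazonaws.com/aws-efa-installer-1.14.0.tar.gz
```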

libraries/helpers.rb

Lines changed: 0 additions & 11 deletions
@@ -413,17 +413,6 @@ def get_nvswitches
   nvswitch_check.stdout.strip.to_i
 end

-# Check if EFA GDR is enabled (and supported) on this instance
-def efa_gdr_enabled?
-  config_value = node['cfncluster']['enable_efa_gdr']
-  enabling_value = if node['cfncluster']['cfn_node_type'] == "ComputeFleet"
-                     "compute"
-                   else
-                     "master"
-                   end
-  (config_value == enabling_value || config_value == "cluster") && graphic_instance?
-end
-
 # CentOS8 and alinux OSs currently not correctly supported by NFS cookbook
 # Overwriting templates for node['nfs']['config']['server_template'] used by NFS cookbook for these OSs
 # When running, NFS cookbook will use nfs.conf.erb templates provided in this cookbook to generate server_template

recipes/efa_config.rb

Lines changed: 0 additions & 3 deletions
@@ -15,9 +15,6 @@
 # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
 # limitations under the License.

-# Installation recipe must be re-executed at runtime to enable GDR
-include_recipe "aws-parallelcluster::efa_install"
-
 if node['platform'] == 'ubuntu' && node['cfncluster']['enable_efa'] == 'compute' && node['cfncluster']['cfn_node_type'] == 'ComputeFleet'
   # Disabling ptrace protection is needed for EFA in order to use SHA transfer for intra-node communication.
   replace_or_add "disable ptrace protection" do

recipes/efa_install.rb

Lines changed: 2 additions & 4 deletions
@@ -19,7 +19,7 @@
 efa_installed = efa_installed?

 if efa_installed && !::File.exist?(efa_tarball)
-  Chef::Log.warn("Existing EFA version differs from the one shipped with ParallelCluster. Skipping ParallelCluster EFA installation and configuration. enable_gdr option will be ignored.")
+  Chef::Log.warn("Existing EFA version differs from the one shipped with ParallelCluster. Skipping ParallelCluster EFA installation and configuration.")
   return
 end

@@ -50,8 +50,6 @@
 installer_options = "-y"
 # skip efa-kmod installation on not supported platforms
 installer_options += " -k" unless node['conditions']['efa_supported']
-# enable gpudirect support
-installer_options += " -g" if efa_gdr_enabled?

 bash "install efa" do
   cwd node['cfncluster']['sources_dir']
@@ -62,7 +60,7 @@
     ./efa_installer.sh #{installer_options}
     rm -rf #{node['cfncluster']['sources_dir']}/aws-efa-installer
   EFAINSTALL
-  not_if { efa_installed && !efa_gdr_enabled? }
+  not_if { efa_installed }
 end

 # EFA installer v1.11.0 removes libibverbs-core, which contains hwloc-devel during install
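The effect of the changes in this file can be sketched in plain Ruby. The method names below (`installer_options`, `skip_install_before?`, `skip_install_after?`) are hypothetical; the recipe builds the option string inline and expresses the guard as a `not_if` block.

```ruby
# Hypothetical helper mirroring how the recipe now builds the installer flags:
# the -g (GPUDirect) flag is no longer appended.
def installer_options(efa_supported:)
  options = '-y'
  # skip efa-kmod installation on unsupported platforms
  options += ' -k' unless efa_supported
  options
end

# Old guard: an existing installation was skipped only when GDR was not requested,
# so enabling GDR forced a reinstall at node start.
def skip_install_before?(efa_installed:, gdr_enabled:)
  efa_installed && !gdr_enabled
end

# New guard: an existing installation is always left alone.
def skip_install_after?(efa_installed:)
  efa_installed
end

puts installer_options(efa_supported: true)   # -y
puts installer_options(efa_supported: false)  # -y -k
puts skip_install_before?(efa_installed: true, gdr_enabled: true) # false
puts skip_install_after?(efa_installed: true)                     # true
```

The last two lines show the behavioral difference: with EFA already installed and GDR requested, the old guard let the install step run again, while the new guard skips it.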

recipes/tests.rb

Lines changed: 6 additions & 6 deletions
@@ -363,12 +363,12 @@ module load intelmpi && mpirun --help | grep '#{node['cfncluster']['intelmpi']['
       grep "EFA installer version: #{node['cfncluster']['efa']['installer_version']}" /opt/amazon/efa_installed_packages
     EFA
   end
-end
-# GDR (GPUDirect RDMA)
-if node['conditions']['efa_supported'] && efa_gdr_enabled?
-  execute 'check efa gdr installed' do
-    command "modinfo efa | grep 'gdr:\ *Y'"
-    user node['cfncluster']['cfn_cluster_user']
+  # GDR (GPUDirect RDMA)
+  if node['conditions']['efa_supported']
+    execute 'check efa gdr installed' do
+      command "modinfo efa | grep 'gdr:\ *Y'"
+      user node['cfncluster']['cfn_cluster_user']
+    end
   end
 end
