Commit 11cd7c7

Himani Anil Deshpande committed
[FABRIC MANAGER] Using common library for getting NVSwitch count
1 parent de10f70 commit 11cd7c7

2 files changed: +20 -8 lines changed

cookbooks/aws-parallelcluster-platform/libraries/nvidia.rb

Lines changed: 18 additions & 0 deletions
@@ -18,3 +18,21 @@ def is_process_running(process_name)
 
   !ps.stdout.strip.empty?
 end
+
+#
+# Get count of NVSwitch devices matching the given PCI device ID
+#
+def get_nvswitch_count(device_id)
+  shell_out("lspci -d #{device_id} | wc -l").stdout.strip.to_i
+end
+
+def get_device_ids
+  # A100 (P4), H100 (P5), B200 (P6) and GB200 (P6e) systems have NVSwitches
+  # NVSwitch device id is 10de:1af1 for P4 instance
+  # NVSwitch device id is 10de:22a3 for P5 instance
+  # NVSwitch device id is 10de:2901 for P6 instance
+  # NVSwitch device id is 10de:2941 for P6e instance
+  # Callers sum the counts for all these device IDs, since the lspci output will be
+  # non-empty for only one device ID, depending on the instance type
+  { 'a100' => '10de:1af1', 'h100' => '10de:22a3', 'b200' => '10de:2901', 'gb200' => '10de:2941' }
+end
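As a rough illustration of what `get_nvswitch_count` relies on: `lspci -d <vendor>:<device>` prints one line per matching PCI device, so piping to `wc -l` yields the device count. The sketch below reproduces that counting in plain Ruby. The helper name `count_pci_lines` and the sample output are hypothetical and made up for illustration; they are not part of the commit, and real `lspci` output depends on the host.

```ruby
# Made-up sample resembling `lspci -d 10de:22a3` output (illustrative only).
SAMPLE_LSPCI_OUTPUT = <<~OUT
  00:1b.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
  00:1c.0 Bridge: NVIDIA Corporation Device 22a3 (rev a1)
OUT

# Hypothetical helper: counts the non-blank lines of lspci output,
# which matches what `| wc -l` reports for newline-terminated output.
def count_pci_lines(lspci_output)
  lspci_output.lines.count { |line| !line.strip.empty? }
end

puts count_pci_lines(SAMPLE_LSPCI_OUTPUT) # two matching devices in the sample
puts count_pci_lines('')                  # no matching devices
```

Counting in Ruby would avoid the shell pipe, but the commit keeps the `wc -l` form; this is only a sketch of the equivalent logic.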

cookbooks/aws-parallelcluster-platform/resources/fabric_manager/partial/_fabric_manager_common.rb

Lines changed: 2 additions & 8 deletions
@@ -54,12 +54,6 @@ def _nvidia_driver_version
 
 # Get number of nv switches
 def get_nvswitches
-  # A100 (P4), H100(P5) and B200(P6) systems have NVSwitches
-  # NVSwitch device id is 10de:1af1 for P4 instance
-  # NVSwitch device id is 10de:22a3 for P5 instance
-  # NVSwitch device id is 10de:2901 for P6 instance
-  # We sum the count for all these deviceIds as output of lspci command will be >0
-  # for only one device ID based on the instance type
-  nvswitch_device_ids = ['10de:1af1', '10de:22a3', '10de:2901']
-  nvswitch_device_ids.sum { |id| shell_out("lspci -d #{id} | wc -l").stdout.strip.to_i }
+  nvswitch_device_ids = get_device_ids.values
+  nvswitch_device_ids.sum { |id| get_nvswitch_count(id) }
 end
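To see how the refactored pieces fit together, here is a minimal sketch that can be run outside of a Chef run. It copies the three methods from the diff and replaces Chef's `shell_out` with a hypothetical stub; `FakeResult`, `FAKE_LSPCI_COUNTS`, and the count of 4 are assumptions made up for illustration, emulating a host where only the H100 device ID matches.

```ruby
# Hypothetical stand-in for the object returned by Chef's shell_out helper.
FakeResult = Struct.new(:stdout)

# Made-up counts: pretend only the H100 device ID matches on this host.
FAKE_LSPCI_COUNTS = { '10de:22a3' => 4 }.freeze

# Stubbed shell_out (assumption): extracts the device ID from the command
# and returns the fake count in the form `wc -l` would print it.
def shell_out(command)
  device_id = command[/lspci -d (\S+)/, 1]
  FakeResult.new("#{FAKE_LSPCI_COUNTS.fetch(device_id, 0)}\n")
end

# The three methods below mirror the code in the diff.
def get_nvswitch_count(device_id)
  shell_out("lspci -d #{device_id} | wc -l").stdout.strip.to_i
end

def get_device_ids
  { 'a100' => '10de:1af1', 'h100' => '10de:22a3', 'b200' => '10de:2901', 'gb200' => '10de:2941' }
end

def get_nvswitches
  nvswitch_device_ids = get_device_ids.values
  nvswitch_device_ids.sum { |id| get_nvswitch_count(id) }
end

puts get_nvswitches # prints 4 with the fake counts above
```

Because every non-matching device ID contributes 0, summing over all IDs gives the same result as querying only the ID for the current instance type, which is why `get_nvswitches` does not need to know which platform it runs on.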
