
Commit 9db7192

Vijay Manickam committed
Expose all options meshpinger internally supports
- Pass all arguments given to this script as is to meshpinger binary
- Moved hpc cluster specific clustername and all logic around it in the script to meshpinger binary. No change in logic
- Added a readme for meshpinger

Testing:

* Options -

    [opc@meshpinger-bm1-controller healthchecks]$ ./run_meshpinger.sh -h
    Tue Oct 8 21:16:54 GMT 2024
    Identity added: /home/opc/.ssh/id_rsa (/home/opc/.ssh/id_rsa)
    /opt/oci-hpc/healthchecks/meshpinger_bm
    usage: ./run_meshpinger.sh [-h]
                               [--hostlistfile HOSTLISTFILE | --hpcclustername HPCCLUSTERNAME]
                               [--clusterinfo CLUSTERINFO] [--ssh_port SSH_PORT]
                               [--ping_timeout PING_TIMEOUT]
                               [--dump_arp_on_failure] [--flush_arp]
                               [--nic_model NIC_MODEL]
                               [--objectstoreurl OBJECTSTOREURL] [--singlesubnet]
                               [--threads_per_intf THREADS_PER_INTF] [--verbose]

    optional arguments:
      -h, --help            show this help message and exit
      --hostlistfile HOSTLISTFILE
                            File listing name/ip of the hosts to include in
                            meshping
      --hpcclustername HPCCLUSTERNAME
                            OCI HPC stack clustername
      --clusterinfo CLUSTERINFO
                            Use this cluster info file (generated from previous
                            runs) and skip gathering cluster information in this
                            run
      --ssh_port SSH_PORT   ssh port to use, port 22 will be used if not specified
      --ping_timeout PING_TIMEOUT
                            Duration ping waits for reply before timing out,
                            default is 1sec
      --dump_arp_on_failure
                            Log arp entry for failed pings
      --flush_arp           Flush arp cache before starting pinger
      --nic_model NIC_MODEL
                            Model of the RDMA NIC eg. MT2910(CX-7) to use if auto
                            detect fails
      --objectstoreurl OBJECTSTOREURL
                            ObjectStore PAR URL where mesh pinger logs will be
                            uploaded
      --singlesubnet        Include this argument if all RDMA NICs are on a single
                            subnetted cluster network. If so pinger will do a full
                            mesh ping
      --threads_per_intf THREADS_PER_INTF
                            parallel ping threads per local rdma interface,
                            default is 16
      --verbose             Log all debug messages including successful pings.
                            Default is to log only failed pings

* Tested on 2 new node cluster created using hpc stack with following params

    ./run_meshpinger.sh --nic_model MT28800 --singlesubnet
    ./run_meshpinger.sh --hpcclustername meshpinger-BM1 --nic_model MT28800 --singlesubnet
    ./run_meshpinger.sh --hostlistfile meshpinger_bm/hosts --nic_model MT28800 --singlesubnet

  nic_model is needed here since these are Non-GPU host shapes
1 parent 126cc23 commit 9db7192

2 files changed: +184 -51 lines changed
Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@

# OCI Meshpinger

Meshpinger is a tool for validating network layer connectivity between RDMA NICs on a
cluster network in OCI. The tool is capable of initiating ICMP ping from every RDMA NIC
port on the cluster network to every other RDMA NIC port on the same cluster network and
reporting back the success/failure status of the pings performed in the form of logs.

Running the tool before starting a workload on a cluster network serves as a good precheck
step to gain confidence in the network reachability between RDMA NICs. Typical causes of
reachability failures that the tool can help pinpoint are:

1. Host RDMA interface down

2. Host RDMA interface missing IP configuration

3. Host RDMA interface missing MAC

4. Host RDMA interface enumeration issues

5. Ping failure between a <src,dst> pair of IPs

# Running Meshpinger

Meshpinger is installed on the controller node of the HPC cluster and can be run in the following ways after logging into the controller node:

- Run meshpinger on all nodes in the cluster; the cluster is auto-detected in this case
```
/opt/oci-hpc/healthchecks/run_meshpinger.sh
```

- Run meshpinger on all nodes in the cluster explicitly specified by clustername
```
/opt/oci-hpc/healthchecks/run_meshpinger.sh --hpcclustername <hpcclustername>
```

- Run meshpinger on a list of nodes specified in a file. A host can be specified by its SSH IP address or hostname, but it must be SSH-able from the controller node
```
/opt/oci-hpc/healthchecks/run_meshpinger.sh --hostlistfile <filename>
```

# Output

- All RDMA interface configuration issues are reported like the sample below:

```
Faulty RDMA interfaces(Link down/misconfigured)

    Hostid/Serial/hostname      Interface    RDMA_IP    PCI           MAC                Link Status
--  --------------------------  -----------  ---------  ------------  -----------------  -------------
 0  GPU-711/2109XCL016/GPU-711  rdma1        0.0.0.0    0000:98:00.1  b8:ce:f6:00:12:29  DOWN
 1  GPU-278/2110XCL04V/GPU-278  rdma1        0.0.0.0    0000:98:00.1  04:3f:72:e0:6b:0d  DOWN
```

- If there are ping failures from the run, the total number of unique <srcInterface,dstInterface> pings that failed per host is printed as a table like the sample below:

```
ICMP ping failures per host

    Hostid/Serial/Hostname        Total Failures
--  --------------------------  ----------------
 0  GPU-711/2109XCL016/GPU-711                 1
 1  GPU-278/2110XCL04V/GPU-278                 1
```
The name of the log file for the current run, which enumerates all <srcInterface,dstInterface> combinations that failed ping, is printed like:

```
<src,dst> interfaces that failed ping is listed at end of the log file meshpinger_log_20241008220615_ocid1.tenancy.oc1..aaaaaaaabddc4obuhgvifcrh6esmw6554ityaqrvxulcksl255gbwehtcq.txt
```


- If there are no ping failures from the run, the following message is printed:

```
All pings succeeded!!
```
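
When meshpinger is used as an automated precheck, a wrapper may need a pass/fail signal. The sketch below keys off the success message shown above. It assumes the message is printed verbatim in the captured output; the exit status of meshpinger is not described here, so the sketch does not rely on it. The function name `meshpinger_passed` and the capture file path are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of meshpinger): returns 0 if the captured
# output contains the documented success message, non-zero otherwise.
meshpinger_passed() {
    local output_file="$1"
    # -F: match the message as a fixed string, -q: no output, status only
    grep -qF 'All pings succeeded!!' "$output_file"
}

# Example usage (left commented out so the snippet has no side effects):
#   /opt/oci-hpc/healthchecks/run_meshpinger.sh 2>&1 | tee /tmp/meshpinger.out
#   if meshpinger_passed /tmp/meshpinger.out; then
#       echo "precheck passed"
#   else
#       echo "precheck failed, inspect the log file named in the output"
#   fi
```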

# Options
Other options supported are shown in the help text below.

```
/opt/oci-hpc/healthchecks/run_meshpinger.sh --help

usage: ./run_meshpinger.sh [-h]
                           [--hostlistfile HOSTLISTFILE | --hpcclustername HPCCLUSTERNAME]
                           [--clusterinfo CLUSTERINFO] [--ssh_port SSH_PORT]
                           [--ping_timeout PING_TIMEOUT]
                           [--dump_arp_on_failure] [--flush_arp]
                           [--nic_model NIC_MODEL]
                           [--objectstoreurl OBJECTSTOREURL] [--singlesubnet]
                           [--threads_per_intf THREADS_PER_INTF] [--verbose]

optional arguments:
  -h, --help            show this help message and exit
  --hostlistfile HOSTLISTFILE
                        File listing name/ip of the hosts to include in
                        meshping
  --hpcclustername HPCCLUSTERNAME
                        OCI HPC stack clustername
  --clusterinfo CLUSTERINFO
                        Use this cluster info file (generated from previous
                        runs) and skip gathering cluster information in this
                        run
  --ssh_port SSH_PORT   ssh port to use, port 22 will be used if not specified
  --ping_timeout PING_TIMEOUT
                        Duration ping waits for reply before timing out,
                        default is 1sec
  --dump_arp_on_failure
                        Log arp entry for failed pings
  --flush_arp           Flush arp cache before starting pinger
  --nic_model NIC_MODEL
                        Model of the RDMA NIC eg. MT2910(CX-7) to use if auto
                        detect fails
  --objectstoreurl OBJECTSTOREURL
                        ObjectStore PAR URL where mesh pinger logs will be
                        uploaded
  --singlesubnet        Include this argument if all RDMA NICs are on a single
                        subnetted cluster network. If so pinger will do a full
                        mesh ping
  --threads_per_intf THREADS_PER_INTF
                        parallel ping threads per local rdma interface,
                        default is 16
  --verbose             Log all debug messages including successful pings.
                        Default is to log only failed pings
```

# Description
Detailed description of each option is below:

**--hostlistfile**

Path to the file containing the list of hosts to be used for the current meshpinger run. A host can be specified by its IP address or its hostname, but it must be SSH-able via whichever of the two is specified. The string specified here is listed as Hostid in the final report of the meshping run.
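
Since every entry in the host list must be SSH-able from the controller node, it can help to verify that before a run. The following is a minimal sketch, not part of meshpinger. It assumes the file holds one host per line (the format is not spelled out here), and the function name `check_hosts_reachable` and the `SSH_CMD` override are hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-check: try a non-interactive SSH login to each host.
# Assumption: the host list file has one hostname or IP address per line.
# SSH_CMD can be overridden (for example with "true") for a dry run.
check_hosts_reachable() {
    local hostfile="$1" port="${2:-22}" host failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        [ -z "$host" ] && continue            # skip blank lines
        # -n keeps ssh from consuming the loop's stdin;
        # BatchMode fails instead of prompting for a password.
        if ! "${SSH_CMD:-ssh}" -n -p "$port" -o BatchMode=yes \
                -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "not SSH-able: $host"
            failed=1
        fi
    done < "$hostfile"
    return "$failed"
}

# Example usage (commented out; the path is illustrative):
#   check_hosts_reachable /path/to/hostlist 22 && echo "all hosts reachable"
```

The second argument mirrors `--ssh_port`, so the same port can be checked that meshpinger will later use.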

**--hpcclustername**

Clustername specified when the cluster was created using the OCI HPC stack.


**--clusterinfo**

File containing cluster information generated from a previous meshpinger run. When this is specified, the current run skips gathering RDMA interface details from the hosts and moves straight to the meshping tests, saving some runtime. Note that specifying this option forces meshpinger to use previously collected RDMA interface details, which could be stale, especially for attributes like link state and IP assignment.

**--ssh_port**

Port to use for SSH to the hosts specified in the hostlistfile. Port 22 is used if this option is not specified.

**--ping_timeout**

Time in milliseconds that ping waits for a successful reply from the remote IP, including the time it takes for ARP resolution. The timeout is 1 second by default if this option is not specified. Overall, meshpinger performs 10 retries for each remote IP before marking it as a ping failure.
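
To get a feel for how the timeout and the retry count combine, the arithmetic below estimates an upper bound on the time spent on one unreachable remote IP. It is only a sketch: it takes the option value in milliseconds as stated above, and it assumes the 10 retries run one after another and each waits for the full timeout, which is not confirmed here. The function name `worst_case_wait_ms` is hypothetical.

```shell
#!/bin/bash
# Hypothetical estimate: upper bound (in ms) spent on one unreachable IP,
# assuming sequential retries that each wait for the full timeout.
worst_case_wait_ms() {
    local timeout_ms="$1" retries="${2:-10}"
    echo $(( timeout_ms * retries ))
}

worst_case_wait_ms 1000   # default 1 second timeout -> prints 10000
worst_case_wait_ms 100    # a 100 ms timeout         -> prints 1000
```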

**--dump_arp_on_failure**

When this option is specified, for each ping failure the corresponding ARP table entry (including the status field) for the remote IP on the local host is dumped in the meshpinger logs. This is disabled by default.

**--flush_arp**

When this option is specified, meshpinger flushes the ARP table on each of the hosts before starting the ping validation test.

**--nic_model**

NIC model to use (e.g. MT2910 for CX-7) for separating RDMA interfaces from front-end network interfaces while gathering RDMA interface information on each host. By default, meshpinger picks the model shared by the majority of interfaces on the host, given that the backend network interface count always exceeds the frontend network interface count.

**--objectstoreurl**

Pre-Authenticated Request (PAR) URL where meshpinger logs will be uploaded. Customers can use this to easily share meshpinger logs with an OCI operator. The OCI operator can create a PAR to an Object Storage bucket they own and share it with the customer so that the customer can upload the logs.

**--singlesubnet**

This option specifies that all RDMA interfaces on the hosts in the hostlist file are part of a single subnet. In this case meshpinger pings all remote IPs from all local interfaces on a given host. Note that when this option is chosen, the net.ipv4.neigh.default.gc_threshX [X=1-3] sysctl settings on every host may need to be raised to hold the necessary ARP entries per local interface. For example, to run meshpinger on a 512-host cluster with 16 RDMA interfaces per host, the ARP table must hold at least 130816 entries (511 * 16 * 16). Accordingly, it is recommended to set all 3 sysctl thresholds, net.ipv4.neigh.default.gc_threshX [X=1-3], to 130816. By default meshpinger assumes each RDMA interface on a host is on a separate subnet and performs pings between RDMA interfaces that have the same PCI address.
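
The sizing rule above can be worked through for other cluster sizes. The sketch below computes the entry count as (hosts - 1) * interfaces * interfaces and prints the matching `sysctl -w` commands instead of applying them, since changing these settings needs root. The function names are hypothetical and the output is meant to be reviewed before use.

```shell
#!/bin/bash
# ARP entries needed per host for a full mesh ping on a single subnet:
# every local interface resolves every interface on every other host.
arp_entries_needed() {
    local hosts="$1" nics_per_host="$2"
    echo $(( (hosts - 1) * nics_per_host * nics_per_host ))
}

# Print (do not apply) the sysctl commands for the three thresholds.
print_gc_thresh_cmds() {
    local size="$1" n
    for n in 1 2 3; do
        echo "sysctl -w net.ipv4.neigh.default.gc_thresh${n}=${size}"
    done
}

size=$(arp_entries_needed 512 16)
echo "$size"                      # prints 130816, matching the example above
print_gc_thresh_cmds "$size"
```

Printing the commands rather than running them keeps the snippet safe to try on any machine; the printed lines can then be run as root on each host.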

**--threads_per_intf**

By default, meshpinger on each host in the hostlist file uses 16 parallel threads per local interface to perform pings. This option overrides that setting; allowed values are 1-32.

**--verbose**

By default only ping failures are logged, to limit the log file size. When this option is specified, successful pings are also logged.

Lines changed: 5 additions & 51 deletions
@@ -1,56 +1,10 @@
 #!/bin/bash

-if [ "$1" == "-h" ]; then
-    echo "Usage: ./run_meshpinger.sh [options]"
-    echo "INFO:"
-    echo "Meshpinger is a tool for validating network layer connectivity between RDMA NICs on a"
-    echo "cluster network in OCI. The tool is capable of initiating ICMP ping from every RDMA NIC"
-    echo "port on the cluster network to every other RDMA NIC port on the same cluster network and"
-    echo "reporting back the success/failure status of the pings performed in the form of logs"
-
-    echo "Running the tool before starting workload on a cluster network should serve as a good precheck"
-    echo "step to gain confidence on the network reachability between RDMA NICs. Typical causes for"
-    echo "reachability failures that the tool can help pinpoint are,"
-    echo "1. Link down on the RDMA NIC"
-    echo "2. RDMA interface initialization or configuration issues including IP address assignment to"
-    echo "the interface"
-    echo "3. Insufficient ARP table size on the node to store all needed peer mac addresses"
-    echo " "
-    echo "Options:"
-    echo " -h Display this help message"
-    echo " [arg1] Enter either --clustername or --hostlist"
-    echo " [arg2] Enter the clustername or the path of the hostlist based on arg1"
-    # Exit the script after showing the help message
-    exit 0
-fi
-
-if [ $# -gt 0 ]; then
-    if [ "$1" == "--clustername" ]; then
-        cluster_name=$2
-    else
-        cat $2 > /tmp/all_hosts
-    fi
-else
-    cluster_name=`cat /etc/ansible/hosts | grep cluster_name | awk '{print $3}'`
-fi
-
+export WRAPPER_BIN="$0"
+export WRAPPER_ENV="OCI_HPC_STACK"
 date
-eval "$(ssh-agent -s)" >/dev/null ; ssh-add ~/.ssh/id_rsa >/dev/null

-if [ -z "$cluster_name" ]; then
-    if [ -f "$2" ]; then
-        echo "Using $2 as hostlist"
-    else
-        echo "The clustername is empty, running on all hosts"
-        cat /etc/hosts | grep .local.vcn | awk '{print $2}' > /tmp/all_hosts
-    fi
-else
-    echo "Clustername is $2"
-    cat /etc/hosts | grep .local.vcn | grep ${cluster_name} | awk '{print $2}' > /tmp/all_hosts
-fi
-output_file="/tmp/failed_nodes"
-/opt/oci-hpc/healthchecks/meshpinger_bm/run_meshpinger --hostlistfile /tmp/all_hosts --singlesubnet --ping_timeout 100 2>&1 | grep "INCOM\|DELAY" | awk '{print $6}' | sort -u | tee ${output_file}
+eval "$(ssh-agent -s)" >/dev/null ; ssh-add ~/.ssh/id_rsa >/dev/null

-if [ ! -s "$output_file" ]; then
-    echo "No nodes have RDMA connections unreachable. The list of tested nodes is at /tmp/all_hosts"
-fi
+# Run meshpinger
+/opt/oci-hpc/healthchecks/meshpinger_bm/run_meshpinger "$@"
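
The new wrapper forwards its arguments with a quoted `"$@"`, which hands each argument to the binary exactly as the caller wrote it. The short demonstration below shows why the quotes matter. `show_args` is a hypothetical stand-in for the meshpinger binary, and the file name containing a space is only an illustration.

```shell
#!/bin/bash
# Stand-in for the real binary: prints each received argument on its own line.
show_args() { printf '<%s>\n' "$@"; }

# Quoted form, as used by the wrapper: arguments arrive unchanged.
forward_quoted() { show_args "$@"; }

# Unquoted form, shown only for contrast: arguments are re-split on whitespace.
forward_unquoted() { show_args $@; }

forward_quoted --hostlistfile "my hosts.txt"
# <--hostlistfile>
# <my hosts.txt>

forward_unquoted --hostlistfile "my hosts.txt"
# <--hostlistfile>
# <my>
# <hosts.txt>
```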
