Commit 1e55551

Merge pull request #209 from oci-hpc/2.11.0-vimuthuv
Expose all options meshpinger internally supports
2 parents 126cc23 + 5fe6cb5 commit 1e55551

File tree

2 files changed

+188
-51
lines changed

Lines changed: 183 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,183 @@
1+
2+
# OCI Meshpinger
3+
4+
Meshpinger is a tool for validating network layer connectivity between RDMA NICs on a
5+
cluster network in OCI. The tool initiates an ICMP ping from every RDMA NIC
6+
port on the cluster network to every other RDMA NIC port on the same cluster network and
7+
reports back the success/failure status of the pings performed in the form of logs.
8+
9+
Running the tool before starting a workload on a cluster network serves as a good precheck
10+
step to gain confidence in network reachability between RDMA NICs. Typical causes of
11+
reachability failures that the tool can help pinpoint are:
12+
13+
1. Host RDMA interface down
14+
15+
2. Host RDMA interface missing IP configuration
16+
17+
3. Host RDMA interface missing MAC address
18+
19+
4. Host RDMA interface enumeration issues
20+
21+
5. Network connectivity issues between a <src,dst> pair of IPs
22+
23+
# Running Meshpinger
24+
25+
Meshpinger is installed on the controller host of the HPC cluster. Once logged into the controller host, a user can run meshpinger in any of the following ways:
26+
27+
- Run meshpinger on all hosts in the cluster. The cluster is auto-detected in this option.
28+
```
29+
/opt/oci-hpc/healthchecks/run_meshpinger.sh
30+
```
31+
32+
- Run meshpinger on all hosts in the cluster explicitly specified by clustername
33+
```
34+
/opt/oci-hpc/healthchecks/run_meshpinger.sh --hpcclustername <hpcclustername>
35+
```
36+
37+
- Run meshpinger on a list of hosts specified in a file. A host can be specified by its IP address or hostname. Each host is expected to be reachable over SSH from the controller host.
38+
```
39+
/opt/oci-hpc/healthchecks/run_meshpinger.sh --hostlistfile <filename>
40+
```
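A host list file can be prepared along the following lines. This is a minimal sketch: it assumes one host per line (the layout the previous wrapper produced with `awk '{print $2}'`), and the host names and temporary file path are placeholders rather than values meshpinger requires.

```shell
#!/bin/bash
# Build a host list file for --hostlistfile, assuming one host per line.
# The host names below are placeholders, not real cluster members.
hostlist_file="$(mktemp /tmp/meshpinger_hosts.XXXXXX)"

# Drop blank lines and duplicate entries while preserving the original order.
printf '%s\n' \
  "GPU-711" \
  "GPU-278" \
  "" \
  "GPU-711" |
  awk 'NF && !seen[$1]++ { print $1 }' > "$hostlist_file"

host_count=$(wc -l < "$hostlist_file" | tr -d ' ')
echo "Wrote ${host_count} hosts to ${hostlist_file}"

# Then, on the controller host:
# /opt/oci-hpc/healthchecks/run_meshpinger.sh --hostlistfile "$hostlist_file"
```

Removing duplicates up front keeps the Hostid column of the final report free of repeated entries.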
41+
42+
# Output
43+
44+
- All RDMA interface configuration issues are reported as in the sample below:
45+
46+
```
47+
Faulty RDMA interfaces(Link down/misconfigured)
48+
49+
    Hostid/Serial/hostname      Interface    RDMA_IP    PCI           MAC                Link Status
50+
--  --------------------------  -----------  ---------  ------------  -----------------  -------------
51+
 0  GPU-711/2109XCL016/GPU-711  rdma1        0.0.0.0    0000:98:00.1  b8:ce:f6:00:12:29  DOWN
52+
 1  GPU-278/2110XCL04V/GPU-278  rdma1        0.0.0.0    0000:98:00.1  04:3f:72:e0:6b:0d  DOWN
53+
```
54+
55+
- If there are ping failures from the run, the total number of unique <srcInterface,dstInterface> pings that failed per host is printed as a table, as in the sample below:
56+
57+
```
58+
ICMP ping failures per host
59+
60+
    Hostid/Serial/Hostname        Total Failures
61+
--  --------------------------  ----------------
62+
 0  GPU-711/2109XCL016/GPU-711                 1
63+
 1  GPU-278/2110XCL04V/GPU-278                 1
64+
```
65+
The log file of the current run, which enumerates all <srcInterface,dstInterface> combinations that failed ping, is reported as follows:
66+
67+
```
68+
<src,dst> interfaces that failed ping are listed at end of the log file meshpinger_log_20241008220615_ocid1.tenancy.oc1..aaaaaaaabddc4obuhgvifcrh6esmw6554ityaqrvxulcksl255gbwehtcq.txt
69+
```
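The per-host failure table above can be reduced to a plain list of failing hosts for scripting. The sketch below assumes the column layout matches the sample exactly (index, `Hostid/Serial/Hostname`, failure count) and that the report text has already been captured into a variable. A hostname containing `/` would need different handling.

```shell
#!/bin/bash
# Sample rows copied from the "ICMP ping failures per host" table above.
report='ICMP ping failures per host

    Hostid/Serial/Hostname        Total Failures
--  --------------------------  ----------------
 0  GPU-711/2109XCL016/GPU-711                 1
 1  GPU-278/2110XCL04V/GPU-278                 1'

# Keep only data rows (numeric index, three fields), then print the
# hostname part of Hostid/Serial/Hostname followed by the failure count.
failing_hosts=$(printf '%s\n' "$report" |
  awk '$1 ~ /^[0-9]+$/ && NF == 3 { split($2, id, "/"); print id[3], $3 }')

echo "$failing_hosts"
```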
70+
71+
72+
- If there are no ping failures from the run, the following message is printed:
73+
74+
```
75+
All pings succeeded!!
76+
```
77+
- Cluster information, including the RDMA interface details gathered during the run, is stored in the file cluster_info.txt in the current directory. The file name is printed as below:
78+
```
79+
clusterinfo file - cluster_info.txt
80+
```
81+
82+
# Options
83+
Other options supported are shown in the help text below.
84+
85+
```
86+
/opt/oci-hpc/healthchecks/run_meshpinger.sh --help
87+
88+
usage: ./run_meshpinger.sh [-h]
89+
[--hostlistfile HOSTLISTFILE | --hpcclustername HPCCLUSTERNAME]
90+
[--clusterinfo CLUSTERINFO] [--ssh_port SSH_PORT]
91+
[--ping_timeout PING_TIMEOUT]
92+
[--dump_arp_on_failure] [--flush_arp]
93+
[--nic_model NIC_MODEL]
94+
[--objectstoreurl OBJECTSTOREURL] [--singlesubnet]
95+
[--threads_per_intf THREADS_PER_INTF] [--verbose]
96+
97+
optional arguments:
98+
-h, --help show this help message and exit
99+
--hostlistfile HOSTLISTFILE
100+
File listing name/ip of the hosts to include in
101+
meshping
102+
--hpcclustername HPCCLUSTERNAME
103+
OCI HPC stack clustername
104+
--clusterinfo CLUSTERINFO
105+
Use this cluster info file (generated from previous
106+
runs) and skip gathering cluster information in this
107+
run
108+
--ssh_port SSH_PORT ssh port to use, port 22 will be used if not specified
109+
--ping_timeout PING_TIMEOUT
110+
Duration ping waits for reply before timing out,
111+
default is 1sec
112+
--dump_arp_on_failure
113+
Log arp entry for failed pings
114+
--flush_arp Flush arp cache before starting pinger
115+
--nic_model NIC_MODEL
116+
Model of the RDMA NIC eg. MT2910(CX-7) to use if auto
117+
detect fails
118+
--objectstoreurl OBJECTSTOREURL
119+
ObjectStore PAR URL where mesh pinger logs will be
120+
uploaded
121+
--enable_inter_rail_ping
122+
Include this argument to perform pings across the rails.
123+
If so pinger will do a full mesh ping
124+
--threads_per_intf THREADS_PER_INTF
125+
parallel ping threads per local rdma interface,
126+
default is 16
127+
--verbose Log all debug messages including successful pings.
128+
Default is to log only failed pings
129+
```
130+
131+
# Description
132+
A detailed description of each option is given below.
133+
134+
**--hostlistfile**
135+
136+
Path to a file containing the list of hosts to use for the current meshpinger run. A host can be specified by its IP address or its hostname, but it must be reachable over SSH using whichever of the two is specified. The string specified here is listed as the Hostid in the final report of the meshping run.
137+
138+
**--hpcclustername**
139+
140+
Clustername specified when the cluster was created using the OCI HPC stack.
141+
142+
143+
**--clusterinfo**
144+
145+
File containing cluster information generated by a previous meshpinger run. When this is specified, the current run skips gathering RDMA interface details from the hosts and moves straight to the meshping tests, saving some runtime. Note that this option forces meshpinger to use previously collected RDMA interface details, which may be stale, especially for attributes such as link state and IP assignment.
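Because a reused cluster info file can be stale, a wrapper may want to check its age before passing it to `--clusterinfo`. The helper below is a hypothetical sketch: the function name and the 60 minute threshold are illustrative assumptions, not part of meshpinger.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if FILE exists and was modified
# less than MAX_AGE_MIN minutes ago.
clusterinfo_is_fresh() {
  local file=$1 max_age_min=$2
  [ -f "$file" ] && [ -n "$(find "$file" -mmin "-${max_age_min}")" ]
}

info_file="cluster_info.txt"   # file name taken from the Output section
if clusterinfo_is_fresh "$info_file" 60; then
  echo "Reusing ${info_file}: add --clusterinfo ${info_file}"
else
  echo "${info_file} missing or older than 60 minutes: gather fresh details"
fi
```

An age check only bounds how old the data is. Link state or IP assignment can still change within that window.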
146+
147+
**--ssh_port**
148+
149+
Port to use for SSH to the hosts specified in the hostlistfile. Port 22 is used if this option is not specified.
150+
151+
**--ping_timeout**
152+
153+
Time in milliseconds that ping waits for a successful reply from the remote IP, including the time it takes for ARP resolution. If this option is not specified, the timeout is 1 second (1000 ms). Meshpinger performs 10 retries for each remote IP before marking it as a ping failure.
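The timeout and retry count together bound how long an unreachable remote IP can hold up a ping thread. The arithmetic below is a rough sketch that assumes the 10 attempts run back to back and each waits the full timeout. The helper name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: upper bound on time spent on one unreachable remote IP,
# assuming all attempts run sequentially and each waits the full timeout.
worst_case_wait_ms() {
  local timeout_ms=$1 attempts=$2
  echo $(( timeout_ms * attempts ))
}

default_ms=$(worst_case_wait_ms 1000 10)   # default 1 second timeout
short_ms=$(worst_case_wait_ms 100 10)      # with --ping_timeout 100
echo "default: ${default_ms} ms, with --ping_timeout 100: ${short_ms} ms"
```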
154+
155+
**--dump_arp_on_failure**
156+
157+
When this option is specified, the ARP table entry (including the status field) for the remote IP on the local host is dumped to the meshpinger logs for each ping failure. This is disabled by default.
158+
159+
**--flush_arp**
160+
161+
When this option is specified, meshpinger flushes the ARP table on each of the hosts before starting the ping validation test.
162+
163+
**--nic_model**
164+
165+
NIC model to use (e.g. MT2910 for CX-7) to distinguish RDMA interfaces from front-end network interfaces while gathering RDMA interface information on each host. By default, meshpinger infers the model from the model of the majority of interfaces on the host, given that the backend network interface count always exceeds the frontend network interface count.
166+
167+
**--objectstoreurl**
168+
169+
Pre-Authenticated Request (PAR) URL where meshpinger logs will be uploaded. Customers can use this to share meshpinger logs with OCI during an incident: OCI can provide a PAR for an objectstore bucket and share it with the customer for this purpose.
170+
171+
**--enable_inter_rail_ping**
172+
173+
This option indicates that all RDMA interfaces on the hosts in the hostlist file are part of a single subnet. In this case meshpinger pings every remote IP from every local interface on a given host, i.e. a full mesh ping across the rails. By default, meshpinger only pings along the rails. Note that when this option is chosen, the net.ipv4.neigh.default.gc_threshX [X=1-3] sysctl settings on every host may need to be raised to hold the necessary ARP entries per local interface. For example, to run meshpinger on a 512-host cluster where each host has 16 RDMA interfaces, the ARP table must hold at least 130816 entries (511 * 16 * 16). Accordingly, it is recommended to set all three sysctl thresholds, net.ipv4.neigh.default.gc_thresh1 through gc_thresh3, to 130816.
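The ARP table sizing above can be worked out for other cluster shapes with a few lines of shell. This is a sketch under the same formula, (hosts - 1) * interfaces * interfaces. The function name is hypothetical, and the script only prints the `sysctl` commands rather than applying them, since changing these settings requires root on each host.

```shell
#!/bin/bash
# Hypothetical helper: ARP entries needed per host for a full mesh ping,
# using (hosts - 1) * interfaces_per_host * interfaces_per_host.
arp_entries_needed() {
  local hosts=$1 intf_per_host=$2
  echo $(( (hosts - 1) * intf_per_host * intf_per_host ))
}

size=$(arp_entries_needed 512 16)
echo "ARP entries needed: ${size}"

# Print (do not run) the commands that would raise all three thresholds.
for n in 1 2 3; do
  echo "sysctl -w net.ipv4.neigh.default.gc_thresh${n}=${size}"
done
```

For the 512-host, 16-interface example this yields 130816, matching the figure quoted above.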
174+
175+
**--threads_per_intf**
176+
177+
By default, meshpinger running on each of the hosts in the hostlist file uses 16 parallel threads per local interface to perform pings in parallel. This option overrides that setting; allowed values are 1-32.
178+
179+
**--verbose**
180+
181+
By default, only ping failures are logged to limit the log file size. When this option is specified, successful pings are also logged.
182+
183+
Lines changed: 5 additions & 51 deletions
Original file line numberDiff line numberDiff line change
@@ -1,56 +1,10 @@
11
#!/bin/bash
22

3-
if [ "$1" == "-h" ]; then
4-
echo "Usage: ./run_meshpinger.sh [options]"
5-
echo "INFO:"
6-
echo "Meshpinger is a tool for validating network layer connectivity between RDMA NICs on a"
7-
echo "cluster network in OCI. The tool is capable of initiating ICMP ping from every RDMA NIC"
8-
echo "port on the cluster network to every other RDMA NIC port on the same cluster network and"
9-
echo "reporting back the success/failure status of the pings performed in the form of logs"
10-
11-
echo "Running the tool before starting workload on a cluster network should serve as a good precheck"
12-
echo "step to gain confidence on the network reachability between RDMA NICs. Typical causes for"
13-
echo "reachability failures that the tool can help pinpoint are,"
14-
echo "1. Link down on the RDMA NIC"
15-
echo "2. RDMA interface initialization or configuration issues including IP address assignment to"
16-
echo "the interface"
17-
echo "3. Insufficient ARP table size on the node to store all needed peer mac addresses"
18-
echo " "
19-
echo "Options:"
20-
echo " -h Display this help message"
21-
echo " [arg1] Enter either --clustername or --hostlist"
22-
echo " [arg2] Enter the clustername or the path of the hostlist based on arg1"
23-
# Exit the script after showing the help message
24-
exit 0
25-
fi
26-
27-
if [ $# -gt 0 ]; then
28-
if [ "$1" == "--clustername" ]; then
29-
cluster_name=$2
30-
else
31-
cat $2 > /tmp/all_hosts
32-
fi
33-
else
34-
cluster_name=`cat /etc/ansible/hosts | grep cluster_name | awk '{print $3}'`
35-
fi
36-
3+
export WRAPPER_BIN="$0"
4+
export WRAPPER_ENV="OCI_HPC_STACK"
375
date
38-
eval "$(ssh-agent -s)" >/dev/null ; ssh-add ~/.ssh/id_rsa >/dev/null
396

40-
if [ -z "$cluster_name" ]; then
41-
if [ -f "$2" ]; then
42-
echo "Using $2 as hostlist"
43-
else
44-
echo "The clustername is empty, running on all hosts"
45-
cat /etc/hosts | grep .local.vcn | awk '{print $2}' > /tmp/all_hosts
46-
fi
47-
else
48-
echo "Clustername is $2"
49-
cat /etc/hosts | grep .local.vcn | grep ${cluster_name} | awk '{print $2}' > /tmp/all_hosts
50-
fi
51-
output_file="/tmp/failed_nodes"
52-
/opt/oci-hpc/healthchecks/meshpinger_bm/run_meshpinger --hostlistfile /tmp/all_hosts --singlesubnet --ping_timeout 100 2>&1 | grep "INCOM\|DELAY" | awk '{print $6}' | sort -u | tee ${output_file}
7+
eval "$(ssh-agent -s)" >/dev/null ; ssh-add ~/.ssh/id_rsa >/dev/null
538

54-
if [ ! -s "$output_file" ]; then
55-
echo "No nodes have RDMA connections unreachable. The list of tested nodes is at /tmp/all_hosts"
56-
fi
9+
# Run meshpinger
10+
/opt/oci-hpc/healthchecks/meshpinger_bm/run_meshpinger "$@"
