
Commit 5fe6cb5

author
Vijay Manickam
committed
Address review comments
1 parent 9db7192 commit 5fe6cb5

File tree

1 file changed

+18
-14
lines changed


playbooks/roles/healthchecks/files/meshpinger_readme.md

Lines changed: 18 additions & 14 deletions
Original file line number | Diff line number | Diff line change
@@ -2,9 +2,9 @@
22
# OCI Meshpinger
33

44
Meshpinger is a tool for validating network layer connectivity between RDMA NICs on a
5-
cluster network in OCI. The tool is capable of initiating ICMP ping from every RDMA NIC
5+
cluster network in OCI. The tool initiates an ICMP ping from every RDMA NIC
66
port on the cluster network to every other RDMA NIC port on the same cluster network and
7-
reporting back the success/failure status of the pings performed in the form of logs
7+
reports back the success/failure status of the pings performed in the form of logs
88

99
Running the tool before starting workload on a cluster network should serve as a good precheck
1010
step to gain confidence on the network reachability between RDMA NICs. Typical causes for
@@ -18,23 +18,23 @@ reachability failures that the tool can help pinpoint are,
1818

1919
4. Host rdma interface enumeration issues
2020

21-
5. ping failure between a <src,dst> pair of IPs
21+
5. Network connectivity issues between a <src,dst> pair of IPs
2222

2323
# Running Meshpinger
2424

25-
Meshpinger is installed on controller node of the hpc cluster and can be run in following ways after logging into the controller node
25+
Meshpinger is installed on the controller host of the HPC cluster. Once a user is logged in to the controller host, they can trigger meshpinger using one of the following options:
2626

27-
- Run meshpinger on all nodes in the cluster, cluster is auto-detected in this case
27+
- Run meshpinger on all hosts in the cluster. The cluster is auto-detected in this option.
2828
```
2929
/opt/oci-hpc/healthchecks/run_meshpinger.sh
3030
```
3131

32-
- Run meshpinger on all nodes in the cluster explicitly specified by clustername
32+
- Run meshpinger on all hosts in the cluster explicitly specified by clustername
3333
```
3434
/opt/oci-hpc/healthchecks/run_meshpinger.sh --hpcclustername <hpcclustername>
3535
```
3636

37-
Run meshpinger on a list of nodes specified in a file. A host can be specified by its ssh IP address or hostname but it should be SSH-able from controller node
37+
Run meshpinger on a list of hosts specified in a file. A host can be specified by its IP address or hostname. Each host is expected to be SSH-able from the controller host.
3838
```
3939
/opt/oci-hpc/healthchecks/run_meshpinger.sh --hostlisttfile <filename>
4040
```
@@ -65,7 +65,7 @@ ICMP ping failures per host
6565
Logfile of the current run that enumerates all <srcInterface,dstInterface> combinations that failed ping is printed like,
6666

6767
```
68-
<src,dst> interfaces that failed ping is listed at end of the log file meshpinger_log_20241008220615_ocid1.tenancy.oc1..aaaaaaaabddc4obuhgvifcrh6esmw6554ityaqrvxulcksl255gbwehtcq.txt
68+
<src,dst> interfaces that failed ping are listed at end of the log file meshpinger_log_20241008220615_ocid1.tenancy.oc1..aaaaaaaabddc4obuhgvifcrh6esmw6554ityaqrvxulcksl255gbwehtcq.txt
6969
```
7070

7171

@@ -74,6 +74,10 @@ Logfile of the current run that enumerates all <srcInterface,dstInterface> combi
7474
```
7575
All pings succeeded!!
7676
```
77+
- Cluster information, including the RDMA interface details gathered during the run, is stored in the file cluster_info.txt in the current directory. The file name is printed as shown below:
78+
```
79+
clusterinfo file - cluster_info.txt
80+
```
7781

7882
# Options
7983
Other options supported are shown in the help text below.
@@ -114,9 +118,9 @@ optional arguments:
114118
--objectstoreurl OBJECTSTOREURL
115119
ObjectStore PAR URL where mesh pinger logs will be
116120
uploaded
117-
--singlesubnet Include this argument if all RDMA NICs are on a single
118-
subnetted cluster network. If so pinger will do a full
119-
mesh ping
121+
--enable_inter_rail_ping
122+
Include this argument to perform pings across the rails.
123+
If so pinger will do a full mesh ping
120124
--threads_per_intf THREADS_PER_INTF
121125
parallel ping threads per local rdma interface,
122126
default is 16
@@ -162,11 +166,11 @@ NIC model to use (e.g MT2910 for CX-7) for filtering out RDMA interfaces from fr
162166

163167
**--objectstoreurl**
164168

165-
Pre-Authenticated Request(PAR) url where meshpinger logs will be uploaded. This can be used by customers to easily share meshpinger logs to OCI operator. OCI operator can create a PAR to objectstore bucket owned by them and share it with customer to enable them to share the logs
169+
Pre-Authenticated Request (PAR) URL where meshpinger logs will be uploaded. Customers can use this to share meshpinger logs with OCI during an incident. OCI can create a PAR for an object store bucket and share it with the customer to enable sharing of meshpinger logs.
166170

167-
**--singlesubnet**
171+
**--enable_inter_rail_ping**
168172

169-
This option specifies all rdma interfaces on hosts in the hostlist file are part of a single subnet. In this case meshpinger will do pings to all remote IPs from all local interfaces on a given host. It is to be noted that when this option is chosen net.ipv4.neigh.default.gc_threshX [X=1-3] sysctl setting on every host may need to be bumped up to hold the necessary arp entries per local interface. Eg. For running meshpinger on a 512 host cluster with each host having 16 rdma interface, size of the arp table should be atleast 130816(511 * 16 * 16). Accordingly it is recommended to set all the 3 sysctl thresholds - net.ipv4.neigh.default.gc_threshX[X=1-3] to 130816. Be default meshpinger assumes each rdma interface on a host is on a separate subnet and performs pings between rdma interfaces that have the same pci address.
173+
This option specifies that all RDMA interfaces on hosts in the hostlist file are part of a single subnet. In this case meshpinger pings all remote IPs from every local interface on a given host. Note that when this option is chosen, the net.ipv4.neigh.default.gc_threshX [X=1-3] sysctl settings on every host may need to be raised to hold the necessary ARP entries per local interface. For example, to run meshpinger on a 512-host cluster with 16 RDMA interfaces per host, the ARP table must hold at least 130816 entries (511 * 16 * 16). Accordingly, it is recommended to set all three sysctl thresholds, net.ipv4.neigh.default.gc_threshX [X=1-3], to 130816. By default, meshpinger only pings along the rails.
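
The ARP table sizing described for this option can be reproduced with a short calculation. The sketch below is illustrative only and is not part of meshpinger: the function names are hypothetical, and the rail-only estimate assumes that a rail pairs each local interface with exactly one interface on every remote host.

```python
def full_mesh_arp_entries(num_hosts: int, intfs_per_host: int) -> int:
    """Neighbor entries one host needs when every local interface
    pings every RDMA interface on every other host."""
    remote_ips = (num_hosts - 1) * intfs_per_host
    return remote_ips * intfs_per_host


def rail_only_arp_entries(num_hosts: int, intfs_per_host: int) -> int:
    """Assumed estimate for the default mode: each local interface
    pings one matching interface per remote host."""
    return (num_hosts - 1) * intfs_per_host


def gc_thresh_settings(threshold: int) -> list[str]:
    """Render sysctl.conf-style lines for the three neighbor-table thresholds."""
    return [
        f"net.ipv4.neigh.default.gc_thresh{x} = {threshold}"
        for x in (1, 2, 3)
    ]


if __name__ == "__main__":
    # 512 hosts with 16 RDMA interfaces each, as in the example above.
    needed = full_mesh_arp_entries(512, 16)
    for line in gc_thresh_settings(needed):
        print(line)
```

For the 512-host example, the full-mesh figure matches the 130816 entries quoted above, while the rail-only estimate is 16 times smaller. This is consistent with the thresholds only needing attention when inter-rail pings are enabled.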
170174

171175
**--threads_per_intf**
172176
