| 1 | + |
| 2 | +# OCI Meshpinger |
| 3 | + |
| 4 | +Meshpinger is a tool for validating network layer connectivity between RDMA NICs on a |
| 5 | +cluster network in OCI. The tool initiates an ICMP ping from every RDMA NIC |
| 6 | +port on the cluster network to every other RDMA NIC port on the same cluster network and |
| 7 | +reports back the success/failure status of the pings performed in the form of logs. |
| 8 | + |
| 9 | +Running the tool before starting a workload on a cluster network serves as a precheck |
| 10 | +step that builds confidence in the network reachability between RDMA NICs. Typical causes of |
| 11 | +reachability failures that the tool can help pinpoint are: |
| 12 | + |
| 13 | +1. Host RDMA interface down |
| 14 | + |
| 15 | +2. Host RDMA interface missing IP configuration |
| 16 | + |
| 17 | +3. Host RDMA interface missing MAC address |
| 18 | + |
| 19 | +4. Host RDMA interface enumeration issues |
| 20 | + |
| 21 | +5. Network connectivity issues between a <src,dst> pair of IPs |
| 22 | + |
| 23 | +# Running Meshpinger |
| 24 | + |
| 25 | +Meshpinger is installed on the controller host of the HPC cluster. Once logged in to the controller host, a user can trigger meshpinger using one of the following options: |
| 26 | + |
| 27 | +- Run meshpinger on all hosts in the cluster. The cluster is auto-detected in this option. |
| 28 | +``` |
| 29 | +/opt/oci-hpc/healthchecks/run_meshpinger.sh |
| 30 | +``` |
| 31 | + |
| 32 | +- Run meshpinger on all hosts in a cluster specified explicitly by cluster name. |
| 33 | +``` |
| 34 | +/opt/oci-hpc/healthchecks/run_meshpinger.sh --hpcclustername <hpcclustername> |
| 35 | +``` |
| 36 | + |
| 37 | +- Run meshpinger on a list of hosts specified in a file. A host can be specified by its IP address or hostname, and it must be reachable over SSH from the controller host. |
| 38 | +``` |
| 39 | +/opt/oci-hpc/healthchecks/run_meshpinger.sh --hostlistfile <filename> |
| 40 | +``` |
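
As a rough illustration of the host-list option, the sketch below writes a small host file and passes it to the script. The file name `hostlist.txt`, the one-host-per-line layout, and the IP address are assumptions made for this example; the two hostnames are borrowed from the sample output in this document.

```shell
#!/usr/bin/env bash
# Sketch only: file name, one-host-per-line layout, and the IP are assumptions.
HOSTLIST=hostlist.txt
RUNNER=/opt/oci-hpc/healthchecks/run_meshpinger.sh

# Hosts may be listed by hostname or by IP address.
cat > "$HOSTLIST" <<'EOF'
GPU-711
GPU-278
10.0.0.15
EOF

# Run meshpinger only when the script is present (i.e. on the controller host).
if [ -x "$RUNNER" ]; then
  "$RUNNER" --hostlistfile "$HOSTLIST"
else
  echo "would run: $RUNNER --hostlistfile $HOSTLIST"
fi
```

Each entry should be a name or address that the controller host can SSH to, since the same string is what appears as Hostid in the final report.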
| 41 | + |
| 42 | +# Output |
| 43 | + |
| 44 | +- All RDMA interface configuration issues are reported as in the sample below: |
| 45 | + |
| 46 | +``` |
| 47 | +Faulty RDMA interfaces(Link down/misconfigured) |
| 48 | + |
| 49 | + Hostid/Serial/hostname Interface RDMA_IP PCI MAC Link Status |
| 50 | +-- -------------------------- ----------- --------- ------------ ----------------- ------------- |
| 51 | + 0 GPU-711/2109XCL016/GPU-711 rdma1 0.0.0.0 0000:98:00.1 b8:ce:f6:00:12:29 DOWN |
| 52 | + 1 GPU-278/2110XCL04V/GPU-278 rdma1 0.0.0.0 0000:98:00.1 04:3f:72:e0:6b:0d DOWN |
| 53 | +``` |
| 54 | + |
| 55 | +- If the run has ping failures, the total number of unique <srcInterface,dstInterface> pings that failed per host is printed as a table, as in the sample below: |
| 56 | + |
| 57 | +``` |
| 58 | +ICMP ping failures per host |
| 59 | + |
| 60 | + Hostid/Serial/Hostname Total Failures |
| 61 | +-- -------------------------- ---------------- |
| 62 | + 0 GPU-711/2109XCL016/GPU-711 1 |
| 63 | + 1 GPU-278/2110XCL04V/GPU-278 1 |
| 64 | +``` |
| 65 | +The name of the log file for the current run, which enumerates all <srcInterface,dstInterface> combinations that failed ping, is printed as follows: |
| 66 | + |
| 67 | +``` |
| 68 | +<src,dst> interfaces that failed ping are listed at end of the log file meshpinger_log_20241008220615_ocid1.tenancy.oc1..aaaaaaaabddc4obuhgvifcrh6esmw6554ityaqrvxulcksl255gbwehtcq.txt |
| 69 | +``` |
| 70 | + |
| 71 | + |
| 72 | +- If the run has no ping failures, the following message is printed: |
| 73 | + |
| 74 | +``` |
| 75 | +All pings succeeded!! |
| 76 | +``` |
| 77 | +- Cluster information, including the RDMA interface details gathered during the run, is stored in the file cluster_info.txt in the current directory. The file name is printed as follows: |
| 78 | +``` |
| 79 | +clusterinfo file - cluster_info.txt |
| 80 | +``` |
| 81 | + |
| 82 | +# Options |
| 83 | +Other options supported are shown in the help text below. |
| 84 | + |
| 85 | +``` |
| 86 | +/opt/oci-hpc/healthchecks/run_meshpinger.sh --help |
| 87 | + |
| 88 | +usage: ./run_meshpinger.sh [-h] |
| 89 | + [--hostlistfile HOSTLISTFILE | --hpcclustername HPCCLUSTERNAME] |
| 90 | + [--clusterinfo CLUSTERINFO] [--ssh_port SSH_PORT] |
| 91 | + [--ping_timeout PING_TIMEOUT] |
| 92 | + [--dump_arp_on_failure] [--flush_arp] |
| 93 | + [--nic_model NIC_MODEL] |
| 94 | + [--objectstoreurl OBJECTSTOREURL] [--enable_inter_rail_ping] |
| 95 | + [--threads_per_intf THREADS_PER_INTF] [--verbose] |
| 96 | + |
| 97 | +optional arguments: |
| 98 | + -h, --help show this help message and exit |
| 99 | + --hostlistfile HOSTLISTFILE |
| 100 | + File listing name/ip of the hosts to include in |
| 101 | + meshping |
| 102 | + --hpcclustername HPCCLUSTERNAME |
| 103 | + OCI HPC stack clustername |
| 104 | + --clusterinfo CLUSTERINFO |
| 105 | + Use this cluster info file (generated from previous |
| 106 | + runs) and skip gathering cluster information in this |
| 107 | + run |
| 108 | + --ssh_port SSH_PORT ssh port to use, port 22 will be used if not specified |
| 109 | + --ping_timeout PING_TIMEOUT |
| 110 | + Duration ping waits for reply before timing out, |
| 111 | + default is 1sec |
| 112 | + --dump_arp_on_failure |
| 113 | + Log arp entry for failed pings |
| 114 | + --flush_arp Flush arp cache before starting pinger |
| 115 | + --nic_model NIC_MODEL |
| 116 | + Model of the RDMA NIC eg. MT2910(CX-7) to use if auto |
| 117 | + detect fails |
| 118 | + --objectstoreurl OBJECTSTOREURL |
| 119 | + ObjectStore PAR URL where mesh pinger logs will be |
| 120 | + uploaded |
| 121 | + --enable_inter_rail_ping |
| 122 | + Include this argument to perform pings across the rails. |
| 123 | + If so pinger will do a full mesh ping |
| 124 | + --threads_per_intf THREADS_PER_INTF |
| 125 | + parallel ping threads per local rdma interface, |
| 126 | + default is 16 |
| 127 | + --verbose Log all debug messages including successful pings. |
| 128 | + Default is to log only failed pings |
| 129 | +``` |
| 130 | + |
| 131 | +# Description |
| 132 | +A detailed description of each option follows. |
| 133 | + |
| 134 | +**--hostlistfile** |
| 135 | + |
| 136 | +Path to a file containing the list of hosts to use for the current meshpinger run. A host can be specified by its IP address or its hostname, but it must be reachable over SSH using the string specified. The string specified here is listed as the Hostid in the final report of the meshpinger run. |
| 137 | + |
| 138 | +**--hpcclustername** |
| 139 | + |
| 140 | +Cluster name specified when the cluster was created using the OCI HPC stack. |
| 141 | + |
| 142 | + |
| 143 | +**--clusterinfo** |
| 144 | + |
| 145 | +File containing cluster information generated by a previous meshpinger run. When this option is specified, the current run skips gathering RDMA interface details from the hosts and moves straight to the meshping tests, which saves some runtime. Note that this option forces meshpinger to use the RDMA interface details collected previously, which could be stale, especially for attributes such as link state and IP assignment. |
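
One way to limit the risk of stale data is to reuse `cluster_info.txt` only when it was written recently. The sketch below is an illustration, not part of meshpinger: the `build_args` helper and the 60-minute threshold are hypothetical, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch only: build_args and MAX_AGE_MIN are hypothetical, not part of meshpinger.
RUNNER=/opt/oci-hpc/healthchecks/run_meshpinger.sh
INFO=cluster_info.txt
MAX_AGE_MIN=60   # assumed freshness threshold, in minutes

# Emit "--clusterinfo <file>" only if the file exists and was modified
# less than max_age minutes ago; otherwise emit nothing so that
# meshpinger gathers cluster information again.
build_args() {
  local info=$1 max_age=$2
  if [ -f "$info" ] && [ -n "$(find "$info" -mmin -"$max_age" 2>/dev/null)" ]; then
    echo "--clusterinfo $info"
  fi
}

ARGS=$(build_args "$INFO" "$MAX_AGE_MIN")
echo "would run: $RUNNER $ARGS"
```

A threshold like this is a judgment call: link state and IP assignment can change at any time, so a fresh gather is the safer choice after any maintenance on the cluster.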
| 146 | + |
| 147 | +**--ssh_port** |
| 148 | + |
| 149 | +Port to use for SSH to the hosts specified in the hostlistfile. Port 22 is used if this option is not specified. |
| 150 | + |
| 151 | +**--ping_timeout** |
| 152 | + |
| 153 | +Time in milliseconds that ping waits for a successful reply from the remote IP, including the time it takes for ARP resolution. If this option is not specified, the timeout is 1 second. Meshpinger performs 10 retries for each remote IP before marking it as a ping failure. |
| 154 | + |
| 155 | +**--dump_arp_on_failure** |
| 156 | + |
| 157 | +When this option is specified, the ARP table entry (including the status field) for the remote IP on the local host is dumped to the meshpinger logs for each ping failure. This is disabled by default. |
| 158 | + |
| 159 | +**--flush_arp** |
| 160 | + |
| 161 | +When this option is specified, meshpinger flushes the ARP table on each host before starting the ping validation test. |
| 162 | + |
| 163 | +**--nic_model** |
| 164 | + |
| 165 | +NIC model to use (e.g. MT2910 for CX-7) for distinguishing RDMA interfaces from front-end network interfaces while gathering RDMA interface information on each host. By default, meshpinger picks the model shared by the majority of interfaces on the host, given that the backend network interface count always exceeds the frontend network interface count. |
| 166 | + |
| 167 | +**--objectstoreurl** |
| 168 | + |
| 169 | +Pre-Authenticated Request (PAR) URL where meshpinger logs will be uploaded. Customers can use this to share meshpinger logs with OCI during an incident. OCI can create a PAR for an object store bucket and share it with the customer for this purpose. |
| 170 | + |
| 171 | +**--enable_inter_rail_ping** |
| 172 | + |
| 173 | +This option specifies that all RDMA interfaces on the hosts in the hostlist file are part of a single subnet. In this case meshpinger pings all remote IPs from all local interfaces on a given host. Note that when this option is chosen, the net.ipv4.neigh.default.gc_threshX [X=1-3] sysctl settings on every host may need to be raised to hold the necessary ARP entries per local interface. E.g. to run meshpinger on a 512-host cluster with 16 RDMA interfaces per host, the ARP table size should be at least 130816 (511 * 16 * 16). Accordingly, it is recommended to set all 3 sysctl thresholds, net.ipv4.neigh.default.gc_threshX [X=1-3], to 130816. By default, meshpinger only pings along the rails. |
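
The sizing arithmetic above can be sketched as a small script. The `arp_entries` helper and the variable names are hypothetical; the host and interface counts are the ones from the example. The sysctl commands are printed for review rather than executed.

```shell
#!/usr/bin/env bash
# Sketch only: arp_entries and the variable names are hypothetical.
HOSTS=512            # hosts in the cluster (from the example above)
INTFS_PER_HOST=16    # RDMA interfaces per host (from the example above)

# Entries needed on one host: every local interface may resolve every
# interface on every remote host: (hosts - 1) * intfs * intfs.
arp_entries() {
  local hosts=$1 intfs=$2
  echo $(( (hosts - 1) * intfs * intfs ))
}

THRESH=$(arp_entries "$HOSTS" "$INTFS_PER_HOST")
echo "required ARP entries per host: $THRESH"

# Print the commands instead of running them, so they can be reviewed first.
for x in 1 2 3; do
  echo "sudo sysctl -w net.ipv4.neigh.default.gc_thresh${x}=${THRESH}"
done
```

Values set with `sysctl -w` do not survive a reboot, so a persistent setting would have to be placed in the host's sysctl configuration as well.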
| 174 | + |
| 175 | +**--threads_per_intf** |
| 176 | + |
| 177 | +By default, meshpinger on each host in the hostlist file uses 16 parallel threads per local interface to perform pings. This option overrides that setting; allowed values are 1-32. |
| 178 | + |
| 179 | +**--verbose** |
| 180 | + |
| 181 | +By default, only ping failures are logged to limit the log file size. When this option is specified, successful pings are also logged. |
| 182 | + |
| 183 | + |