The name of the cluster must be `queueName-clusterNumber-instanceType_keyword`.
The keyword will need to match the one from /opt/oci-hpc/conf/queues.conf to be registered in Slurm.
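For example, with a hypothetical queue named compute and an instance-type keyword hpc defined in /opt/oci-hpc/conf/queues.conf, the second cluster of that queue would be named:

```compute-2-hpc```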
### Cluster Deletion:
Example of cluster command to add a new user:
```cluster user add name```
By default, a `privilege` group is created that has access to the NFS and can have sudo access on all nodes (defined at stack creation; this group has ID 9876). The group name can be modified.
```cluster user add name --gid 9876```
To avoid generating a user-specific key for passwordless ssh between nodes, use `--nossh`.
```cluster user add name --nossh --gid 9876```
# Shared home folder
$ max_nodes --> Information about all the partitions and their respective clusters
$ max_nodes --include_cluster_names xxx yyy zzz --> where xxx, yyy, zzz are cluster names. Provide a space-separated list of cluster names to be considered for displaying the information about clusters and the maximum number of nodes distributed evenly per partition
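For example, to restrict the output to two clusters (the cluster names below are hypothetical):

```$ max_nodes --include_cluster_names compute-1-hpc compute-2-hpc```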
## validation.py usage
Use the alias `validate` to run the Python script validation.py. This script can only be run from the bastion.
The script performs the following checks:
- Check that the number of nodes is consistent across resize, /etc/hosts, Slurm, topology.conf, the OCI console, and the inventory files.
- PCIe bandwidth check
- GPU throttle check
- Check whether the md5 sum of the /etc/hosts file on each node matches the one on the bastion.
Provide at least one argument: [-n NUM_NODES] [-p PCIE] [-g GPU_THROTTLE] [-e ETC_HOSTS]
Optional argument that can be combined with [-n NUM_NODES] [-p PCIE] [-g GPU_THROTTLE] [-e ETC_HOSTS]: [-cn CLUSTER_NAMES]

For -cn, provide a file that lists, one per line, each cluster for which you want to validate the number of nodes and/or run the PCIe, GPU throttle, and /etc/hosts md5 sum checks.
For the PCIe, GPU throttle, and /etc/hosts md5 sum checks, you can either pass y or Y along with -cn, or give a host file path (one host per line) as the value of each argument. For the number of nodes check, either pass y on its own or pass y along with -cn.
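As an illustration, the cluster name file and a host file might look like this (all cluster and host names below are hypothetical):

```
$ cat clusternamefile
compute-1-hpc
compute-2-hpc

$ cat pciehostfile
compute-1-hpc-node-1
compute-1-hpc-node-2
```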
Below are some examples of running this script.
validate -n y --> This will validate that the number of nodes is consistent across resize, /etc/hosts, Slurm, topology.conf, the OCI console, and the inventory files. The clusters considered will be the default cluster, if any, and the cluster(s) found in the /opt/oci-hpc/autoscaling/clusters directory. The number of nodes will be taken from the resize script using those clusters.
validate -n y -cn <clusternamefile> --> This will validate that the number of nodes is consistent across resize, /etc/hosts, Slurm, topology.conf, the OCI console, and the inventory files. It will also check whether the md5 sum of the /etc/hosts file on all nodes matches that on the bastion. The clusters considered will be those in the file specified by the -cn option. The number of nodes will be taken from the resize script using the clusters from the file.
validate -p y -cn <clusternamefile> --> This will run the PCIe bandwidth check. The clusters considered will be those in the file specified by the -cn option. The nodes considered will be taken from the resize script using the clusters from the file.
validate -p <pciehostfile> --> This will run the PCIe bandwidth check on the hosts listed in the given file. The PCIe host file should have one host name per line.
validate -g y -cn <clusternamefile> --> This will run the GPU throttle check. The clusters considered will be those in the file specified by the -cn option. The nodes considered will be taken from the resize script using the clusters from the file.
validate -g <gpucheckhostfile> --> This will run the GPU throttle check on the hosts listed in the given file. The GPU check host file should have one host name per line.
validate -e y -cn <clusternamefile> --> This will run the /etc/hosts md5 sum check. The clusters considered will be those in the file specified by the -cn option. The nodes considered will be taken from the resize script using the clusters from the file.
validate -e <md5sumcheckhostfile> --> This will run the /etc/hosts md5 sum check on the hosts listed in the given file. The md5 sum check host file should have one host name per line.
You can combine all the options together, for example:
validate -n y -p y -g y -e y -cn <clusternamefile>