The nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_06132023191024
The nvidia bug report, sosreport, and console history logs for inst-jxwf6-keen-drake are at /home/ubuntu/inst-jxwf6-keen-drake_11112022001138

for x in `less /home/opc/hostlist` ; do echo $x ; python3 collect_logs.py --hostname $x; done ;
compute-permanent-node-467
The nvidia bug report, sosreport, and console history logs for compute-permanent-node-467 are at /home/ubuntu/compute-permanent-node-467_11112022011318
compute-permanent-node-787
The nvidia bug report, sosreport, and console history logs for compute-permanent-node-787 are at /home/ubuntu/compute-permanent-node-787_11112022011835

Where hostlist had the below contents
compute-permanent-node-467
compute-permanent-node-787

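The shell loop above can also be sketched in Python, which may be easier to extend (for example, to skip blank lines in the hostlist). This is only an illustrative equivalent, not part of the stack: the helper names `read_hostlist`, `build_command`, and `collect_all` are hypothetical, and the only things taken from the example above are the `collect_logs.py --hostname` invocation and the hostlist file with one hostname per line.

```python
import shlex
import subprocess
from pathlib import Path


def read_hostlist(path):
    """Return the non-empty, stripped lines of a hostlist file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_command(hostname):
    """Build the same collect_logs.py invocation the shell loop uses."""
    return ["python3", "collect_logs.py", "--hostname", hostname]


def collect_all(hostlist_path, dry_run=True):
    """Print each hostname, then run (or only print) the command for it."""
    for host in read_hostlist(hostlist_path):
        cmd = build_command(host)
        print(host)
        if dry_run:
            # Show what would be executed without touching any node.
            print(shlex.join(cmd))
        else:
            # check=False: continue with the next host if one fails,
            # which matches the behavior of the shell loop.
            subprocess.run(cmd, check=False)
```

Calling `collect_all("/home/opc/hostlist", dry_run=False)` from the directory that contains `collect_logs.py` would behave like the one-line loop; with the default `dry_run=True` it only prints the commands.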
## Collect RDMA NIC Metrics and Upload to Object Storage

OCI-HPC is deployed in the customer's tenancy, so OCI service teams cannot access metrics from these OCI-HPC stack clusters. To overcome this issue,
this release introduces a feature that collects RDMA NIC metrics and uploads them to Object Storage. The Object Storage URL can later be shared with OCI service
teams, who can then use that URL to access the metrics for debugging purposes.

To collect RDMA NIC metrics and upload them to Object Storage, follow these steps:

Step 1: Create a PAR (Pre-Authenticated Request)
To create a PAR, select the "Create Object Storage PAR" check box during Resource Manager stack creation.
This check box is selected by default. When it is selected, a PAR is created.

Step 2: Run the shell script upload_rdma_nic_metrics.sh to collect metrics and upload them to Object Storage.
The metrics collection limit and interval can be configured through the config file rdma_metrics_collection_config.conf.
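
For readers unfamiliar with PARs, the sketch below shows how a file can, in general, be uploaded through an Object Storage PAR with a plain HTTP PUT. It is not the contents of upload_rdma_nic_metrics.sh, whose internals are not described here. It assumes a bucket-level PAR that permits object writes; the URL in the usage note is a placeholder, and the helper names `object_url` and `upload_file` are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def object_url(par_url, object_name):
    """Append a URL-encoded object name to a bucket-level PAR URL."""
    return par_url.rstrip("/") + "/" + quote(object_name)


def upload_file(par_url, local_path, object_name):
    """PUT a local file through the PAR and return the HTTP status code."""
    with open(local_path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        object_url(par_url, object_name), data=data, method="PUT"
    )
    # No credentials are sent: the PAR URL itself grants access,
    # so it should be treated as a secret.
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A call could look like `upload_file(par_url, "metrics.tar.gz", "metrics.tar.gz")`, where `par_url` is the PAR URL created in Step 1 and the file name is only an example. Because anyone holding the URL can use it, share it only with the intended OCI service team.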