docs/running-active-health-checks.md: 3 additions & 3 deletions
@@ -9,9 +9,9 @@ This readme contains the manifests required to run the NCCL-tests active health
When the CronJob runs, the applier script performs the following steps:
-1. **Enumerate GPU nodes** Look for nodes that have (`nvidia.com/gpu=true`) label.
+1. **Enumerate GPU nodes**: Look for nodes that have the `nvidia.com/gpu=true` label.
2. **Check current usage:** It sums the GPU requests across running pods on each node. Only nodes with zero GPU usage are considered idle.
-3. **Exclude recently tested nodes:** If a node is labeled `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` with in the last 24 hours, it is skipped.
+3. **Exclude recently tested nodes:** If a node is labeled `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` within the last 24 hours, it is skipped.
4. **Require at least two nodes:** Both worker nodes must be available. If fewer than two nodes remain, the job exits gracefully.
5. **Shape detection:** The selected node’s `node.kubernetes.io/instance-type` label determines which ConfigMap manifest to apply.
6. **Job creation:** A Volcano `Job` is created with a launcher (`mpimaster`) and workers (`mpiworker`). The launcher waits for SSH connectivity to the workers before running the NCCL test.
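
The node-selection steps (1 through 5) can be sketched as a pure function over node labels. This is only an illustrative sketch, not the applier script itself. The input shapes (`nodes` as a name-to-labels dict, `gpu_requests` as a name-to-count dict), the function names, and the assumption that the last-run label value holds Unix epoch seconds are all hypothetical and not taken from the README.

```python
from datetime import datetime, timedelta, timezone

GPU_LABEL = "nvidia.com/gpu"
LAST_RUN_LABEL = "oke.oraclecloud.com/active-health-checks-nccl-tests-last-run"
SHAPE_LABEL = "node.kubernetes.io/instance-type"


def select_idle_nodes(nodes, gpu_requests, now, cooldown=timedelta(hours=24)):
    """Return names of GPU nodes that are idle and not tested within `cooldown`.

    nodes: dict of node name -> dict of labels (hypothetical input shape).
    gpu_requests: dict of node name -> summed GPU requests of running pods.
    now: timezone-aware datetime used as the reference time.
    """
    selected = []
    for name, labels in sorted(nodes.items()):
        # Step 1: keep only nodes carrying the GPU label.
        if labels.get(GPU_LABEL) != "true":
            continue
        # Step 2: keep only nodes with zero GPU requests.
        if gpu_requests.get(name, 0) > 0:
            continue
        # Step 3: skip nodes tested within the cooldown window.
        last_run = labels.get(LAST_RUN_LABEL)
        if last_run is not None:
            # Assumption: the label value is Unix epoch seconds.
            tested_at = datetime.fromtimestamp(int(last_run), tz=timezone.utc)
            if now - tested_at < cooldown:
                continue
        selected.append(name)
    return selected


def plan_test(nodes, gpu_requests, now):
    """Return (node_pair, shape), or None when fewer than two nodes qualify."""
    idle = select_idle_nodes(nodes, gpu_requests, now)
    # Step 4: at least two nodes are required; otherwise exit gracefully.
    if len(idle) < 2:
        return None
    pair = idle[:2]
    # Step 5: the instance-type label of the selected node picks the manifest.
    shape = nodes[pair[0]].get(SHAPE_LABEL)
    return pair, shape
```

Keeping the selection logic free of cluster calls makes it easy to check against hand-written label sets before wiring it to real node and pod listings.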
@@ -31,7 +31,7 @@ The manifest assumes there's a namespace called `monitoring`. If you want to dep