Commit e35a7a7

Update readme
1 parent 708e9b0 commit e35a7a7

1 file changed: +3 -3 lines changed

docs/running-active-health-checks.md

Lines changed: 3 additions & 3 deletions
@@ -9,9 +9,9 @@ This readme contains the manifests required to run the NCCL-tests active health
 
 When the CronJob runs, the applier script performs the following steps:
 
-1. **Enumerate GPU nodes** Look for nodes that have (`nvidia.com/gpu=true`) label.
+1. **Enumerate GPU nodes**: Look for nodes that have the `nvidia.com/gpu=true` label.
 2. **Check current usage:** It sums the GPU requests across running pods on each node. Only nodes with zero GPU usage are considered idle.
-3. **Exclude recently tested nodes:** If a node is labeled `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` with in the last 24 hours, it is skipped.
+3. **Exclude recently tested nodes:** If a node is labeled `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` within the last 24 hours, it is skipped.
 4. **Require at least two nodes:** Both worker nodes must be available. If fewer than two nodes remain, the job exits gracefully.
 5. **Shape detection:** The selected node’s `node.kubernetes.io/instance-type` label determines which ConfigMap manifest to apply.
 6. **Job creation:** A Volcano `Job` is created with a launcher (`mpimaster`) and workers (`mpiworker`). The launcher waits for SSH connectivity to the workers before running the NCCL test.
@@ -31,7 +31,7 @@ The manifest assumes there's a namespace called `monitoring`. If you want to dep
 ```bash
 kubectl create job -n monitoring --from=cronjob/active-health-checks-nccl-tests-applier test-$(date +%s)
 
-kubectl logs -n monitoring job/shape-test-<timestamp>
+kubectl logs -n monitoring job/test-<timestamp>
 ```
 
 3. **Watch Volcano job**
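
The node-selection steps listed in the first hunk (steps 1 to 4) can be sketched in a few lines of Python. This is only an illustration and not the applier script itself: the function name `select_idle_nodes`, the input shapes (plain dictionaries instead of Kubernetes API objects), and the assumption that the last-run label stores a Unix epoch timestamp in seconds are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

GPU_LABEL = "nvidia.com/gpu"
LAST_RUN_LABEL = "oke.oraclecloud.com/active-health-checks-nccl-tests-last-run"


def select_idle_nodes(nodes, gpu_requests, now, min_age=timedelta(hours=24)):
    """Return the names of nodes eligible for a health check, or [] if fewer than two.

    nodes:        {node_name: {label_key: label_value}}   (hypothetical input shape)
    gpu_requests: {node_name: [GPU request of each running pod]}
    now:          timezone-aware datetime used as the reference time
    """
    selected = []
    for name, labels in nodes.items():
        # Step 1: keep only nodes carrying the nvidia.com/gpu=true label.
        if labels.get(GPU_LABEL) != "true":
            continue
        # Step 2: a node is idle only if its summed GPU requests are zero.
        if sum(gpu_requests.get(name, [])) > 0:
            continue
        # Step 3: skip nodes tested within the last 24 hours.
        # Assumption: the label value is a Unix epoch timestamp in seconds.
        last_run = labels.get(LAST_RUN_LABEL)
        if last_run is not None:
            tested_at = datetime.fromtimestamp(int(last_run), tz=timezone.utc)
            if now - tested_at < min_age:
                continue
        selected.append(name)
    # Step 4: the test needs at least two nodes; otherwise return nothing.
    return selected if len(selected) >= 2 else []
```

Returning an empty list when fewer than two nodes qualify mirrors the "exits gracefully" behavior described in step 4: the caller can simply stop without treating the situation as an error.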
