|
| 1 | +# Running Active Health Checks (preview) |
| 2 | + |
| 3 | +> [!NOTE] |
| 4 | +> This is a preview feature. We are actively adding more tests. |
| 5 | +
|
| 6 | +This readme describes the manifests required to run the NCCL-tests active health checks on GPU nodes using Volcano. The manifests include an applier Kubernetes CronJob that schedules tests only on idle nodes that have not been tested within the last 24 hours (the window is configurable). |
| 7 | + |
| 8 | +## Node selection logic |
| 9 | + |
| 10 | +When the CronJob runs, the applier script performs the following steps: |
| 11 | + |
| 12 | +1. **Enumerate GPU nodes:** It looks for nodes that have the `nvidia.com/gpu=true` label. |
| 13 | +2. **Check current usage:** It sums the GPU requests across running pods on each node. Only nodes with zero requested GPUs are considered idle. |
| 14 | +3. **Exclude recently tested nodes:** If a node's `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` label records a run within the last 24 hours, the node is skipped. |
| 15 | +4. **Require at least two nodes:** The test needs two worker nodes. If fewer than two eligible nodes remain, the job exits gracefully. |
| 16 | +5. **Shape detection:** The selected node’s `node.kubernetes.io/instance-type` label determines which ConfigMap manifest to apply. |
| 17 | +6. **Job creation:** A Volcano `Job` is created with a launcher (`mpimaster`) and workers (`mpiworker`). The launcher waits for SSH connectivity to the workers before running the NCCL test. |
| 18 | +7. **Label updates:** After the run, nodes are labeled with the latest result and timestamp. |
| 19 | + |
| 20 | +If all nodes are excluded (either busy or already tested), the applier job logs the reason and exits without creating a Volcano job. |
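
The filtering in steps 2 to 4 can be sketched as a pair of shell helpers. This is a minimal sketch under stated assumptions, not the applier script itself: the function names `recently_tested` and `eligible_nodes`, the `node gpu_requests last_run` input format, and the idea that the last-run label stores epoch seconds are all hypothetical and chosen only for illustration.

```bash
#!/bin/sh
# Sketch only: hypothetical helpers, not taken from the applier script.

WINDOW_SECONDS=$((24 * 60 * 60))  # 24-hour exclusion window

# recently_tested LAST_RUN NOW [WINDOW]
# Succeeds when LAST_RUN (assumed to be epoch seconds) lies inside the window
# that ends at NOW.
recently_tested() {
  last_run=$1; now=$2; window=${3:-$WINDOW_SECONDS}
  case $last_run in
    ''|*[!0-9]*) return 1 ;;  # missing or non-numeric value: treat as never tested
  esac
  [ $((now - last_run)) -lt "$window" ]
}

# eligible_nodes NOW
# Reads "node gpu_requests last_run" lines from stdin (assumed format) and
# prints the nodes that are idle and were not tested recently.
eligible_nodes() {
  now=$1
  while read -r node gpus last_run; do
    [ "$gpus" -eq 0 ] || continue                     # busy node: skip
    recently_tested "$last_run" "$now" && continue    # tested recently: skip
    printf '%s\n' "$node"
  done
}
```

A caller could count the lines printed by `eligible_nodes "$(date +%s)"` and exit early when fewer than two nodes remain, which mirrors the "require at least two nodes" step.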
| 21 | + |
| 22 | +> [!NOTE] |
| 23 | +> The test jobs run as low-priority jobs and can be evicted if higher-priority workloads are pending. This ensures that health checks do not interfere with production workloads. |
| 24 | +
|
| 25 | +## Usage |
| 26 | +The manifest assumes there's a namespace called `monitoring`. If you want to deploy to another namespace, edit the manifest accordingly. |
| 27 | + |
| 28 | +1. **Apply manifests** |
| 29 | + ```bash |
| 30 | + kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml |
| 31 | + ``` |
| 32 | + |
| 33 | +2. **Run ad-hoc test** |
| 34 | + ```bash |
| 35 | + JOB_NAME="test-$(date +%s)" |
| 36 | + kubectl create job -n monitoring --from=cronjob/active-health-checks-nccl-tests-applier "$JOB_NAME" |
| 37 | + kubectl logs -n monitoring "job/$JOB_NAME" |
| 38 | + ``` |
| 39 | + |
| 40 | +3. **Watch Volcano job** |
| 41 | + ```bash |
| 42 | + kubectl get pods -n monitoring -l volcano.sh/job-name=<job-name> |
| 43 | + |
| 44 | + kubectl logs -n monitoring <launcher-pod> |
| 45 | + ``` |
| 46 | + |
| 47 | +4. **Clean up** |
| 48 | + ```bash |
| 49 | + kubectl delete job -n monitoring -l job-name |
| 50 | + |
| 51 | + kubectl delete cronjob active-health-checks-nccl-tests-applier -n monitoring |
| 52 | + ``` |
| 53 | + |