Commit d5c317f

Add active health checks doc and manifest
1 parent 5d8203d commit d5c317f

2 files changed, +992 -0 lines changed
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# Running Active Health Checks (preview)

> [!NOTE]
> This is a preview feature. We are actively adding more tests.

This directory contains the manifests required to run the NCCL-based active health checks on GPU nodes using Volcano. It includes an applier CronJob that schedules tests only on idle nodes that have not been tested within the last 24 hours (the window is configurable).
## Node selection logic
When the CronJob runs, the applier script performs the following steps:

1. **Enumerate GPU nodes:** It looks for nodes that carry the `nvidia.com/gpu=true` label.
2. **Check current usage:** It sums the GPU requests across running pods on each node. Only nodes with zero GPU usage are considered idle.
3. **Exclude recently tested nodes:** If a node's `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` label shows a run within the last 24 hours, the node is skipped.
4. **Require at least two nodes:** The test needs two available worker nodes. If fewer than two nodes remain, the job exits gracefully.
5. **Shape detection:** The selected node's `node.kubernetes.io/instance-type` label determines which ConfigMap manifest to apply.
6. **Job creation:** A Volcano `Job` is created with a launcher (`mpimaster`) and workers (`mpiworker`). The launcher waits for SSH connectivity to the workers before running the NCCL test.
7. **Label updates:** After the run, nodes are labeled with the latest result and timestamp.

If all nodes are excluded (either busy or already tested), the job exits without creating a Volcano job and logs the reason.
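Steps 3 and 4 of the node selection logic can be sketched as two small shell helpers. This is a minimal sketch, not the applier script itself: the function names `recently_tested` and `enough_nodes` are hypothetical, and it assumes the last-run label stores a Unix epoch timestamp in seconds, which this README does not confirm.

```bash
# Sketch only. Assumption: the last-run label value is a Unix epoch in seconds.

# Return 0 (true) if the node was tested inside the window, 1 otherwise.
# Args: last_run_epoch (may be empty), now_epoch, window_hours (default 24)
recently_tested() {
  local last_run="$1" now="$2" window_hours="${3:-24}"
  # A node without the label has never been tested, so it is not excluded.
  [ -z "$last_run" ] && return 1
  [ $(( now - last_run )) -lt $(( window_hours * 3600 )) ]
}

# Return 0 (true) if enough candidate nodes remain for a two-node test.
# Args: count of idle, not recently tested nodes
enough_nodes() {
  [ "$1" -ge 2 ]
}
```

A caller could skip a node when `recently_tested "$label_value" "$(date +%s)" 24` succeeds, and exit early when `enough_nodes "$candidate_count"` fails.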
## Usage
The manifest assumes that a namespace called `monitoring` exists. To deploy to another namespace, edit the manifest accordingly.

1. **Apply manifests**
```bash
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
```
2. **Run ad-hoc test**
```bash
JOB_NAME="test-$(date +%s)"
kubectl create job -n monitoring --from=cronjob/active-health-checks-nccl-tests-applier "$JOB_NAME"

kubectl logs -n monitoring "job/$JOB_NAME"
```
3. **Watch Volcano job**
```bash
kubectl get pods -n monitoring -l volcano.sh/job-name=<job-name>

kubectl logs -n monitoring <launcher-pod>
```
4. **Clean up**
```bash
# The bare `job-name` selector matches every Job in the namespace that carries that label.
kubectl delete job -n monitoring -l job-name

kubectl delete cronjob active-health-checks-nccl-tests-applier -n monitoring
```
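To see when each GPU node was last tested, the last-run label described in the node selection logic can be printed as an extra column. The helper below is a minimal sketch: `show_last_runs` and the `KUBECTL` override are hypothetical names introduced here, while `-l` (label selector) and `-L` (label columns) are standard `kubectl get` flags.

```bash
# Sketch only: show_last_runs is a hypothetical helper, not part of the manifests.
# Set KUBECTL=echo to preview the command without contacting a cluster.
show_last_runs() {
  "${KUBECTL:-kubectl}" get nodes \
    -l nvidia.com/gpu=true \
    -L oke.oraclecloud.com/active-health-checks-nccl-tests-last-run
}
```

Running `show_last_runs` lists the GPU nodes with the label value in its own column. An empty value would suggest that the node has not been tested yet.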
