Commit d146dbe

Merge pull request #71 from oracle-quickstart/active-tests-preview
Add readme and manifests for running NCCL-tests as active health checks
2 parents 5d8203d + 6857cd8 commit d146dbe

2 files changed: +995 -0 lines changed

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Running Active Health Checks (preview)

> [!NOTE]
> This is a preview feature. We are actively adding more tests.

This readme contains the manifests required to run the NCCL-tests active health checks on GPU nodes using Volcano. It includes a smart applier Kubernetes CronJob that only schedules tests on idle nodes that were not already tested in the last 24 hours (configurable).

## Node selection logic

When the CronJob runs, the applier script performs the following steps:

1. **Enumerate GPU nodes:** It looks for nodes that have the `nvidia.com/gpu=true` label.
2. **Check current usage:** It sums the GPU requests across running pods on each node. Only nodes with zero GPU usage are considered idle.
3. **Exclude recently tested nodes:** If a node's `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` label records a run within the last 24 hours, the node is skipped.
4. **Require at least two nodes:** The test needs two worker nodes. If fewer than two eligible nodes remain, the job exits gracefully.
5. **Shape detection:** The selected node’s `node.kubernetes.io/instance-type` label determines which ConfigMap manifest to apply.
6. **Job creation:** A Volcano `Job` is created with a launcher (`mpimaster`) and workers (`mpiworker`). The launcher waits for SSH connectivity to the workers before running the NCCL test.
7. **Label updates:** After the run, nodes are labeled with the latest result and timestamp.

If all nodes are excluded (either busy or already tested), the job exits without creating a Volcano job, logging the reason.

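The usage and recency checks in the node selection logic can be approximated by hand when you want to understand why a node was skipped. The sketch below is not the applier script itself. The helper names (`sum_gpu_requests`, `node_gpu_requests`, `is_recently_tested`) are hypothetical, and it assumes the last-run label stores a Unix epoch timestamp in seconds, which this readme does not confirm.

```bash
# Hypothetical helpers for manual inspection; not part of the applier script.

# Sum one GPU request count per line from stdin; blank lines count as 0.
sum_gpu_requests() {
  awk '{ total += $1 } END { print total + 0 }'
}

# Print the total nvidia.com/gpu requests of running pods on one node.
node_gpu_requests() {
  local node="$1"
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=${node},status.phase=Running" \
    -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.resources.requests.nvidia\.com/gpu}{"\n"}{end}{end}' \
    | sum_gpu_requests
}

# Succeed (exit 0) if last_run lies within `window` seconds before `now`.
# Assumption: last_run and now are Unix epoch seconds; window defaults to 24 hours.
is_recently_tested() {
  local last_run="$1" now="$2" window="${3:-86400}"
  [ -n "$last_run" ] && [ $(( now - last_run )) -lt "$window" ]
}
```

To see the label values these checks rely on, `kubectl get nodes -l nvidia.com/gpu=true -L oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` lists the GPU nodes with the last-run label as an extra column. Under the assumptions above, `node_gpu_requests <node-name>` prints `0` for an idle node, and `is_recently_tested "$last_run" "$(date +%s)"` succeeds for a node that would be skipped.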
> [!NOTE]
> The test jobs run as low-priority jobs and can be evicted if higher-priority workloads are pending. This ensures that health checks do not interfere with production workloads.

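To confirm which priority a test pod actually received, you can read the priority fields from the pod spec. This is a minimal sketch, assuming the test pods run in the `monitoring` namespace. The function name `show_pod_priorities` is hypothetical, and this readme does not name the priority class the manifests use.

```bash
# Hypothetical helper: list each pod's priority class and resolved priority value.
show_pod_priorities() {
  local namespace="${1:-monitoring}"
  kubectl get pods -n "$namespace" \
    -o custom-columns='NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PRIORITY:.spec.priority'
}
```

Calling `show_pod_priorities` with no argument inspects `monitoring`; pass another namespace as the first argument if you deployed elsewhere.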
## Usage
The manifest assumes there's a namespace called `monitoring`. If you want to deploy to another namespace, edit the manifest accordingly.

1. **Apply manifests**
```bash
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
```

2. **Run ad-hoc test**
```bash
# Create a one-off Job from the CronJob template; the name ends in the current epoch time
kubectl create job -n monitoring --from=cronjob/active-health-checks-nccl-tests-applier test-$(date +%s)

# Replace <timestamp> with the suffix of the Job created above
kubectl logs -n monitoring job/test-<timestamp>
```

3. **Watch Volcano job**
```bash
kubectl get pods -n monitoring -l volcano.sh/job-name=<job-name>

kubectl logs -n monitoring <launcher-pod>
```

4. **Clean up**
```bash
# Deletes every Job in the namespace that carries a job-name label
kubectl delete job -n monitoring -l job-name

kubectl delete cronjob active-health-checks-nccl-tests-applier -n monitoring
```
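The ad-hoc test step generates the Job name with `date +%s` and then expects the same value to be substituted into the `kubectl logs` command. One way to avoid copying the timestamp by hand is to store the name in a variable first. This is only a sketch: `run_adhoc_check` is a hypothetical helper name, and the 600-second timeout is an arbitrary assumption.

```bash
# Hypothetical helper: create an ad-hoc applier Job, wait for it, and print its logs.
run_adhoc_check() {
  local namespace="${1:-monitoring}"
  local job_name="test-$(date +%s)"

  kubectl create job -n "$namespace" \
    --from=cronjob/active-health-checks-nccl-tests-applier "$job_name" || return 1

  # Block until the applier Job completes or the timeout expires.
  kubectl wait -n "$namespace" --for=condition=complete \
    --timeout=600s "job/${job_name}" || echo "job did not complete in time" >&2

  # Print the logs either way, since they include the reason when no test is scheduled.
  kubectl logs -n "$namespace" "job/${job_name}"
}
```

Run `run_adhoc_check` for the default `monitoring` namespace, or pass a different namespace as the first argument.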
