
Commit 781eaba

Merge pull request #85 from oracle-quickstart/25.10.0
Add v25.10.0
2 parents 41474c5 + 46dd16f commit 781eaba

30 files changed: +2096 -854 lines changed

docs/deploying-monitoring-stack-manually.md

Lines changed: 827 additions & 0 deletions
Large diffs are not rendered by default.

docs/running-active-health-checks.md

Lines changed: 64 additions & 12 deletions
@@ -9,11 +9,13 @@ Active health checks provide automated, periodic validation of GPU & RDMA functi
 
 ### Available Health Check Types
 
-Three types of active health checks are available:
+Five types of active health checks are available:
 
-1. **NCCL Tests** - Multi-node GPU communication tests using NVIDIA NCCL
-2. **GPU Fryer** - Single-node GPU stress testing
-3. **DCGM Diagnostics** - Host-level GPU diagnostics using NVIDIA DCGM
+1. **NCCL Tests** - Multi-node GPU communication tests using NVIDIA NCCL (NVIDIA GPUs)
+2. **RCCL Tests** - Multi-node GPU communication tests using AMD RCCL (AMD GPUs)
+3. **GPU Fryer** - Single-node GPU stress testing (NVIDIA GPUs)
+4. **RVS** - Single-node GPU validation using ROCm Validation Suite (AMD GPUs)
+5. **DCGM Diagnostics** - Host-level GPU diagnostics using NVIDIA DCGM (NVIDIA GPUs)
 
 ### How It Works
 
@@ -27,8 +29,8 @@ Each health check runs as a CronJob that:
 
 - OKE cluster with GPU nodes
 - kubectl access with cluster-admin privileges
-- Kueue installed (for NCCL tests)
-- MPI Operator installed (for NCCL tests)
+- Kueue installed
+- MPI Operator installed (for NCCL and RCCL tests)
 - Monitoring namespace (or permission to create it)
 
 ## Architecture
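With the prerequisites now covering both NVIDIA and AMD tooling, a quick pre-flight check can confirm them before the install steps later in this doc. This is a minimal sketch; it assumes Kueue was installed into `kueue-system` as shown in Step 1 below, and that the kubeflow MPI Operator manifest created its default `mpi-operator` namespace (adjust the namespaces if your install differs):

```bash
# Controllers required for the NCCL/RCCL multi-node tests
kubectl get pods -n kueue-system
kubectl get pods -n mpi-operator

# GPU nodes the checks will target (labels referenced later in this doc)
kubectl get nodes -l nvidia.com/gpu=true
kubectl get nodes -l amd.com/gpu=true
```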
@@ -51,7 +53,9 @@ Each health check applies two labels to tested nodes:
 | Health Check | Pass/Fail Label | Timestamp Label |
 |--------------|----------------|-----------------|
 | NCCL Tests | `oke.oraclecloud.com/active-health-checks-nccl-tests` | `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` |
+| RCCL Tests | `oke.oraclecloud.com/active-health-checks-rccl-tests` | `oke.oraclecloud.com/active-health-checks-rccl-tests-last-run` |
 | GPU Fryer | `oke.oraclecloud.com/active-health-checks-gpu-fryer` | `oke.oraclecloud.com/active-health-checks-gpu-fryer-last-run` |
+| RVS | `oke.oraclecloud.com/active-health-checks-rvs` | `oke.oraclecloud.com/active-health-checks-rvs-last-run` |
 | DCGM Diagnostics | `oke.oraclecloud.com/active-health-checks-dcgm-diag` | `oke.oraclecloud.com/active-health-checks-dcgm-diag-last-run` |
 
 Label values:
@@ -60,7 +64,7 @@ Label values:
 
 ## RBAC Permissions
 
-All three health checks use the same RBAC configuration:
+All five health checks use the same RBAC configuration:
 
 - **ServiceAccount**: `active-health-checks-runner` (in `monitoring` namespace)
 - **ClusterRole**: `active-health-checks-runner-role`
@@ -77,21 +81,28 @@ The RBAC permissions allow the health check jobs to:
 Install Kueue and MPI Operator (required for NCCL tests):
 
 ```bash
-helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.1" --create-namespace --namespace=kueue-system
+helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.2" --create-namespace --namespace=kueue-system
 
 kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
 ```
 
 ### Step 2: Deploy Active Health Checks
 
-Deploy all three health check CronJobs:
+Deploy all health check CronJobs:
 
+**For NVIDIA GPU clusters:**
 ```bash
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
 ```
 
+**For AMD GPU clusters:**
+```bash
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml
+```
+
 ### Step 3: Verify Deployment
 
 Check that the CronJobs have been created:
@@ -100,7 +111,7 @@ Check that the CronJobs have been created:
 kubectl get cronjobs -n monitoring
 ```
 
-**Example output:**
+**Example output (NVIDIA GPU clusters):**
 
 ```
 NAME                                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
@@ -109,18 +120,29 @@ active-health-checks-gpu-fryer-applier 0 * * * * False 0 <non
 active-health-checks-nccl-tests-applier   0 * * * *   False     0        <none>          10s
 ```
 
+**Example output (AMD GPU clusters):**
+
+```
+NAME                                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
+active-health-checks-rccl-tests-applier   0 * * * *   False     0        <none>          10s
+active-health-checks-rvs-applier          0 * * * *   False     0        <none>          10s
+```
+
 ## Node Selection Logic
 
 All health checks follow this selection process:
 
-1. **Find GPU Nodes**: Query nodes with `nvidia.com/gpu=true` label
+1. **Find GPU Nodes**: Query nodes with appropriate GPU label
+   - NVIDIA tests: `nvidia.com/gpu=true` label
+   - AMD tests: `amd.com/gpu=true` label
 2. **Check Idle Status**: Calculate GPU usage from pod requests
    - Only nodes with 0 GPU allocation are considered
 3. **Check Last Run**: Parse `*-last-run` timestamp label
    - Skip nodes tested today (same UTC date)
 4. **Select Nodes**:
-   - NCCL: Pick 2+ nodes of same shape
+   - NCCL/RCCL: Pick 2+ nodes of same shape
    - GPU Fryer: Pick 1 node
+   - RVS: Pick 1 node
    - DCGM: Pick 1 node
 
 This ensures:
@@ -140,18 +162,29 @@ kubectl get node <node-name> --show-labels | grep active-health-checks
 
 View all nodes with their health check labels:
 
+**For NVIDIA GPU nodes:**
 ```bash
 kubectl get nodes -o custom-columns=NAME:.metadata.name,NCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-nccl-tests,GPU_FRYER:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-gpu-fryer,DCGM:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-dcgm-diag
 ```
 
+**For AMD GPU nodes:**
+```bash
+kubectl get nodes -o custom-columns=NAME:.metadata.name,RCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rccl-tests,RVS:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rvs
+```
+
 ### Identify Failed Nodes
 
 List nodes that have failed any health check:
 
 ```bash
+# NVIDIA GPU nodes
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-nccl-tests=fail -o wide
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-gpu-fryer=fail -o wide
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-dcgm-diag=fail -o wide
+
+# AMD GPU nodes
+kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rccl-tests=fail -o wide
+kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rvs=fail -o wide
 ```
 
 ### View Health Check Job Logs
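The body of this section is unchanged and therefore not shown in the diff. For quick reference, pulling logs from the applier jobs comes down to standard kubectl commands like the sketch below; the job name placeholder is hypothetical and should be taken from the `kubectl get jobs` output:

```bash
# Applier jobs are created in the monitoring namespace by the CronJobs above
kubectl get jobs -n monitoring

# Stream logs from a specific job once it has started
kubectl logs -n monitoring job/<job-name> --follow
```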
@@ -172,15 +205,25 @@ To manually trigger a health check outside the regular schedule:
 
 ```bash
 # Create a one-off job from the CronJob
+# NVIDIA GPU tests
 kubectl create job -n monitoring manual-nccl-test --from=cronjob/active-health-checks-nccl-tests-applier
 kubectl create job -n monitoring manual-fryer-test --from=cronjob/active-health-checks-gpu-fryer-applier
 kubectl create job -n monitoring manual-dcgm-test --from=cronjob/active-health-checks-dcgm-diag-applier
+
+# AMD GPU tests
+kubectl create job -n monitoring manual-rccl-test --from=cronjob/active-health-checks-rccl-tests-applier
+kubectl create job -n monitoring manual-rvs-test --from=cronjob/active-health-checks-rvs-applier
 ```
 
 To run a test immediately on a specific node, you can temporarily modify the node labels to remove the last-run timestamp:
 
 ```bash
+# For NVIDIA nodes
 kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-nccl-tests-last-run-
+
+# For AMD nodes
+kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rccl-tests-last-run-
+kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rvs-last-run-
 ```
 
 The next CronJob execution will then select this node for testing.
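After a manual trigger, it is worth confirming that a job was actually created and that the node picked up a fresh result; a short follow-up sketch reusing the job names and label commands from this doc:

```bash
# Watch the manually created jobs until they complete
kubectl get jobs -n monitoring -w

# Once finished, check the node's updated pass/fail and last-run labels
kubectl get node <node-name> --show-labels | grep active-health-checks
```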
@@ -202,7 +245,9 @@ By default, health checks run every hour (`0 * * * *`). To modify the schedule:
 
 Each health check manifest can be customized with different parameters:
 - **NCCL Tests**: Number of nodes, GPU count, NCCL parameters
+- **RCCL Tests**: Number of nodes, GPU count, RCCL parameters
 - **GPU Fryer**: Stress duration, temperature thresholds
+- **RVS**: Test recipe, iterations, timeout, validation tests
 - **DCGM Diagnostics**: Diagnostic level, specific tests to run
 
 Download and modify the manifests locally before applying them for custom configurations.
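The download-and-modify workflow mentioned above is just fetch, edit, apply; a sketch using the NCCL manifest URL from this doc (which parameters to change depends on the check being tuned):

```bash
# Fetch a manifest locally, adjust its parameters, then apply the local copy
curl -LO https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
# ...edit active-health-checks-nccl-tests.yaml (schedule, node count, NCCL parameters)...
kubectl apply -f active-health-checks-nccl-tests.yaml
```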
@@ -223,11 +268,18 @@ kubectl patch cronjob active-health-checks-nccl-tests-applier -n monitoring -p '
 
 To remove active health checks:
 
+**For NVIDIA GPU clusters:**
 ```bash
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
 ```
 
+**For AMD GPU clusters:**
+```bash
+kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
+kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml
+```
+
 > [!NOTE]
 > Node labels applied by health checks will remain after uninstalling. To remove them, manually delete the labels from each node.
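Per the note above, a bulk cleanup of leftover labels is possible; a sketch that loops over the label keys from the Architecture table (the `-l "$label"` selector restricts the change to nodes that actually carry the label):

```bash
# Remove health check result and last-run labels from any node that still has them
for label in \
  oke.oraclecloud.com/active-health-checks-nccl-tests \
  oke.oraclecloud.com/active-health-checks-rccl-tests \
  oke.oraclecloud.com/active-health-checks-gpu-fryer \
  oke.oraclecloud.com/active-health-checks-rvs \
  oke.oraclecloud.com/active-health-checks-dcgm-diag; do
  kubectl label nodes -l "$label" "${label}-"
  kubectl label nodes -l "${label}-last-run" "${label}-last-run-"
done
```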

manifests/active-health-checks/active-health-checks-dcgm-diag.yaml

Lines changed: 3 additions & 3 deletions
@@ -2,8 +2,8 @@
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
-  name: active-health-checks-low
-value: -1000
+  name: active-health-checks-dcgm-diag-low
+value: -100000
 globalDefault: false
 preemptionPolicy: PreemptLowerPriority
 description: "Very low priority for active health check jobs to be preempted by others"
@@ -139,7 +139,7 @@ spec:
       labels:
         app: dcgm-diag-test
     spec:
-      priorityClassName: active-health-checks-low
+      priorityClassName: active-health-checks-dcgm-diag-low
       restartPolicy: Never
       nodeSelector:
         kubernetes.io/hostname: $test_node

manifests/active-health-checks/active-health-checks-gpu-fryer.yaml

Lines changed: 3 additions & 3 deletions
@@ -2,8 +2,8 @@
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
-  name: active-health-checks-low
-value: -1000
+  name: active-health-checks-gpu-fryer-low
+value: -100000
 globalDefault: false
 preemptionPolicy: PreemptLowerPriority
 description: "Very low priority for active health check jobs to be preempted by others"
@@ -138,7 +138,7 @@ spec:
       labels:
         app: gpu-fryer-test
     spec:
-      priorityClassName: active-health-checks-low
+      priorityClassName: active-health-checks-gpu-fryer-low
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: $test_node

manifests/active-health-checks/active-health-checks-nccl-tests.yaml

Lines changed: 6 additions & 6 deletions
@@ -2,8 +2,8 @@
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
-  name: active-health-checks-low
-value: -1000
+  name: active-health-checks-nccl-tests-low
+value: -100000
 globalDefault: false
 preemptionPolicy: PreemptLowerPriority
 description: "Very low priority for active health check jobs to be preempted by others"
@@ -111,7 +111,7 @@ data:
       labels:
         nccl-test-replica: mpi-launcher
     spec:
-      priorityClassName: active-health-checks-low
+      priorityClassName: active-health-checks-nccl-tests-low
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      # NODE_AFFINITY_PLACEHOLDER
@@ -184,7 +184,7 @@ data:
       labels:
         nccl-test-replica: mpi-worker
     spec:
-      priorityClassName: active-health-checks-low
+      priorityClassName: active-health-checks-nccl-tests-low
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
@@ -286,7 +286,7 @@ data:
       labels:
         nccl-test-replica: mpi-launcher
     spec:
-      priorityClassName: active-health-checks-low
+      priorityClassName: active-health-checks-nccl-tests-low
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      # NODE_AFFINITY_PLACEHOLDER
@@ -355,7 +355,7 @@ data:
       labels:
         nccl-test-replica: mpi-worker
     spec:
-      priorityClassName: active-health-checks-low
+      priorityClassName: active-health-checks-nccl-tests-low
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
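One side effect of the PriorityClass rename in these three manifests: applying the updated files creates the new per-check classes, but it does not remove the previously shared `active-health-checks-low` object, since `kubectl apply` without pruning never deletes cluster-scoped leftovers. A hedged cleanup sketch, assuming nothing else in the cluster still references the old class:

```bash
# The old shared class may linger after upgrading the health check manifests
kubectl get priorityclass | grep active-health-checks

# Delete it only once no pods or manifests reference it any more
kubectl delete priorityclass active-health-checks-low
```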
