@@ -9,11 +9,13 @@ Active health checks provide automated, periodic validation of GPU & RDMA functi
 
 ### Available Health Check Types
 
-Three types of active health checks are available:
+Five types of active health checks are available:
 
-1. **NCCL Tests** - Multi-node GPU communication tests using NVIDIA NCCL
-2. **GPU Fryer** - Single-node GPU stress testing
-3. **DCGM Diagnostics** - Host-level GPU diagnostics using NVIDIA DCGM
+1. **NCCL Tests** - Multi-node GPU communication tests using NVIDIA NCCL (NVIDIA GPUs)
+2. **RCCL Tests** - Multi-node GPU communication tests using AMD RCCL (AMD GPUs)
+3. **GPU Fryer** - Single-node GPU stress testing (NVIDIA GPUs)
+4. **RVS** - Single-node GPU validation using ROCm Validation Suite (AMD GPUs)
+5. **DCGM Diagnostics** - Host-level GPU diagnostics using NVIDIA DCGM (NVIDIA GPUs)
 
 ### How It Works
 
@@ -27,8 +29,8 @@ Each health check runs as a CronJob that:
 
 - OKE cluster with GPU nodes
 - kubectl access with cluster-admin privileges
-- Kueue installed (for NCCL tests)
-- MPI Operator installed (for NCCL tests)
+- Kueue installed
+- MPI Operator installed (for NCCL and RCCL tests)
 - Monitoring namespace (or permission to create it)
 
 ## Architecture
@@ -51,7 +53,9 @@ Each health check applies two labels to tested nodes:
 | Health Check | Pass/Fail Label | Timestamp Label |
 |--------------|----------------|-----------------|
 | NCCL Tests | `oke.oraclecloud.com/active-health-checks-nccl-tests` | `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` |
+| RCCL Tests | `oke.oraclecloud.com/active-health-checks-rccl-tests` | `oke.oraclecloud.com/active-health-checks-rccl-tests-last-run` |
 | GPU Fryer | `oke.oraclecloud.com/active-health-checks-gpu-fryer` | `oke.oraclecloud.com/active-health-checks-gpu-fryer-last-run` |
+| RVS | `oke.oraclecloud.com/active-health-checks-rvs` | `oke.oraclecloud.com/active-health-checks-rvs-last-run` |
 | DCGM Diagnostics | `oke.oraclecloud.com/active-health-checks-dcgm-diag` | `oke.oraclecloud.com/active-health-checks-dcgm-diag-last-run` |
 
 Label values:
@@ -60,7 +64,7 @@ Label values:
 
 ## RBAC Permissions
 
-All three health checks use the same RBAC configuration:
+All five health checks use the same RBAC configuration:
 
 - **ServiceAccount**: `active-health-checks-runner` (in `monitoring` namespace)
 - **ClusterRole**: `active-health-checks-runner-role`
@@ -77,21 +81,28 @@ The RBAC permissions allow the health check jobs to:
 Install Kueue and MPI Operator (required for NCCL tests):
 
 ```bash
-helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.1" --create-namespace --namespace=kueue-system
+helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.2" --create-namespace --namespace=kueue-system
 
 kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
 ```
 
 ### Step 2: Deploy Active Health Checks
 
-Deploy all three health check CronJobs:
+Deploy all health check CronJobs:
 
+**For NVIDIA GPU clusters:**
 ```bash
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
 ```
 
+**For AMD GPU clusters:**
+```bash
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml
+```
+
 ### Step 3: Verify Deployment
 
 Check that the CronJobs have been created:
@@ -100,7 +111,7 @@ Check that the CronJobs have been created:
 kubectl get cronjobs -n monitoring
 ```
 
-**Example output:**
+**Example output (NVIDIA GPU clusters):**
 
 ```
 NAME                                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
@@ -109,18 +120,29 @@ active-health-checks-gpu-fryer-applier 0 * * * * False 0 <non
 active-health-checks-nccl-tests-applier   0 * * * *   False     0        <none>          10s
 ```
 
+**Example output (AMD GPU clusters):**
+
+```
+NAME                                      SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
+active-health-checks-rccl-tests-applier   0 * * * *   False     0        <none>          10s
+active-health-checks-rvs-applier          0 * * * *   False     0        <none>          10s
+```
+
 ## Node Selection Logic
 
 All health checks follow this selection process:
 
-1. **Find GPU Nodes**: Query nodes with `nvidia.com/gpu=true` label
+1. **Find GPU Nodes**: Query nodes with appropriate GPU label
+   - NVIDIA tests: `nvidia.com/gpu=true` label
+   - AMD tests: `amd.com/gpu=true` label
 2. **Check Idle Status**: Calculate GPU usage from pod requests
    - Only nodes with 0 GPU allocation are considered
 3. **Check Last Run**: Parse `*-last-run` timestamp label
    - Skip nodes tested today (same UTC date)
 4. **Select Nodes**:
-   - NCCL: Pick 2+ nodes of same shape
+   - NCCL/RCCL: Pick 2+ nodes of same shape
    - GPU Fryer: Pick 1 node
+   - RVS: Pick 1 node
    - DCGM: Pick 1 node
 
 This ensures:
@@ -140,18 +162,29 @@ kubectl get node <node-name> --show-labels | grep active-health-checks
 
 View all nodes with their health check labels:
 
+**For NVIDIA GPU nodes:**
 ```bash
 kubectl get nodes -o custom-columns=NAME:.metadata.name,NCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-nccl-tests,GPU_FRYER:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-gpu-fryer,DCGM:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-dcgm-diag
 ```
 
+**For AMD GPU nodes:**
+```bash
+kubectl get nodes -o custom-columns=NAME:.metadata.name,RCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rccl-tests,RVS:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rvs
+```
+
 ### Identify Failed Nodes
 
 List nodes that have failed any health check:
 
 ```bash
+# NVIDIA GPU nodes
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-nccl-tests=fail -o wide
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-gpu-fryer=fail -o wide
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-dcgm-diag=fail -o wide
+
+# AMD GPU nodes
+kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rccl-tests=fail -o wide
+kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rvs=fail -o wide
 ```
 
 ### View Health Check Job Logs
@@ -172,15 +205,25 @@ To manually trigger a health check outside the regular schedule:
 
 ```bash
 # Create a one-off job from the CronJob
+# NVIDIA GPU tests
 kubectl create job -n monitoring manual-nccl-test --from=cronjob/active-health-checks-nccl-tests-applier
 kubectl create job -n monitoring manual-fryer-test --from=cronjob/active-health-checks-gpu-fryer-applier
 kubectl create job -n monitoring manual-dcgm-test --from=cronjob/active-health-checks-dcgm-diag-applier
+
+# AMD GPU tests
+kubectl create job -n monitoring manual-rccl-test --from=cronjob/active-health-checks-rccl-tests-applier
+kubectl create job -n monitoring manual-rvs-test --from=cronjob/active-health-checks-rvs-applier
 ```
 
 To run a test immediately on a specific node, you can temporarily modify the node labels to remove the last-run timestamp:
 
 ```bash
+# For NVIDIA nodes
 kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-nccl-tests-last-run-
+
+# For AMD nodes
+kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rccl-tests-last-run-
+kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rvs-last-run-
 ```
 
 The next CronJob execution will then select this node for testing.
@@ -202,7 +245,9 @@ By default, health checks run every hour (`0 * * * *`). To modify the schedule:
 
 Each health check manifest can be customized with different parameters:
 - **NCCL Tests**: Number of nodes, GPU count, NCCL parameters
+- **RCCL Tests**: Number of nodes, GPU count, RCCL parameters
 - **GPU Fryer**: Stress duration, temperature thresholds
+- **RVS**: Test recipe, iterations, timeout, validation tests
 - **DCGM Diagnostics**: Diagnostic level, specific tests to run
 
 Download and modify the manifests locally before applying them for custom configurations.
@@ -223,11 +268,18 @@ kubectl patch cronjob active-health-checks-nccl-tests-applier -n monitoring -p '
 
 To remove active health checks:
 
+**For NVIDIA GPU clusters:**
 ```bash
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
 ```
 
+**For AMD GPU clusters:**
+```bash
+kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
+kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml
+```
+
 > [!NOTE]
 > Node labels applied by health checks will remain after uninstalling. To remove them, manually delete the labels from each node.
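
Reviewer note on the final hunk: the closing note says labels must be deleted manually after uninstalling, and this change adds four more label keys (RCCL and RVS). The sketch below is one possible way to script that cleanup; it is not part of this change. The function names and the `KUBECTL` override are hypothetical, the label keys are copied from the label table in this diff, and it assumes `kubectl label nodes --all <key>-` is acceptable for your cluster (it touches every node, not only GPU nodes).

```bash
#!/usr/bin/env bash
# Hypothetical cleanup helper (assumption: not shipped with the manifests).

# Prefix and check names come from the label table in the README.
LABEL_PREFIX="oke.oraclecloud.com/active-health-checks"
CHECKS="nccl-tests rccl-tests gpu-fryer rvs dcgm-diag"

# Print every pass/fail and timestamp label key, one per line.
health_check_label_keys() {
  local check
  for check in $CHECKS; do
    echo "${LABEL_PREFIX}-${check}"
    echo "${LABEL_PREFIX}-${check}-last-run"
  done
}

# Remove each label from all nodes. A trailing "-" after the key tells
# kubectl to delete that label. KUBECTL is left unquoted on purpose so
# that a value such as "echo kubectl" splits into a command and argument.
remove_health_check_labels() {
  local key
  health_check_label_keys | while read -r key; do
    ${KUBECTL:-kubectl} label nodes --all "${key}-"
  done
}
```

Running `KUBECTL="echo kubectl" remove_health_check_labels` first prints the ten commands without contacting the cluster, which may be a sensible check before calling `remove_health_check_labels` for real.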