diff --git a/docs/deploying-monitoring-stack-manually.md b/docs/deploying-monitoring-stack-manually.md
Lines changed: 827 additions & 0 deletions
diff --git a/docs/running-active-health-checks.md b/docs/running-active-health-checks.md
Lines changed: 64 additions & 12 deletions
diff --git a/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml b/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
Lines changed: 3 additions & 3 deletions
diff --git a/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml b/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
Lines changed: 3 additions & 3 deletions
diff --git a/manifests/active-health-checks/active-health-checks-nccl-tests.yaml b/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
Lines changed: 6 additions & 6 deletions
@@ -9,11 +9,13 @@ Active health checks provide automated, periodic validation of GPU & RDMA functi
 
 ### Available Health Check Types
 
-Three types of active health checks are available:
+Five types of active health checks are available:
 
-1. **NCCL Tests** - Multi-node GPU communication tests using NVIDIA NCCL
-2. **GPU Fryer** - Single-node GPU stress testing
-3. **DCGM Diagnostics** - Host-level GPU diagnostics using NVIDIA DCGM
+1. **NCCL Tests** - Multi-node GPU communication tests using NVIDIA NCCL (NVIDIA GPUs)
+2. **RCCL Tests** - Multi-node GPU communication tests using AMD RCCL (AMD GPUs)
+3. **GPU Fryer** - Single-node GPU stress testing (NVIDIA GPUs)
+4. **RVS** - Single-node GPU validation using ROCm Validation Suite (AMD GPUs)
+5. **DCGM Diagnostics** - Host-level GPU diagnostics using NVIDIA DCGM (NVIDIA GPUs)
 
 ### How It Works
 
@@ -27,8 +29,8 @@ Each health check runs as a CronJob that:
 
 - OKE cluster with GPU nodes
 - kubectl access with cluster-admin privileges
-- Kueue installed (for NCCL tests)
-- MPI Operator installed (for NCCL tests)
+- Kueue installed
+- MPI Operator installed (for NCCL and RCCL tests)
 - Monitoring namespace (or permission to create it)
 
 ## Architecture
@@ -51,7 +53,9 @@ Each health check applies two labels to tested nodes:
 | Health Check | Pass/Fail Label | Timestamp Label |
 |--------------|----------------|-----------------|
 | NCCL Tests | `oke.oraclecloud.com/active-health-checks-nccl-tests` | `oke.oraclecloud.com/active-health-checks-nccl-tests-last-run` |
+| RCCL Tests | `oke.oraclecloud.com/active-health-checks-rccl-tests` | `oke.oraclecloud.com/active-health-checks-rccl-tests-last-run` |
 | GPU Fryer | `oke.oraclecloud.com/active-health-checks-gpu-fryer` | `oke.oraclecloud.com/active-health-checks-gpu-fryer-last-run` |
+| RVS | `oke.oraclecloud.com/active-health-checks-rvs` | `oke.oraclecloud.com/active-health-checks-rvs-last-run` |
 | DCGM Diagnostics | `oke.oraclecloud.com/active-health-checks-dcgm-diag` | `oke.oraclecloud.com/active-health-checks-dcgm-diag-last-run` |
 
 Label values:
@@ -60,7 +64,7 @@ Label values:
 
 ## RBAC Permissions
 
-All three health checks use the same RBAC configuration:
+All five health checks use the same RBAC configuration:
 
 - **ServiceAccount**: `active-health-checks-runner` (in `monitoring` namespace)
 - **ClusterRole**: `active-health-checks-runner-role`
@@ -77,21 +81,28 @@ The RBAC permissions allow the health check jobs to:
 Install Kueue and MPI Operator (required for NCCL tests):
 
 ```bash
-helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.1" --create-namespace --namespace=kueue-system
+helm install kueue oci://registry.k8s.io/kueue/charts/kueue --version="0.14.2" --create-namespace --namespace=kueue-system
 
 kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
 ```
 
 ### Step 2: Deploy Active Health Checks
 
-Deploy all three health check CronJobs:
+Deploy all health check CronJobs:
 
+**For NVIDIA GPU clusters:**
 ```bash
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
 kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
 ```
 
+**For AMD GPU clusters:**
+```bash
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
+kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml
+```
+
 ### Step 3: Verify Deployment
 
 Check that the CronJobs have been created:
@@ -100,7 +111,7 @@ Check that the CronJobs have been created:
 kubectl get cronjobs -n monitoring
 ```
 
-**Example output:**
+**Example output (NVIDIA GPU clusters):**
 
 ```
 NAME                                       SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
@@ -109,18 +120,29 @@ active-health-checks-gpu-fryer-applier     0 * * * *     False     0        <non
 active-health-checks-nccl-tests-applier    0 * * * *     False     0        <none>          10s
 ```
 
+**Example output (AMD GPU clusters):**
+
+```
+NAME                                       SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
+active-health-checks-rccl-tests-applier    0 * * * *     False     0        <none>          10s
+active-health-checks-rvs-applier           0 * * * *     False     0        <none>          10s
+```
+
 ## Node Selection Logic
 
 All health checks follow this selection process:
 
-1. **Find GPU Nodes**: Query nodes with `nvidia.com/gpu=true` label
+1. **Find GPU Nodes**: Query nodes with the appropriate GPU label
+   - NVIDIA tests: `nvidia.com/gpu=true` label
+   - AMD tests: `amd.com/gpu=true` label
 2. **Check Idle Status**: Calculate GPU usage from pod requests
    - Only nodes with 0 GPU allocation are considered
 3. **Check Last Run**: Parse `*-last-run` timestamp label
    - Skip nodes tested today (same UTC date)
 4. **Select Nodes**:
-   - NCCL: Pick 2+ nodes of same shape
+   - NCCL/RCCL: Pick 2+ nodes of same shape
    - GPU Fryer: Pick 1 node
+   - RVS: Pick 1 node
    - DCGM: Pick 1 node
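
Step 3 of the selection process (skip nodes already tested on the current UTC date) can be sketched as a small shell helper. This is only an illustration: `tested_today` is a hypothetical name, and the sketch assumes the `*-last-run` label holds a Unix epoch timestamp in seconds, which this diff does not confirm. The real manifests may store the value in a different format.

```bash
# Hypothetical helper, not taken from the manifests.
# Assumption: the *-last-run label value is a Unix epoch timestamp (seconds).
# Usage: tested_today <last-run-epoch> [now-epoch]
tested_today() {
  last_run="$1"
  now="${2:-$(date -u +%s)}"
  # A node with no last-run value has never been tested.
  [ -n "$last_run" ] || return 1
  # Integer division by 86400 gives the UTC day number of each timestamp.
  [ $(( last_run / 86400 )) -eq $(( now / 86400 )) ]
}
```

Under these assumptions, a node would be skipped when `tested_today "$value"` succeeds and considered for selection otherwise. The optional second argument only exists so the comparison can be exercised with a fixed "now".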
 
 This ensures:
@@ -140,18 +162,29 @@ kubectl get node <node-name> --show-labels | grep active-health-checks
 
 View all nodes with their health check labels:
 
+**For NVIDIA GPU nodes:**
 ```bash
 kubectl get nodes -o custom-columns=NAME:.metadata.name,NCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-nccl-tests,GPU_FRYER:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-gpu-fryer,DCGM:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-dcgm-diag
 ```
 
+**For AMD GPU nodes:**
+```bash
+kubectl get nodes -o custom-columns=NAME:.metadata.name,RCCL:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rccl-tests,RVS:.metadata.labels.oke\.oraclecloud\.com/active-health-checks-rvs
+```
+
 ### Identify Failed Nodes
 
 List nodes that have failed any health check:
 
 ```bash
+# NVIDIA GPU nodes
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-nccl-tests=fail -o wide
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-gpu-fryer=fail -o wide
 kubectl get nodes -l oke.oraclecloud.com/active-health-checks-dcgm-diag=fail -o wide
+
+# AMD GPU nodes
+kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rccl-tests=fail -o wide
+kubectl get nodes -l oke.oraclecloud.com/active-health-checks-rvs=fail -o wide
 ```
 
 ### View Health Check Job Logs
@@ -172,15 +205,25 @@ To manually trigger a health check outside the regular schedule:
 
 ```bash
 # Create a one-off job from the CronJob
+# NVIDIA GPU tests
 kubectl create job -n monitoring manual-nccl-test --from=cronjob/active-health-checks-nccl-tests-applier
 kubectl create job -n monitoring manual-fryer-test --from=cronjob/active-health-checks-gpu-fryer-applier
 kubectl create job -n monitoring manual-dcgm-test --from=cronjob/active-health-checks-dcgm-diag-applier
+
+# AMD GPU tests
+kubectl create job -n monitoring manual-rccl-test --from=cronjob/active-health-checks-rccl-tests-applier
+kubectl create job -n monitoring manual-rvs-test --from=cronjob/active-health-checks-rvs-applier
 ```
 
 To run a test immediately on a specific node, you can temporarily modify the node labels to remove the last-run timestamp:
 
 ```bash
+# For NVIDIA nodes
 kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-nccl-tests-last-run-
+
+# For AMD nodes
+kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rccl-tests-last-run-
+kubectl label node <node-name> oke.oraclecloud.com/active-health-checks-rvs-last-run-
 ```
 
 The next CronJob execution will then select this node for testing.
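
When several timestamps need to be cleared on one node, the removal arguments can be generated instead of typed out. A minimal sketch, assuming the label keys follow the pattern shown in the labels table; `last_run_removals` is a hypothetical helper name, and it only prints the `key-` arguments that would be passed to `kubectl label`:

```bash
# Hypothetical helper: print one "<label-key>-" removal argument per check name
# (for example nccl-tests, rccl-tests, gpu-fryer, rvs, dcgm-diag).
last_run_removals() {
  for check in "$@"; do
    printf 'oke.oraclecloud.com/active-health-checks-%s-last-run-\n' "$check"
  done
}

# Example (needs cluster access, so it is left commented out):
# kubectl label node <node-name> $(last_run_removals rccl-tests rvs)
```

Printing the arguments rather than calling `kubectl` inside the function makes it easy to inspect what would be removed before running the command.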
@@ -202,7 +245,9 @@ By default, health checks run every hour (`0 * * * *`). To modify the schedule:
 
 Each health check manifest can be customized with different parameters:
 - **NCCL Tests**: Number of nodes, GPU count, NCCL parameters
+- **RCCL Tests**: Number of nodes, GPU count, RCCL parameters
 - **GPU Fryer**: Stress duration, temperature thresholds
+- **RVS**: Test recipe, iterations, timeout, validation tests
 - **DCGM Diagnostics**: Diagnostic level, specific tests to run
 
 Download and modify the manifests locally before applying them for custom configurations.
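
One way to follow the download-and-modify workflow is sketched here. The URL pattern is copied from the deploy commands in this document; `manifest_url` is a hypothetical helper name, and the choice of `curl` and `$EDITOR` is an assumption rather than something the repository prescribes.

```bash
# Hypothetical helper: build the raw manifest URL for a check name
# (nccl-tests, rccl-tests, gpu-fryer, rvs, dcgm-diag).
manifest_url() {
  printf 'https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-%s.yaml\n' "$1"
}

# Example workflow (needs network and cluster access, so it is left commented out):
# curl -fsSLO "$(manifest_url rvs)"
# "${EDITOR:-vi}" active-health-checks-rvs.yaml
# kubectl apply -f active-health-checks-rvs.yaml
```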
@@ -223,11 +268,18 @@ kubectl patch cronjob active-health-checks-nccl-tests-applier -n monitoring -p '
 
 To remove active health checks:
 
+**For NVIDIA GPU clusters:**
 ```bash
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-nccl-tests.yaml
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-gpu-fryer.yaml
 kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-dcgm-diag.yaml
 ```
 
+**For AMD GPU clusters:**
+```bash
+kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rccl-tests.yaml
+kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-hpc-oke/refs/heads/main/manifests/active-health-checks/active-health-checks-rvs.yaml
+```
+
 > [!NOTE]
 > Node labels applied by health checks will remain after uninstalling. To remove them, manually delete the labels from each node.
@@ -2,8 +2,8 @@
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
-  name: active-health-checks-low
-value: -1000
+  name: active-health-checks-dcgm-diag-low
+value: -100000
 globalDefault: false
 preemptionPolicy: PreemptLowerPriority
 description: "Very low priority for active health check jobs to be preempted by others"
@@ -139,7 +139,7 @@ spec:
                     labels:
                       app: dcgm-diag-test
                   spec:
-                    priorityClassName: active-health-checks-low
+                    priorityClassName: active-health-checks-dcgm-diag-low
                     restartPolicy: Never
                     nodeSelector:
                       kubernetes.io/hostname: $test_node
 
@@ -2,8 +2,8 @@
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
-  name: active-health-checks-low
-value: -1000
+  name: active-health-checks-gpu-fryer-low
+value: -100000
 globalDefault: false
 preemptionPolicy: PreemptLowerPriority
 description: "Very low priority for active health check jobs to be preempted by others"
@@ -138,7 +138,7 @@ spec:
                     labels:
                       app: gpu-fryer-test
                   spec:
-                    priorityClassName: active-health-checks-low
+                    priorityClassName: active-health-checks-gpu-fryer-low
                     restartPolicy: Never
                     nodeSelector:
                       kubernetes.io/hostname: $test_node
 
@@ -2,8 +2,8 @@
 apiVersion: scheduling.k8s.io/v1
 kind: PriorityClass
 metadata:
-  name: active-health-checks-low
-value: -1000
+  name: active-health-checks-nccl-tests-low
+value: -100000
 globalDefault: false
 preemptionPolicy: PreemptLowerPriority
 description: "Very low priority for active health check jobs to be preempted by others"
@@ -111,7 +111,7 @@ data:
               labels:
                 nccl-test-replica: mpi-launcher
             spec:
-              priorityClassName: active-health-checks-low
+              priorityClassName: active-health-checks-nccl-tests-low
               hostNetwork: true
               dnsPolicy: ClusterFirstWithHostNet
               # NODE_AFFINITY_PLACEHOLDER
@@ -184,7 +184,7 @@ data:
               labels:
                 nccl-test-replica: mpi-worker
             spec:
-              priorityClassName: active-health-checks-low
+              priorityClassName: active-health-checks-nccl-tests-low
               hostNetwork: true
               dnsPolicy: ClusterFirstWithHostNet
               containers:
@@ -286,7 +286,7 @@ data:
               labels:
                 nccl-test-replica: mpi-launcher
             spec:
-              priorityClassName: active-health-checks-low
+              priorityClassName: active-health-checks-nccl-tests-low
               hostNetwork: true
               dnsPolicy: ClusterFirstWithHostNet
               # NODE_AFFINITY_PLACEHOLDER
@@ -355,7 +355,7 @@ data:
               labels:
                 nccl-test-replica: mpi-worker
             spec:
-              priorityClassName: active-health-checks-low
+              priorityClassName: active-health-checks-nccl-tests-low
               hostNetwork: true
               dnsPolicy: ClusterFirstWithHostNet
               containers: