README.md: 8 additions & 5 deletions
@@ -4,7 +4,7 @@ Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a fully-managed, scalable
 Please visit the [OKE documentation page](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) for more information.
 
 ### Supported Operating Systems
-For the Nvidia A100 and H100 shapes (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8) and AMD MI300x shape (BM.GPU.MI300X.8), Ubuntu 22.04 is supported.
+- Ubuntu 22.04
 
 ### Required policies
 The OCI Resource Manager stack template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.
@@ -22,18 +22,21 @@ You can use the below images for both CPU and GPU pools.
 > [!NOTE]
 > The GPU image has the GPU drivers pre-installed.
 
-#### Image to import and use for the H100 and A100 nodes
+#### Images to use
 You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) for importing the below image to your tenancy.
 
 **Images for NVIDIA shapes**
 
-- [GPU driver 560 & CUDA 12.6](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-560-CUDA-12.6-2025-03-05.01)
+- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-570-CUDA-12.8-2025.03.26-0)
+
+- [GPU driver 560 & CUDA 12.6](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-560-CUDA-12.6-2025.03.26-0)
+
+- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.03.26-0)
 
-- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-570-CUDA-12.8-2025-03-06.01)
docs/running-gpu-rdma-healtchecks-with-node-problem-detector.md: 4 additions & 0 deletions
@@ -10,11 +10,15 @@ Please note depending on the shape and its configuration, some health checks wil
 | GpuEcc | Checks for GPU ECC errors |
 | GpuRowRemap | Checks for GPU Row Remapping Errors |
 | GpuBus | Checks if any GPU has fallen off the bus |
+| GpuPcie | Checks if PCIE has the expected bandwidth |
+| GpuFabricManager | Checks if Fabric Manager is running |
+| GpuBadPages | Checks if any AMD GPU has bad pages |
 | RdmaLink | Checks if RDMA links are up |
 | RdmaLinkFlapping | Checks if there's any RDMA links that are flapping |
 | RdmaWpaAuth | Checks if all RDMA interfaces are authenticated |
 | RdmaRttcc | Checks if RTTCC is disabled on the RDMA interfaces |
 | OcaVersion | Checks if node has the correct Oracle Cloud Agent version |
+| CpuProfile | Checks if the CPU profile is set to performance |
 
 #### Deployment
 You can deploy using the Node Problem Detector Helm chart. The health check scripts are created as a `ConfigMap`, so please make sure you use the `values.yaml` in the link below.
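
To illustrate what a check such as the newly added `CpuProfile` could look like, here is a minimal Python sketch. It is not the script shipped in this repository. It assumes that "CPU profile" refers to the Linux cpufreq scaling governor exposed under `/sys/devices/system/cpu`, and it follows the Node Problem Detector custom-plugin convention of exit code 0 for OK, 1 for a detected problem, and any other code for unknown. The function names (`evaluate_governors`, `read_governors`, `main`) are hypothetical.

```python
from pathlib import Path

# Exit codes follow the Node Problem Detector custom-plugin convention:
# 0 = OK, 1 = problem detected, anything else = unknown.
OK, NON_OK, UNKNOWN = 0, 1, 2

# Assumption: the "CPU profile" is the cpufreq scaling governor of each CPU.
GOVERNOR_GLOB = "cpu[0-9]*/cpufreq/scaling_governor"


def evaluate_governors(governors, expected="performance"):
    """Return (exit_code, message) for a list of per-CPU governor names."""
    if not governors:
        return UNKNOWN, "no cpufreq governors found"
    # Collect the distinct governors that differ from the expected one.
    wrong = sorted({g for g in governors if g != expected})
    if wrong:
        return NON_OK, f"CPU profile is {', '.join(wrong)}, expected {expected}"
    return OK, f"CPU profile is set to {expected}"


def read_governors(root="/sys/devices/system/cpu"):
    """Read the governor of every CPU that exposes cpufreq under root."""
    return [p.read_text().strip() for p in Path(root).glob(GOVERNOR_GLOB)]


def main():
    code, message = evaluate_governors(read_governors())
    print(message)  # the plugin's stdout is used as the status message
    return code
```

A plugin entry point would end with `sys.exit(main())` so that the detector sees the exit code. Keeping the decision logic in `evaluate_governors`, separate from the sysfs read, makes it testable on machines that have no cpufreq support.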