Commit e569510

Merge pull request #48 from oracle-quickstart/25.3.1
Add v25.3.1
2 parents 6bac725 + 7a7fbf2 commit e569510

24 files changed, with 8701 additions and 7351 deletions

README.md

Lines changed: 8 additions & 5 deletions
@@ -4,7 +4,7 @@ Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a fully-managed, scalable
 Please visit the [OKE documentation page](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengoverview.htm) for more information.
 
 ### Supported Operating Systems
-For the Nvidia A100 and H100 shapes (BM.GPU.H100.8, BM.GPU.A100-v2.8, BM.GPU4.8, BM.GPU.B4.8) and AMD MI300x shape (BM.GPU.MI300X.8), Ubuntu 22.04 is supported.
+- Ubuntu 22.04
 
 ### Required policies
 The OCI Resource Manager stack template uses the [Self Managed Nodes](https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contengworkingwithselfmanagednodes.htm) functionality of OKE.
@@ -22,18 +22,21 @@ You can use the below images for both CPU and GPU pools.
 > [!NOTE]
 > The GPU image has the GPU drivers pre-installed.
 
-#### Image to import and use for the H100 and A100 nodes
+#### Images to use
 You can use the instructions [here](https://docs.oracle.com/en-us/iaas/Content/Compute/Tasks/imageimportexport.htm#Importing) for importing the below image to your tenancy.
 
 **Images for NVIDIA shapes**
 
-- [GPU driver 560 & CUDA 12.6](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-560-CUDA-12.6-2025-03-05.01)
+- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-570-CUDA-12.8-2025.03.26-0)
+
+- [GPU driver 560 & CUDA 12.6](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-560-CUDA-12.6-2025.03.26-0)
+
+- [GPU driver 550 & CUDA 12.4](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-550-CUDA-12.4-2025.03.26-0)
 
-- [GPU driver 570 & CUDA 12.8](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-GPU-570-CUDA-12.8-2025-03-06.01)
 
 **Image for AMD shapes**
 
-- [ROCm 6.3](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-ROCM-632-2025-03-05.01)
+- [ROCm 6.3](https://objectstorage.ca-montreal-1.oraclecloud.com/p/ts6fjAuj7hY4io5x_jfX3fyC70HRCG8-9gOFqAjuF0KE0s-6tgDZkbRRZIbMZmoN/n/hpc_limited_availability/b/images/o/Canonical-Ubuntu-22.04-2024.10.04-0-OCA-OFED-24.10-1.1.4.0-AMD-ROCM-632-2025.03.26-0)
 
 
 ### Deploy the cluster using the Oracle Cloud Resource Manager template

docs/running-gpu-rdma-healtchecks-with-node-problem-detector.md

Lines changed: 4 additions & 0 deletions
@@ -10,11 +10,15 @@ Please note depending on the shape and its configuration, some health checks wil
 | GpuEcc | Checks for GPU ECC errors |
 | GpuRowRemap | Checks for GPU Row Remapping Errors |
 | GpuBus | Checks if any GPU has fallen off the bus |
+| GpuPcie | Checks if PCIE has the expected bandwidth |
+| GpuFabricManager | Checks if Fabric Manager is running |
+| GpuBadPages | Checks if any AMD GPU has bad pages |
 | RdmaLink | Checks if RDMA links are up |
 | RdmaLinkFlapping | Checks if there's any RDMA links that are flapping |
 | RdmaWpaAuth | Checks if all RDMA interfaces are authenticated |
 | RdmaRttcc | Checks if RTTCC is disabled on the RDMA interfaces |
 | OcaVersion | Checks if node has the correct Oracle Cloud Agent version |
+| CpuProfile | Checks if the CPU profile is set to performance |
 
 #### Deployment
 You can deploy using the Node Problem Detector Helm chart. The health check scripts are created as a `ConfigMap`, so please make sure you use the `values.yaml` in the link below.
