# Installing OKE Node Problem Detector (NPD) DaemonSet with OCI GPU Scanner Service

OKE NPD is an extension of https://github.com/kubernetes/node-problem-detector that processes GPU health check failures reported by the GPU Scanner service and sets conditions on the affected nodes. This feature enables proactive monitoring of GPU node health and early detection of issues.

|
5 | | -## Deployment |
| 5 | +## Install |
6 | 6 |
These actions should only be performed after a successful OCI GPU Scanner service installation of the control plane and data plane (plugin) components on the individual GPU nodes.

❗❗**IMPORTANT**: NPD will only start processing GPU health check events when the OCI GPU Scanner Service data plane plugin is actively running.

Label the target GPU nodes so they can host the NPD DaemonSet. Only run this command on the GPU nodes that you would like to enable the NPD feature on.

```bash
kubectl label node <nodeIP e.g 10.0.65.72> oci.oraclecloud.com/oke-node-problem-detector-enabled="true"
```
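
This label is what the NPD DaemonSet is expected to match on so that its pods only land on opted-in GPU nodes. A minimal sketch of the relevant pod-spec fragment, assuming the manifest uses a plain `nodeSelector` (the actual manifest in `oke-node-problem-detector.yaml` is authoritative and may use node affinity instead):

```yaml
# Sketch only: the real selector lives in oke-node-problem-detector.yaml.
spec:
  template:
    spec:
      nodeSelector:
        oci.oraclecloud.com/oke-node-problem-detector-enabled: "true"
```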
Install the NPD DaemonSet.

```bash
kubectl apply -f https://raw.githubusercontent.com/oracle-quickstart/oci-gpu-scanner/main/existing_cluster_deploy/oke-node-problem-detector.yaml
```

Verify that the NPD DaemonSet has been installed successfully and is running.

```bash
kubectl get pods -l app=oke-node-problem-detector -o wide -n kube-system
```

Results should show ```oke-node-problem-detector``` pods in the Running state on all targeted GPU nodes.
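
Since NPD reports problems as node conditions, you can also inspect the conditions on a labeled node directly. This is a generic kubectl query, not an OCI-specific command; the exact condition types NPD sets depend on the GPU Scanner health checks in use:

```bash
# List every condition on a node (type, status, reason).
# Replace the node name with one of your labeled GPU nodes.
kubectl get node <nodeIP e.g 10.0.65.72> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
```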

## Uninstall

To remove the NPD from an OKE cluster, run the command below.

```bash
kubectl delete -f https://raw.githubusercontent.com/oracle-quickstart/oci-gpu-scanner/main/existing_cluster_deploy/oke-node-problem-detector.yaml
```