Skip to content

Commit a7f2c57

Browse files
committed
updated NPD deploy
1 parent 98389bf commit a7f2c57

File tree

1 file changed

+18
-15
lines changed

1 file changed

+18
-15
lines changed

OKE_NPD_DEPLOY.md

Lines changed: 18 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,35 +1,38 @@
1-
# Deploying OKE Node Problem Detector( NPD) with OCI GPU Scanner Service
1+
# Installing OKE Node Problem Detector (NPD) DaemonSet with OCI GPU Scanner Service
22

3-
OKE NPD is an extension of https://github.com/kubernetes/node-problem-detector that looks for GPU health check failures created by GPU Scanner service and tags/creates conditions on the failed node. This feature will allow you to only schedule GPU workloads on a healthy node and avoid running into application deployment issues.
3+
OKE NPD is an extension of https://github.com/kubernetes/node-problem-detector that processes GPU health check failures reported by GPU Scanner service and sets conditions on the affected nodes. This feature enables proactive monitoring of GPU node health and early detection of issues.
44

5-
## Deployment
5+
## Install
66

7-
These actions should only be performed after successful OCI GPU scanner service installation of control plane and data plane (plugin) components on individual GPU nodes.
7+
These actions should only be performed after a successful OCI GPU scanner service installation of control plane and data plane (plugin) components on individual GPU nodes.
88

9-
❗❗**IMPORTANT**: Only deploy the NPD on the OKE cluster that has GPU compute resources added as node pools. NPD will only start processing health check events when OCI GPU Scanner Service data plane plugin is actively running. Both are tightly integrated
9+
❗❗**IMPORTANT**: NPD will only start processing GPU health check events when OCI GPU Scanner Service data plane plugin is actively running.
10+
11+
Label the target GPU nodes so they can host the NPD DaemonSet.
12+
13+
Only run this command on the GPU nodes that you would like to run the NPD feature on.
1014

1115
```bash
12-
kubectl apply -f https://github.com/oracle-quickstart/oci-gpu-scanner/blob/main/existing_cluster_deploy/oke-node-problem-detector.yaml
16+
kubectl label node <nodeIP e.g 10.0.65.72> oci.oraclecloud.com/oke-node-problem-detector-enabled="true"
1317
```
14-
Verify that NPD has been installed successfully and running.
18+
19+
Install the NPD DaemonSet.
1520

1621
```bash
17-
kubectl get pods -n kube-system
22+
kubectl apply -f https://github.com/oracle-quickstart/oci-gpu-scanner/blob/main/existing_cluster_deploy/oke-node-problem-detector.yaml
1823
```
19-
Results should show ```oke-node-problem-detector``` in running state.
20-
21-
## Activation per node
2224

23-
This step is required to state the NPD to run on these GPU node. Only run this command on the GPU nodes that you would like to run the NPD feature on.
25+
Verify that NPD DaemonSet has been installed successfully and running.
2426

2527
```bash
26-
kubectl label node <nodeIP e.g 10.0.65.72> oci.oraclecloud.com/oke-node-problem-detector-enabled="true
28+
kubectl get pods -l app=oke-node-problem-detector -o wide -n kube-system
2729
```
28-
This step will activate the NPD on each of these nodes.
30+
31+
Results should show ```oke-node-problem-detector``` in running state for all targeted GPU nodes.
2932

3033
## Uninstall
3134

32-
To remove teh NPD for an OKE cluster run the below command
35+
To remove the NPD from an OKE cluster, run the below command
3336

3437
```bash
3538
kubectl delete -f https://github.com/oracle-quickstart/oci-gpu-scanner/blob/main/existing_cluster_deploy/oke-node-problem-detector.yaml

0 commit comments

Comments
 (0)