keps/sig-node/4680-add-resource-health-to-pod-status/README.md
Lines changed: 10 additions & 3 deletions
@@ -69,15 +69,15 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
## Summary

-Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP will fix this by exposing device health via Pod Status. This KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices).
+Today it is hard to impossible to know when the Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP aims to fix this by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices).

## Motivation

Device Plugin and DRA do not have a good failure handling strategy defined. With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively.

-Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing.
+Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing.

-Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device, and be able to respond to this properly.
+People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly.

### Goals

@@ -134,12 +134,19 @@ type ResourceStatus struct {
	// allow to extend this struct in future with the overall health fields or things like Device Plugin version
}

+// ResourceID is calculated based on source of this resource health information.
+// For DevicePlugin:
+// deviceplugin:Device.ID, where Device.ID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73
+// DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node.
+// For DRA:
+// dra:<driver name>[/<pool name>]/<device name>: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster.
type ResourceID string

type ResourceHealth struct {
	// List of conditions with the transition times
	Conditions []ResourceHealthCondition
}
+
// This condition type is replicating other condition types exposed by various status APIs
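
The two `ResourceID` formats described in the added comments can be illustrated with a small Go sketch. This is not code from the KEP or from the kubelet: the helper names `DevicePluginResourceID`, `DRAResourceID`, and `Source`, the prefix constants, and the sample driver, pool, and device names are all hypothetical and only follow the string layouts given in the comment text. The sketch deliberately does not split a DRA ID back into its segments, on the assumption that a pool name may itself contain `/`, which would make such a split ambiguous.

```go
package main

import (
	"fmt"
	"strings"
)

// ResourceID mirrors the type declared in the diff above.
type ResourceID string

// Hypothetical constants; the prefixes are taken from the comment text only.
const (
	devicePluginPrefix = "deviceplugin:"
	draPrefix          = "dra:"
)

// DevicePluginResourceID builds "deviceplugin:<Device.ID>".
// Hypothetical helper, not part of the KEP.
func DevicePluginResourceID(deviceID string) ResourceID {
	return ResourceID(devicePluginPrefix + deviceID)
}

// DRAResourceID builds "dra:<driver name>[/<pool name>]/<device name>".
// An empty pool omits the optional pool segment.
// Hypothetical helper, not part of the KEP.
func DRAResourceID(driver, pool, device string) ResourceID {
	if pool == "" {
		return ResourceID(draPrefix + driver + "/" + device)
	}
	return ResourceID(draPrefix + driver + "/" + pool + "/" + device)
}

// Source reports which mechanism an ID claims to come from,
// based purely on its prefix.
func Source(id ResourceID) string {
	s := string(id)
	switch {
	case strings.HasPrefix(s, devicePluginPrefix):
		return "DevicePlugin"
	case strings.HasPrefix(s, draPrefix):
		return "DRA"
	default:
		return "unknown"
	}
}

func main() {
	ids := []ResourceID{
		DevicePluginResourceID("GPU-1234"),
		DRAResourceID("gpu.example.com", "", "gpu-0"),
		DRAResourceID("gpu.example.com", "node-a", "gpu-0"),
	}
	for _, id := range ids {
		fmt.Printf("%s (source: %s)\n", id, Source(id))
	}
}
```

A consumer of Pod Status could use a prefix check like `Source` to decide where to look up further details: the Device Plugin on the node, or the information published by the DRA driver.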