Commit 9acdc74

added device ID clarification
1 parent a803b0b commit 9acdc74

1 file changed

+10
-3
lines changed

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 10 additions & 3 deletions
@@ -69,15 +69,15 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-Today it is difficult to know when a Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP will fix this by exposing device health via Pod Status. This KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices).
+Today it is hard to impossible to know when the Pod is using a device that has failed or is temporarily unhealthy. This makes troubleshooting of Pod crashes hard or impossible. This KEP aims to fix this by exposing device health via Pod Status. The KEP is intentionally scoped small, but can be extended later to expose more device information to troubleshoot Pod devices placement issues (for example, validating that related Pods are allocated on connected devices).
 
 ## Motivation
 
 Device Plugin and DRA do not have a good failure handling strategy defined. With proliferation of workloads using devices (like GPU), variable quality of devices, and overcommitting of data centers on power, there are cases when devices can fail temporarily or permanently and k8s need to handle this natively.
 
-Today, the typical design is for jobs consuming a failing device to fail with a specific error code whenever possible. For long running workloads, K8s will keep restarting the workload without reallocating it on a different device. So the container will be in crash loop backoff with limited information on why it is crashing.
+Today, the typical design is for jobs consuming a failing device to fail itself with the specific error code whenever possible. For the inference of long running workloads, k8s will keep restarting the workload without reallocating it on a different device. So container will be in crash loop backoff with limited information on why it is crashing.
 
-Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device, and be able to respond to this properly.
+People develop strategies to deal with such situations. Exposing unhealthy devices in Pod Status will provide a generic way to understand that the failure is related to the unhealthy device and be able to respond to this properly.
 
 ### Goals
 
@@ -134,12 +134,19 @@ type ResourceStatus struct {
 	// allow to extend this struct in future with the overall health fields or things like Device Plugin version
 }
 
+// ResourceID is calculated based on source of this resource health information.
+// For DevicePlugin:
+// deviceplugin:Device.ID, where Device.ID is from the Device structure of DevicePlugin's ListAndWatchResponse type: https://github.com/kubernetes/kubernetes/blob/eda1c780543a27c078450e2f17d674471e00f494/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1alpha/api.proto#L61-L73
+// DevicePlugin ID is usually a constant for the lifetime of a Node and typically can be used to uniquely identify the device on the node.
+// For DRA:
+// dra:<driver name>[/<pool name>]/<device name>: such a device can be looked up in the information published by that DRA driver to learn more about it. It is designed to be globally unique in a cluster.
 type ResourceID string
 
 type ResourceHealth struct {
 	// List of conditions with the transition times
 	Conditions []ResourceHealthCondition
 }
+
 // This condition type is replicating other condition types exposed by various status APIs
 type ResourceHealthCondition struct {
 	// can be one of:
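
The `ResourceID` comment added in this commit describes two string layouts, `deviceplugin:Device.ID` and `dra:<driver name>[/<pool name>]/<device name>`. The Go sketch below is one possible reading of those layouts and is not code from the KEP. The helper names (`devicePluginID`, `draID`, `source`) and the sample values (`GPU-0`, `gpu.example.com`, `node-a`, `gpu-0`) are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ResourceID mirrors the string type declared in the diff.
type ResourceID string

// devicePluginID is a hypothetical helper that builds "deviceplugin:" + Device.ID.
func devicePluginID(deviceID string) ResourceID {
	return ResourceID("deviceplugin:" + deviceID)
}

// draID is a hypothetical helper that builds
// "dra:<driver name>[/<pool name>]/<device name>".
// An empty pool omits the optional pool segment.
func draID(driver, pool, device string) ResourceID {
	parts := []string{driver}
	if pool != "" {
		parts = append(parts, pool)
	}
	parts = append(parts, device)
	return ResourceID("dra:" + strings.Join(parts, "/"))
}

// source returns the text before the first ':' (for example "deviceplugin"
// or "dra"). The boolean is false when the ID contains no ':' at all.
func source(id ResourceID) (string, bool) {
	src, _, found := strings.Cut(string(id), ":")
	return src, found
}

func main() {
	ids := []ResourceID{
		devicePluginID("GPU-0"),
		draID("gpu.example.com", "", "gpu-0"),
		draID("gpu.example.com", "node-a", "gpu-0"),
	}
	for _, id := range ids {
		src, _ := source(id)
		fmt.Printf("%s (source: %s)\n", id, src)
	}
}
```

The sketch only splits on the first `:` to recover the source. It does not try to split the DRA portion back into driver, pool, and device, since the optional pool segment makes that ambiguous without knowing which drivers and pools exist.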
