Skip to content

Commit 42dc7d6

Browse files
addressed some comments
1 parent 4cd5312 commit 42dc7d6

File tree

1 file changed

+18
-5
lines changed
  • keps/sig-node/4680-add-resource-health-to-pod-status

1 file changed

+18
-5
lines changed

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 18 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -125,7 +125,11 @@ AllocatedResourcesStatus ResourcesStatus
125125
type ResourcesStatus map[ResourceName]ResourceStatus
126126
127127
type ResourceStatus struct {
128-
// map of unique Resource ID to its status
128+
// map of unique Resource ID to its status with the following restrictions:
129+
// - ResourceID must uniquely identify the Resource allocated to the Pod on the Node for the lifetime of a Pod.
130+
// - ResourceID may not make sense outside the Pod. Often it will identify the resource on the Node, but not guaranteed.
131+
// - ResourceID of one Pod must not be compared with the ResourceID of other Pod.
132+
// In case of Device Plugin ResourceID maps to DeviceID.
129133
Resources map[ResourceID] ResourceStatus
130134
131135
// allow to extend this struct in future with the overall health fields or things like Device Plugin version
@@ -136,18 +140,25 @@ type ResourceID string
136140
type ResourceStatus struct {
137141
// can be one of:
138142
// - Healthy: operates as normal
139-
// - Unhealthy: reported unhealthy. We consider this a temporary health issue since we do not mechanism today to distinguish temporary and permanent issues.
140-
// - Unknown: The status cannot be determined. For example, Device Plugin got unregistered and hasn't been re-registered since.
143+
// - Unhealthy: reported unhealthy. We consider this a temporary health issue
144+
// since we do not mechanism today to distinguish
145+
// temporary and permanent issues.
146+
// - Unknown: The status cannot be determined.
147+
// For example, Device Plugin got unregistered and hasn't been re-registered since.
148+
//
149+
// In future we may want to introduce the PermanentlyUnhealthy Status.
141150
Status string
142151
}
143152
```
144153

145154
***Is there any guarantee that the AllocatedResourcesStatus will be updated before Container crashed and unscheduled?***
146155

147-
No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already.
156+
No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already (if `restartPolicy=Never`, in other cases Pod will enter crash loop backoff).
148157

149158
The proposal is to update the Pod Status with the device status even if the pod has been marked as Failed already, but still known to the kubelet.
150159

160+
This use case is important to explore so this status may inform the retry policy introduced by the KEP: [Retriable and non-retriable Pod failures for Jobs](https://github.com/kubernetes/enhancements/issues/3329).
161+
151162

152163
***Do we need the CheckDeviceHealth call introduced to the Device Plugin to work around the limitation above?***
153164

@@ -200,10 +211,12 @@ We should consider introducing another field to the Status that will be a free f
200211
Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
201212

202213
type `NamedResourcesInstance` (from [pkg/apis/resource/namedresources.go](https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/apis/resource/namedresources.go#L29-L37)) to include the Health field the same way it is done in
203-
the Device Plugin.
214+
the Device Plugin as well as a device ID.
204215

205216
Kubelet will react on this field the same way as we propose to do it for the Device Plugin.
206217

218+
Specific implementation details will be added for the beta.
219+
207220
### Test Plan
208221

209222
[X] I/we understand the owners of the involved components may require updates to

0 commit comments

Comments
 (0)