You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
type ResourcesStatus map[ResourceName]ResourceStatus
126
126
127
127
type ResourceStatus struct {
128
-
// map of unique Resource ID to its status
128
+
// map of unique Resource ID to its status with the following restrictions:
129
+
// - ResourceID must uniquely identify the Resource allocated to the Pod on the Node for the lifetime of a Pod.
130
+
// - ResourceID may not make sense outside the Pod. Often it will identify the resource on the Node, but not guaranteed.
131
+
// - ResourceID of one Pod must not be compared with the ResourceID of other Pod.
132
+
// In case of Device Plugin ResourceID maps to DeviceID.
129
133
Resources map[ResourceID] ResourceStatus
130
134
131
135
// allow to extend this struct in future with the overall health fields or things like Device Plugin version
@@ -136,18 +140,25 @@ type ResourceID string
136
140
type ResourceStatus struct {
137
141
// can be one of:
138
142
// - Healthy: operates as normal
139
-
// - Unhealthy: reported unhealthy. We consider this a temporary health issue since we do not mechanism today to distinguish temporary and permanent issues.
140
-
// - Unknown: The status cannot be determined. For example, Device Plugin got unregistered and hasn't been re-registered since.
143
+
// - Unhealthy: reported unhealthy. We consider this a temporary health issue
144
+
// since we do not mechanism today to distinguish
145
+
// temporary and permanent issues.
146
+
// - Unknown: The status cannot be determined.
147
+
// For example, Device Plugin got unregistered and hasn't been re-registered since.
148
+
//
149
+
// In future we may want to introduce the PermanentlyUnhealthy Status.
141
150
Status string
142
151
}
143
152
```
144
153
145
154
***Is there any guarantee that the AllocatedResourcesStatus will be updated before Container crashed and unscheduled?***
146
155
147
-
No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already.
156
+
No, there is no guarantee that the Device Plugin/DRA will detect device going unhealthy earlier than the Pod. Once device got unhealthy, container may crash and being marked as Failed already (if `restartPolicy=Never`, in other cases Pod will enter crash loop backoff).
148
157
149
158
The proposal is to update the Pod Status with the device status even if the pod has been marked as Failed already, but still known to the kubelet.
150
159
160
+
This use case is important to explore so this status may inform the retry policy introduced by the KEP: [Retriable and non-retriable Pod failures for Jobs](https://github.com/kubernetes/enhancements/issues/3329).
161
+
151
162
152
163
***Do we need the CheckDeviceHealth call introduced to the Device Plugin to work around the limitation above?***
153
164
@@ -200,10 +211,12 @@ We should consider introducing another field to the Status that will be a free f
200
211
Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
201
212
202
213
type `NamedResourcesInstance` (from [pkg/apis/resource/namedresources.go](https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/apis/resource/namedresources.go#L29-L37)) to include the Health field the same way it is done in
203
-
the Device Plugin.
214
+
the Device Plugin as well as a device ID.
204
215
205
216
Kubelet will react on this field the same way as we propose to do it for the Device Plugin.
206
217
218
+
Specific implementation details will be added for the beta.
219
+
207
220
### Test Plan
208
221
209
222
[X] I/we understand the owners of the involved components may require updates to
0 commit comments