`keps/sig-node/4680-add-resource-health-to-pod-status/README.md` (26 additions, 9 deletions)
````diff
@@ -113,7 +113,7 @@ One field reflects the resource requests and limits and the other actual allocat
 This structure will contain standard resources as well as extended resources. As noted in the comment: https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713, it is only logical to also include the status of those allocated resources.
 
-The proposal is to keep this structure as-is to simplify parsing of the well-known ResourceList data type by various consumers. A typical scenario would be to compare whether the AllocatedResources match the desired state.
+The proposal is to keep this structure as-is to simplify parsing of the well-known ResourceList data type by various consumers. A typical scenario would be to compare whether the `AllocatedResources` match the desired state.
 
 type ResourcesStatus map[ResourceName]ResourceStatus
 
 type ResourceStatus struct {
-	// map of unique Resource ID to its status with the following restrictions:
-	// - ResourceID must uniquely identify the Resource allocated to the Pod on the Node for the lifetime of a Pod.
-	// - ResourceID may not make sense outside the Pod. Often it will identify the resource on the Node, but that is not guaranteed.
-	// - ResourceID of one Pod must not be compared with the ResourceID of another Pod.
-	// In case of Device Plugin, ResourceID maps to DeviceID.
-	Resources map[ResourceID]ResourceStatus
+	// map of unique Resource ID to its health.
+	// At a minimum, ResourceID must uniquely identify the Resource
+	// allocated to the Pod on the Node for the lifetime of a Pod.
+	// See the ResourceID type for its definition.
+	Resources map[ResourceID]ResourceHealth
 
 	// allow to extend this struct in the future with overall health fields or things like Device Plugin version
 }
 
 type ResourceID string
 
-type ResourceStatus struct {
+type ResourceHealth struct {
+	// List of conditions with the transition times
+	Conditions []ResourceHealthCondition
+}
+
+// This condition type replicates other condition types exposed by various status APIs
+type ResourceHealthCondition struct {
 	// can be one of:
 	// - Healthy: operates as normal
 	// - Unhealthy: reported unhealthy. We consider this a temporary health issue
@@ -147,7 +151,19 @@ type ResourceStatus struct {
 	// For example, the Device Plugin got unregistered and hasn't been re-registered since.
 	//
 	// In the future we may want to introduce the PermanentlyUnhealthy Status.
-	Status string
+	Type string
+
+	// Status of the condition, one of True, False, Unknown.
+	Status ConditionStatus
+
+	// The last time the condition transitioned from one status to another.
+	// +optional
+	LastTransitionTime metav1.Time
+
+	// The reason for the condition's last transition.
+	// +optional
+	Reason string
+
+	// A human readable message indicating details about the transition.
+	// +optional
+	Message string
 }
 ```
````
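To make the proposed shape concrete, the sketch below shows how a consumer might walk `ResourcesStatus` and decide whether any allocated resource is unhealthy. The type definitions are local, simplified stand-ins for the proposed API (here `ConditionStatus` and `metav1.Time` are replaced with plain `string` and `time.Time` so the example is self-contained); it is an illustration, not the final Kubernetes types.

```go
package main

import (
	"fmt"
	"time"
)

// Local stand-ins for the proposed API types (hypothetical, simplified).
type ResourceID string

type ResourceHealthCondition struct {
	Type               string // e.g. "Unhealthy"
	Status             string // "True", "False", or "Unknown"
	LastTransitionTime time.Time
	Reason             string
	Message            string
}

type ResourceHealth struct {
	Conditions []ResourceHealthCondition
}

type ResourcesStatus map[ResourceID]ResourceHealth

// isHealthy reports whether no allocated resource carries an
// "Unhealthy" condition with Status "True".
func isHealthy(rs ResourcesStatus) bool {
	for _, health := range rs {
		for _, cond := range health.Conditions {
			if cond.Type == "Unhealthy" && cond.Status == "True" {
				return false
			}
		}
	}
	return true
}

func main() {
	// A sample status with one unhealthy device (IDs are made up).
	rs := ResourcesStatus{
		"dev-0": {Conditions: []ResourceHealthCondition{{
			Type:               "Unhealthy",
			Status:             "True",
			LastTransitionTime: time.Now(),
			Reason:             "DevicePluginUnregistered",
			Message:            "Device Plugin got unregistered and hasn't been re-registered since",
		}}},
	}
	fmt.Println(isHealthy(rs)) // prints "false"
}
```

A controller comparing this against the desired state would typically trigger remediation (e.g. pod deletion) only when the relevant condition is `Unhealthy=True`, not on `Unknown`.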
```diff
@@ -206,6 +222,7 @@ One improvement will be needed is to distinguish unhealthy devices (marked unhea
 NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than a simple “Unhealthy”.
 
 We should consider introducing another field to the Status that will carry free-form error information, as a future improvement.
+
 ### DRA implementation details
 
 Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
```