Commit a803b0b

addressed some comments
1 parent f3a88db commit a803b0b

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 26 additions & 9 deletions
@@ -113,7 +113,7 @@ One field reflects the resource requests and limits and the other actual allocat
 
 This structure will contain standard resources as well as extended resources. As noted in the comment: https://github.com/kubernetes/kubernetes/pull/124227#issuecomment-2130503713, it is only logical to also include the status of those allocated resources.
 
-The proposal is to keep this structure as-is to simplify parsing of well-known ResourceList data type by various consumers. Typical scenario would be to compare if the AllocatedResources match the desired state.
+The proposal is to keep this structure as-is to simplify parsing of well-known ResourceList data type by various consumers. Typical scenario would be to compare if the `AllocatedResources` match the desired state.
 
 The proposal is to introduce an additional field:
 
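The "compare if the `AllocatedResources` match the desired state" scenario in this hunk could look roughly like the sketch below. It is only an illustration: `ResourceList` here is a simplified stand-in that maps names to `int64` (the real Kubernetes type maps to `resource.Quantity`), and `matchesDesired` is a hypothetical helper name, not part of the KEP.

```go
package main

import "fmt"

// ResourceName and ResourceList are simplified stand-ins (assumption):
// the real ResourceList maps names to resource.Quantity values.
type ResourceName string
type ResourceList map[ResourceName]int64

// matchesDesired is a hypothetical helper that reports whether the
// allocated resources are exactly the desired ones.
func matchesDesired(desired, allocated ResourceList) bool {
	if len(desired) != len(allocated) {
		return false
	}
	for name, want := range desired {
		got, ok := allocated[name]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	desired := ResourceList{"cpu": 2000, "example.com/gpu": 1}
	allocated := ResourceList{"cpu": 2000, "example.com/gpu": 1}
	fmt.Println("allocation matches desired state:", matchesDesired(desired, allocated))
}
```

Because the structure stays a plain map, a consumer needs nothing beyond a key-by-key comparison, which seems to be the simplification the proposal is after.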
@@ -125,19 +125,23 @@ AllocatedResourcesStatus ResourcesStatus
 type ResourcesStatus map[ResourceName]ResourceStatus
 
 type ResourceStatus struct {
-    // map of unique Resource ID to its status with the following restrictions:
-    // - ResourceID must uniquely identify the Resource allocated to the Pod on the Node for the lifetime of a Pod.
-    // - ResourceID may not make sense outside the Pod. Often it will identify the resource on the Node, but not guaranteed.
-    // - ResourceID of one Pod must not be compared with the ResourceID of other Pod.
-    // In case of Device Plugin ResourceID maps to DeviceID.
-    Resources map[ResourceID] ResourceStatus
+    // map of unique Resource ID to its health.
+    // At a minimum, ResourceID must uniquely identify the Resource
+    // allocated to the Pod on the Node for the lifetime of a Pod.
+    // See ResourceID type for its definition.
+    Resources map[ResourceID] ResourceHealth
 
     // allow to extend this struct in future with the overall health fields or things like Device Plugin version
 }
 
 type ResourceID string
 
-type ResourceStatus struct {
+type ResourceHealth struct {
+    // List of conditions with the transition times
+    Conditions []ResourceHealthCondition
+}
+// This condition type is replicating other condition types exposed by various status APIs
+type ResourceHealthCondition struct {
     // can be one of:
     // - Healthy: operates as normal
     // - Unhealthy: reported unhealthy. We consider this a temporary health issue
@@ -147,7 +151,19 @@ type ResourceStatus struct {
     // For example, Device Plugin got unregistered and hasn't been re-registered since.
     //
     // In future we may want to introduce the PermanentlyUnhealthy Status.
-    Status string
+    Type string
+
+    // Status of the condition, one of True, False, Unknown.
+    Status ConditionStatus
+    // The last time the condition transitioned from one status to another.
+    // +optional
+    LastTransitionTime metav1.Time
+    // The reason for the condition's last transition.
+    // +optional
+    Reason string
+    // A human readable message indicating details about the transition.
+    // +optional
+    Message string
 }
 ```
 
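To make the reshaped types easier to follow, here is a minimal, self-contained sketch of how a consumer might read them after this commit. Several things are assumptions rather than part of the diff: `time.Time` stands in for `metav1.Time`, the `ConditionStatus` constants are written out locally, the helper names `isHealthy` and `unhealthyIDs` are hypothetical, and the rule "a resource is healthy when it has a `Healthy` condition with status `True`" is one possible reading of the condition semantics, which the diff does not spell out.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type ResourceName string
type ResourceID string

// ConditionStatus mirrors the usual True/False/Unknown values (declared locally here).
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

type ResourceHealthCondition struct {
	Type               string
	Status             ConditionStatus
	LastTransitionTime time.Time // assumption: stands in for metav1.Time
	Reason             string
	Message            string
}

type ResourceHealth struct {
	Conditions []ResourceHealthCondition
}

type ResourceStatus struct {
	Resources map[ResourceID]ResourceHealth
}

type ResourcesStatus map[ResourceName]ResourceStatus

// isHealthy assumes a resource is healthy only if it carries a
// "Healthy" condition whose status is True (hypothetical interpretation).
func isHealthy(h ResourceHealth) bool {
	for _, c := range h.Conditions {
		if c.Type == "Healthy" && c.Status == ConditionTrue {
			return true
		}
	}
	return false
}

// unhealthyIDs returns the sorted IDs of resources under name that are not healthy.
func unhealthyIDs(s ResourcesStatus, name ResourceName) []ResourceID {
	var ids []ResourceID
	for id, h := range s[name].Resources {
		if !isHealthy(h) {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	status := ResourcesStatus{
		"example.com/gpu": {Resources: map[ResourceID]ResourceHealth{
			"gpu-0": {Conditions: []ResourceHealthCondition{
				{Type: "Healthy", Status: ConditionTrue, LastTransitionTime: time.Now()},
			}},
			"gpu-1": {Conditions: []ResourceHealthCondition{
				{Type: "Unhealthy", Status: ConditionTrue, Reason: "DeviceError",
					Message: "example message", LastTransitionTime: time.Now()},
			}},
		}},
	}
	fmt.Println("unhealthy:", unhealthyIDs(status, "example.com/gpu"))
}
```

Moving from a single `Status string` to a list of conditions gives each resource room for a transition time, a reason, and a message, matching the shape of other Kubernetes status conditions.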
@@ -206,6 +222,7 @@ One improvement will be needed is to distinguish unhealthy devices (marked unhea
 NVIDIA device plugin has the checkHealth implementation: https://github.com/NVIDIA/k8s-device-plugin/blob/eb3a709b1dd82280d5acfb85e1e942024ddfcdc6/internal/rm/health.go#L39 that has more information than simple “Unhealthy”.
 
 We should consider introducing another field to the Status that will be a free form error information as a future improvement.
+
 ### DRA implementation details
 
 Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
