Commit 45760aa

Promote 4680 to beta
Signed-off-by: Harshal Patil <[email protected]>
1 parent 11b6321 commit 45760aa

3 files changed

+42
-7
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 4680
 alpha:
   approver: "@jpbetz"
+beta:
+  approver: "@jpbetz"

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 38 additions & 5 deletions
@@ -198,6 +198,11 @@ type ResourceHealth struct {
 	//
 	// In future we may want to introduce the PermanentlyUnhealthy Status.
 	Health ResourceHealthStatus `json:"health,omitempty" protobuf:"bytes,2,name=health"`
+	// Message provides additional human-readable context about the health status.
+	// This can include error details, failure reasons, or other diagnostic information.
+	// This field is optional and may be empty for healthy resources.
+	// +optional
+	Message string `json:"message,omitempty" protobuf:"bytes,3,opt,name=message"`
 }
 ```
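Reviewer note: a minimal Go sketch of how the new optional `Message` field would serialize next to `Health`. The types are simplified local copies, not the real `k8s.io/api/core/v1` definitions. The `ResourceID` field, the status constant values, the device ID, and the message text are assumptions made only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourceHealthStatus is a local stand-in for the API's status type.
type ResourceHealthStatus string

// Assumed status values, used here only for illustration.
const (
	ResourceHealthStatusHealthy   ResourceHealthStatus = "Healthy"
	ResourceHealthStatusUnhealthy ResourceHealthStatus = "Unhealthy"
)

// ResourceHealth is a simplified copy of the struct touched by this diff.
// ResourceID is an assumed field; Health and Message follow the diff's JSON tags.
type ResourceHealth struct {
	ResourceID string               `json:"resourceID"`
	Health     ResourceHealthStatus `json:"health,omitempty"`
	Message    string               `json:"message,omitempty"`
}

// encode marshals a ResourceHealth to JSON.
func encode(h ResourceHealth) string {
	b, err := json.Marshal(h)
	if err != nil {
		panic(err) // not expected for a struct of plain strings
	}
	return string(b)
}

func main() {
	// A healthy device: the empty Message is omitted from the output.
	fmt.Println(encode(ResourceHealth{
		ResourceID: "example.com/gpu-0", // hypothetical device ID
		Health:     ResourceHealthStatusHealthy,
	}))
	// An unhealthy device: the driver attaches diagnostic context.
	fmt.Println(encode(ResourceHealth{
		ResourceID: "example.com/gpu-0",
		Health:     ResourceHealthStatusUnhealthy,
		Message:    "ECC error threshold exceeded", // hypothetical message
	}))
}
```

Because the tag uses `omitempty`, clients that predate the field see no change for healthy resources; the key appears only when a plugin or driver sets a message.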

@@ -267,6 +272,20 @@ We may consider this as a future improvement.
 [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118) and the discussion in
 [PR #130606](https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511).

+
+- **Failure Message Field:** The `ResourceHealth` struct includes an optional `Message` field that provides
+additional human-readable context about device health status. This field enables Device Plugins and DRA drivers
+to report detailed error information, failure reasons, and diagnostic information beyond the basic health status.
+This enhancement improves troubleshooting capabilities for device-related failures. See
+[Issue #133202](https://github.com/kubernetes/kubernetes/issues/133202) and
+[PR #134506](https://github.com/kubernetes/kubernetes/pull/134506) for implementation details.
+
+- **Device Health for Terminated Pods:** Kubelet will continue to update the device health status in PodStatus
+even after a Pod has terminated (e.g., in Failed state or CrashLoopBackOff). This is critical for post-mortem
+troubleshooting and enables retry policies (such as those introduced by
+[KEP-3329: Retriable and non-retriable Pod failures for Jobs](https://github.com/kubernetes/enhancements/issues/3329))
+to make informed decisions based on whether the failure was caused by an unhealthy device. See
+[Issue #132978](https://github.com/kubernetes/kubernetes/issues/132978) for more details.
 ### Risks and Mitigations

 There is not many risks of this KEP. The biggest risk is that Device Plugins will not be
@@ -459,8 +478,18 @@ Planned tests will cover the user-visible behavior of the feature:

 #### Beta

+The following requirements must be met for Beta graduation in v1.35:
+
 - Complete e2e tests coverage
-- Verify configurable device health check timeout implementation works correctly across different plugin vendors (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))
+- **Device Health for Terminated Pods** ([Issue #132978](https://github.com/kubernetes/kubernetes/issues/132978)):
+Ensure that device health status is correctly reported and updated in PodStatus even after a Pod has terminated.
+This is critical for troubleshooting and allowing retry policies to make informed decisions based on device health.
+- **Configurable Device Health Check Timeout** ([Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118), [PR #133752](https://github.com/kubernetes/kubernetes/pull/133752)):
+Verify that the configurable device health check timeout implementation (via `health_check_timeout_seconds` field)
+works correctly across different plugin vendors and hardware types (e.g., GPUs, FPGAs, TPUs, storage devices).
+- **Failure Message Field** ([Issue #133202](https://github.com/kubernetes/kubernetes/issues/133202), [PR #134506](https://github.com/kubernetes/kubernetes/pull/134506)):
+Support for a message field in device health reporting to provide additional context about health status and failures,
+enabling better troubleshooting capabilities.

 #### GA

@@ -497,15 +526,19 @@ No

 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

-Yes, with no side effect except of missing the new field in pod status. Values written
-while the feature was enabled will continue to have it and may be wiped on next update request.
-They also may be ignored on reads.
+Yes, with no side effect except of missing the new field in pod status. When the feature is disabled,
+the values of the `AllocatedResourcesStatus` fields will be dropped when serving the API even if they
+are written to storage. This prevents clients from acting on potentially stale data when the feature
+is off. Values written while the feature was enabled may be wiped on next update request.
 Re-enablement of the feature will not guarantee to keep the values written before the
 feature was disabled.

 ###### What happens if we reenable the feature if it was previously rolled back?

-The pod status will be updated again. Consistency will not be guaranteed for fields written
+The pod status will be updated again. When the feature is re-enabled, there may be a brief period
+where stale values from storage reappear in the API before kubelet and controllers actuate and update
+the values with current device health information. This period should be kept as short as possible
+through normal kubelet reconciliation. Consistency will not be guaranteed for fields written
 before the last enablement.

 ###### Are there any tests for feature enablement/disablement?
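Reviewer note: the rollback answer in the hunk above says `AllocatedResourcesStatus` is dropped when serving the API while the feature is disabled, even if the values remain in storage. A minimal Go sketch of that idea follows, assuming simplified local types and a hypothetical `dropResourceHealth` helper with a plain boolean in place of a real feature gate check. It is not the actual API server code.

```go
package main

import "fmt"

// Simplified, local stand-ins for the pod status types (assumed shapes).
type ResourceStatus struct {
	Name string
}

type ContainerStatus struct {
	Name                     string
	AllocatedResourcesStatus []ResourceStatus
}

// dropResourceHealth (hypothetical helper) returns a copy of the statuses
// for serving. When the feature is disabled, the copy has its
// AllocatedResourcesStatus cleared; the stored input is left untouched.
func dropResourceHealth(stored []ContainerStatus, featureEnabled bool) []ContainerStatus {
	served := make([]ContainerStatus, len(stored))
	copy(served, stored)
	if featureEnabled {
		return served
	}
	for i := range served {
		served[i].AllocatedResourcesStatus = nil
	}
	return served
}

func main() {
	// Values written to storage while the feature was enabled.
	stored := []ContainerStatus{{
		Name:                     "trainer",
		AllocatedResourcesStatus: []ResourceStatus{{Name: "example.com/gpu"}},
	}}

	served := dropResourceHealth(stored, false)
	fmt.Println("served entries:", len(served[0].AllocatedResourcesStatus))
	fmt.Println("stored entries:", len(stored[0].AllocatedResourcesStatus))
}
```

Since the stored values survive in this sketch, re-enabling the feature would serve them again until something overwrites them, which matches the brief stale-value window described in the re-enablement answer.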

keps/sig-node/4680-add-resource-health-to-pod-status/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -24,12 +24,12 @@ see-also:
   - "/keps/sig-node/3573-device-plugin" # Device Plugin

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha #|beta|stable
+stage: beta #|beta|stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.34"
+latest-milestone: "v1.35"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
