+- **Failure Message Field:** The `ResourceHealth` struct includes an optional `Message` field that provides
+  additional human-readable context about device health status. This field enables Device Plugins and DRA drivers
+  to report detailed error information, failure reasons, and diagnostic information beyond the basic health status.
+  This enhancement improves troubleshooting capabilities for device-related failures. See
+  [Issue #133202](https://github.com/kubernetes/kubernetes/issues/133202) and
+  [PR #134506](https://github.com/kubernetes/kubernetes/pull/134506) for implementation details.
+
+- **Device Health for Terminated Pods:** Kubelet will continue to update the device health status in PodStatus
+  even after a Pod has terminated (e.g., in Failed state or CrashLoopBackOff). This is critical for post-mortem
+  troubleshooting and enables retry policies (such as those introduced by
+  [KEP-3329: Retriable and non-retriable Pod failures for Jobs](https://github.com/kubernetes/enhancements/issues/3329))
+  to make informed decisions based on whether the failure was caused by an unhealthy device. See
+  [Issue #132978](https://github.com/kubernetes/kubernetes/issues/132978) for more details.
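To illustrate how a consumer might use the optional `Message` field described above, here is a minimal Go sketch. It does not import the real Kubernetes API types: `ResourceHealth` below is a simplified local stand-in, and the `ResourceID` and `Health` field shapes, the status constants, the `Describe` helper, and the use of an empty string to mean "no message" are all assumptions made for this example.

```go
package main

import "fmt"

// ResourceHealthStatus is a local stand-in for the reported health value.
// It is not imported from k8s.io/api; the constants are assumptions.
type ResourceHealthStatus string

const (
	Healthy   ResourceHealthStatus = "Healthy"
	Unhealthy ResourceHealthStatus = "Unhealthy"
	Unknown   ResourceHealthStatus = "Unknown"
)

// ResourceHealth is a hypothetical, simplified mirror of the struct named
// in the text. Only the existence of an optional Message field comes from
// the source; everything else here is assumed for illustration.
type ResourceHealth struct {
	ResourceID string
	Health     ResourceHealthStatus
	Message    string // optional; empty when the plugin or driver gave no detail
}

// Describe renders one line of troubleshooting output and appends the
// message only when one was reported.
func Describe(r ResourceHealth) string {
	if r.Message == "" {
		return fmt.Sprintf("%s: %s", r.ResourceID, r.Health)
	}
	return fmt.Sprintf("%s: %s (%s)", r.ResourceID, r.Health, r.Message)
}

func main() {
	fmt.Println(Describe(ResourceHealth{ResourceID: "gpu-0", Health: Healthy}))
	// prints "gpu-0: Healthy"
	fmt.Println(Describe(ResourceHealth{
		ResourceID: "gpu-1",
		Health:     Unhealthy,
		Message:    "ECC error threshold exceeded",
	}))
	// prints "gpu-1: Unhealthy (ECC error threshold exceeded)"
}
```

Because the field is optional, a consumer along these lines should behave sensibly when no message is present, which is why the sketch falls back to the bare status.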
 ### Risks and Mitigations

 There are not many risks with this KEP. The biggest risk is that Device Plugins will not be
@@ -459,8 +478,18 @@ Planned tests will cover the user-visible behavior of the feature:
 #### Beta

+The following requirements must be met for Beta graduation in v1.35:
+
 - Complete e2e test coverage
-- Verify configurable device health check timeout implementation works correctly across different plugin vendors (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))
+- **Device Health for Terminated Pods** ([Issue #132978](https://github.com/kubernetes/kubernetes/issues/132978)):
+  Ensure that device health status is correctly reported and updated in PodStatus even after a Pod has terminated.
+  This is critical for troubleshooting and allowing retry policies to make informed decisions based on device health.
+- **Configurable Device Health Check Timeout** ([Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118), [PR #133752](https://github.com/kubernetes/kubernetes/pull/133752)):
+  Verify that the configurable device health check timeout implementation (via `health_check_timeout_seconds` field)
+  works correctly across different plugin vendors and hardware types (e.g., GPUs, FPGAs, TPUs, storage devices).
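The timeout requirement above names a `health_check_timeout_seconds` field but does not show how a value might be applied. The following Go sketch shows one plausible reading: convert the field to a duration, fall back to a default when it is unset, and treat a device as stale when no health update has arrived within that window. The 30-second fallback, the helper names `EffectiveTimeout` and `IsStale`, and the rule that a non-positive value means "unset" are assumptions for illustration, not behavior confirmed by the text.

```go
package main

import (
	"fmt"
	"time"
)

// assumedDefaultTimeout is a placeholder fallback chosen for this sketch;
// the text above does not state the real default.
const assumedDefaultTimeout = 30 * time.Second

// EffectiveTimeout converts a health_check_timeout_seconds value into a
// duration, using the assumed default when the value is zero or negative.
func EffectiveTimeout(seconds int64) time.Duration {
	if seconds <= 0 {
		return assumedDefaultTimeout
	}
	return time.Duration(seconds) * time.Second
}

// IsStale reports whether the time since the last health update exceeds
// the timeout, in which case a consumer might stop trusting the last status.
func IsStale(sinceLastUpdate, timeout time.Duration) bool {
	return sinceLastUpdate > timeout
}

func main() {
	timeout := EffectiveTimeout(5)
	fmt.Println(timeout)                         // prints "5s"
	fmt.Println(IsStale(8*time.Second, timeout)) // prints "true"
	fmt.Println(IsStale(2*time.Second, timeout)) // prints "false"
	fmt.Println(EffectiveTimeout(0))             // prints "30s"
}
```

Keeping the conversion and the staleness check as separate pure functions makes it straightforward to exercise the same logic with the different timeout values that vendors might report for GPUs, FPGAs, or other hardware.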