Merge pull request #54420 from harche/dev-1.36

k8s-ci-robot · web-flow · commit 8d4769ad6b18 · 2026-04-07T01:27:32.000+05:30
[KEP-4680]: Update ResourceHealthStatus documentation for Beta in v1.36
diff --git a/content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md b/content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md
@@ -189,14 +189,16 @@ failed device is to use the [PodResources API](#monitoring-device-plugin-resourc
 
 {{< feature-state feature_gate_name="ResourceHealthStatus" >}}
 
-By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
-will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
-field
-reports health information for each device assigned to the container.
+When the feature gate `ResourceHealthStatus` is enabled (beta and enabled by default since v1.36),
+the field `allocatedResourcesStatus`
+is added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
+field reports health information for each device assigned to the container.
+Each resource health entry can include an optional `message` field with additional
+human-readable context about the health status, such as error details or failure reasons.
 
 For a failed Pod, or where you suspect a fault, you can use this status to understand whether
 the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
-an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
+an over-temperature event, the `allocatedResourcesStatus` field may report this.
 
 
 ## Device plugin deployment
diff --git a/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md b/content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md
@@ -402,21 +402,22 @@ For details about the `status.devices` field, see the
 
 {{< feature-state feature_gate_name="ResourceHealthStatus" >}}
 
-As an alpha feature, Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
-For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy.
-It is also helpful to find out if the device recovers.
+Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
+For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.
 
-To enable this functionality, the [`ResourceHealthStatus` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#ResourceHealthStatus)
-must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.
+To use this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled (beta and enabled by default since v1.36), and the DRA driver must implement the `DRAResourceHealth` gRPC service.
 
-When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet.
-This health information is then exposed directly in the Pod's status.
-The kubelet populates the `allocatedResourcesStatus` field in the status of each container,
-detailing the health of each device assigned to that container.
+When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container. Each resource health entry can include an optional `message` field with additional human-readable context about the health status, such as error details or failure reasons.
+
+If the kubelet does not receive a health update from a DRA driver within a timeout period, the device's health status is marked as "Unknown". DRA drivers can configure this timeout on a per-device basis by setting the `health_check_timeout_seconds` field in the `DeviceHealth` gRPC message. If not specified, the kubelet uses a default timeout of 30 seconds. This allows different hardware types (for example, GPUs, FPGAs, or storage devices) to use appropriate timeout values based on their health-reporting characteristics.
 
 This provides crucial visibility for users and controllers to react to hardware failures.
 For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.
 
+{{< note >}}
+Device health status is not updated in the Pod status after a Pod has terminated (for example, in the Failed state).
+{{< /note >}}
+
 ## Pre-scheduled Pods
 
 When you - or another API client - create a Pod with `spec.nodeName` already set, the scheduler gets bypassed.
diff --git a/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md b/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md
@@ -6,12 +6,20 @@ _build:
   render: false
 
 stages:
-  - stage: alpha 
+  - stage: alpha
     defaultValue: false
     fromVersion: "1.31"
+    toVersion: "1.35"
+  - stage: beta
+    defaultValue: true
+    fromVersion: "1.36"
 ---
 Enable the `allocatedResourcesStatus` field within the `.status` for a Pod. The field
 reports additional details for each container in the Pod,
 with the health information for each device assigned to the Pod.
 
+Starting in v1.36 (beta), the health report includes an optional `message` field that
+provides additional human-readable context about the health status, such as error details
+or failure reasons.
+
 This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
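
As an illustration of the `allocatedResourcesStatus` field described in this change, the following is a minimal sketch of how a client might list devices that are not reported as healthy. The sample Pod status is hand-written and hypothetical (the container names, the `example.com/gpu` resource name, and the device IDs are invented). The JSON keys `name`, `resources`, `resourceID`, and `health` are assumed here and should be checked against the Pod API reference; only `allocatedResourcesStatus` and the optional `message` field are named in the text above. Treating a missing `health` value as "Unknown" is also an assumption of this sketch.

```python
import json

# Hypothetical Pod status excerpt, written by hand for illustration only.
POD_JSON = """
{
  "status": {
    "containerStatuses": [
      {
        "name": "trainer",
        "allocatedResourcesStatus": [
          {
            "name": "example.com/gpu",
            "resources": [
              {"resourceID": "gpu-0", "health": "Healthy"},
              {"resourceID": "gpu-1", "health": "Unhealthy",
               "message": "over-temperature event"}
            ]
          }
        ]
      },
      {"name": "sidecar"}
    ]
  }
}
"""


def unhealthy_devices(pod):
    """Return one tuple per device whose reported health is not "Healthy".

    Each tuple is (container name, resource name, device ID, health, message).
    The key names used below are assumptions about the Pod status layout.
    """
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # Containers without allocated devices may omit the field entirely.
        for res in cs.get("allocatedResourcesStatus", []):
            for dev in res.get("resources", []):
                # Assumption: a missing health value is treated as "Unknown".
                health = dev.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        cs["name"],
                        res["name"],
                        dev["resourceID"],
                        health,
                        dev.get("message", ""),  # optional field
                    ))
    return findings


if __name__ == "__main__":
    for row in unhealthy_devices(json.loads(POD_JSON)):
        print(row)
```

The same function could be pointed at the output of `kubectl get pod <name> -o json` instead of the embedded sample. Because the `message` field is optional, the sketch falls back to an empty string when it is absent.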