
Commit 8d4769a

Merge pull request #54420 from harche/dev-1.36
[KEP-4680]: Update ResourceHealthStatus documentation for Beta in v1.36
2 parents 85be459 + 2353c81 commit 8d4769a

3 files changed: 26 additions, 15 deletions

content/en/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins.md

Lines changed: 7 additions & 5 deletions
@@ -189,14 +189,16 @@ failed device is to use the [PodResources API](#monitoring-device-plugin-resourc
 
 {{< feature-state feature_gate_name="ResourceHealthStatus" >}}
 
-By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
-will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
-field
-reports health information for each device assigned to the container.
+When the feature gate `ResourceHealthStatus` is enabled (beta and enabled by default since v1.36),
+the field `allocatedResourcesStatus`
+is added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
+field reports health information for each device assigned to the container.
+Each resource health entry can include an optional `message` field with additional
+human readable context about the health status, such as error details or failure reasons.
 
 For a failed Pod, or where you suspect a fault, you can use this status to understand whether
 the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
-an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
+an over-temperature event, the `allocatedResourcesStatus` field may report this.
 
 
 ## Device plugin deployment
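
Reviewer sketch (not part of the commit): the hunk above describes `allocatedResourcesStatus` as a per-container list of device health entries with an optional `message`. A minimal Python illustration of reading that field from Pod JSON (for example, the output of `kubectl get pod -o json`) might look like the following. The resource name `example.com/gpu`, the device IDs, the container name, and the sample message are hypothetical, and treating a missing `health` value as "Unknown" is an assumption of this sketch rather than documented behavior.

```python
import json


def unhealthy_devices(pod):
    """Return (container, resource, device_id, health, message) tuples for
    every allocated device whose reported health is not "Healthy"."""
    findings = []
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # Assumption: an absent health value is treated as "Unknown".
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name"),
                        resource.get("name"),
                        device.get("resourceID"),
                        health,
                        # `message` is optional, so default to an empty string.
                        device.get("message", ""),
                    ))
    return findings


# Hypothetical Pod status fragment, for illustration only.
SAMPLE_POD = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "trainer",
        "allocatedResourcesStatus": [
          {
            "name": "example.com/gpu",
            "resources": [
              {"resourceID": "gpu-0", "health": "Healthy"},
              {"resourceID": "gpu-1", "health": "Unhealthy",
               "message": "over-temperature event"}
            ]
          }
        ]
      }
    ]
  }
}
""")

if __name__ == "__main__":
    for finding in unhealthy_devices(SAMPLE_POD):
        print(finding)
```

Because `message` is optional, the sketch never assumes it is present; a consumer should be prepared for entries that carry only a health value.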

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

Lines changed: 10 additions & 9 deletions
@@ -402,21 +402,22 @@ For details about the `status.devices` field, see the
 
 {{< feature-state feature_gate_name="ResourceHealthStatus" >}}
 
-As an alpha feature, Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
-For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy.
-It is also helpful to find out if the device recovers.
+Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
+For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.
 
-To enable this functionality, the [`ResourceHealthStatus` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#ResourceHealthStatus)
-must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.
+To use this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled (beta and enabled by default since v1.36), and the DRA driver must implement the `DRAResourceHealth` gRPC service.
 
-When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet.
-This health information is then exposed directly in the Pod's status.
-The kubelet populates the `allocatedResourcesStatus` field in the status of each container,
-detailing the health of each device assigned to that container.
+When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container. Each resource health entry can include an optional `message` field with additional human-readable context about the health status, such as error details or failure reasons.
+
+If the kubelet does not receive a health update from a DRA driver within a timeout period, the device's health status is marked as "Unknown". DRA drivers can configure this timeout on a per-device basis by setting the `health_check_timeout_seconds` field in the `DeviceHealth` gRPC message. If not specified, the kubelet uses a default timeout of 30 seconds. This allows different hardware types (for example, GPUs, FPGAs, or storage devices) to use appropriate timeout values based on their health-reporting characteristics.
 
 This provides crucial visibility for users and controllers to react to hardware failures.
 For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.
 
+{{< note >}}
+Device health status is not updated in the Pod status after a Pod has terminated (for example, in Failed state).
+{{< /note >}}
+
 ## Pre-scheduled Pods
 
 When you - or another API client - create a Pod with `spec.nodeName` already set, the scheduler gets bypassed.
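
Reviewer sketch (not part of the commit): the timeout rule added in the hunk above (mark a device "Unknown" when no health update arrives within a per-device timeout, defaulting to 30 seconds) can be illustrated with a few lines of Python. This is not the kubelet's implementation. The function name, the use of plain float timestamps, the strict `>` comparison at the boundary, and treating a zero or missing timeout as "not specified" are all assumptions made for this illustration.

```python
# Default taken from the documentation text: 30 seconds when the driver
# does not set health_check_timeout_seconds.
DEFAULT_TIMEOUT_SECONDS = 30.0


def effective_health(reported_health, last_update, now, timeout_seconds=None):
    """Return the health to display for a device.

    reported_health: last value reported by the driver ("Healthy" or "Unhealthy").
    last_update, now: timestamps in seconds (any common clock).
    timeout_seconds: per-device override; None or 0 is treated as unspecified
    (an assumption of this sketch).
    """
    timeout = float(timeout_seconds) if timeout_seconds else DEFAULT_TIMEOUT_SECONDS
    if now - last_update > timeout:
        # No update within the timeout window: the last report is stale.
        return "Unknown"
    return reported_health


if __name__ == "__main__":
    # A device last seen 31 seconds ago is stale under the default timeout,
    # but still fresh if its driver asked for a 60-second timeout.
    print(effective_health("Healthy", last_update=100.0, now=131.0))
    print(effective_health("Healthy", last_update=100.0, now=131.0, timeout_seconds=60))
```

The per-device override is what lets a slow-reporting device class avoid flapping to "Unknown" while faster devices keep a tight window.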

content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md

Lines changed: 9 additions & 1 deletion
@@ -6,12 +6,20 @@ _build:
   render: false
 
 stages:
-  - stage: alpha
+  - stage: alpha
     defaultValue: false
     fromVersion: "1.31"
+    toVersion: "1.35"
+  - stage: beta
+    defaultValue: true
+    fromVersion: "1.36"
 ---
 Enable the `allocatedResourcesStatus` field within the `.status` for a Pod. The field
 reports additional details for each container in the Pod,
 with the health information for each device assigned to the Pod.
 
+Starting in v1.36 (beta), the health report includes an optional `message` field that
+provides additional human-readable context about the health status, such as error details
+or failure reasons.
+
 This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
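
Reviewer sketch (not part of the commit): the `stages` front matter above records that the gate was alpha and off by default from 1.31 through 1.35, and is beta and on by default from 1.36. One way to read such a list, sketched in Python, is shown below. The helper names are hypothetical, the data is copied from the diff, and this does not claim to mirror how the documentation site or Kubernetes itself evaluates stages.

```python
# Stage data copied from the front matter in the diff above.
STAGES = [
    {"stage": "alpha", "defaultValue": False, "fromVersion": "1.31", "toVersion": "1.35"},
    {"stage": "beta", "defaultValue": True, "fromVersion": "1.36"},
]


def _parse(version):
    """Turn "1.36" into (1, 36) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def stage_for(version, stages=STAGES):
    """Return the stage entry covering `version`, or None if the gate
    did not exist yet. An entry without toVersion is open-ended."""
    v = _parse(version)
    for entry in stages:
        if v < _parse(entry["fromVersion"]):
            continue
        to_version = entry.get("toVersion")
        if to_version is not None and v > _parse(to_version):
            continue
        return entry
    return None


if __name__ == "__main__":
    for version in ("1.30", "1.33", "1.36"):
        print(version, stage_for(version))
```

Parsing versions into integer tuples avoids the string-comparison trap where "1.9" would sort after "1.36".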
