-As an alpha feature, Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
-For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy.
-It is also helpful to find out if the device recovers.
+Kubernetes provides a mechanism for monitoring and reporting the health of dynamically allocated infrastructure resources.
+For stateful applications running on specialized hardware, it is critical to know when a device has failed or become unhealthy. It is also helpful to find out if the device recovers.
 
-To enable this functionality, the [`ResourceHealthStatus` feature gate](/docs/reference/command-line-tools-reference/feature-gates/#ResourceHealthStatus)
-must be enabled, and the DRA driver must implement the `DRAResourceHealth` gRPC service.
+To use this functionality, the `ResourceHealthStatus` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/resource-health-status/) must be enabled (beta and enabled by default since v1.36), and the DRA driver must implement the `DRAResourceHealth` gRPC service.
 
-When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet.
-This health information is then exposed directly in the Pod's status.
-The kubelet populates the `allocatedResourcesStatus` field in the status of each container,
-detailing the health of each device assigned to that container.
+When a DRA driver detects that an allocated device has become unhealthy, it reports this status back to the kubelet. This health information is then exposed directly in the Pod's status. The kubelet populates the `allocatedResourcesStatus` field in the status of each container, detailing the health of each device assigned to that container. Each resource health entry can include an optional `message` field with additional human-readable context about the health status, such as error details or failure reasons.
+
+If the kubelet does not receive a health update from a DRA driver within a timeout period, the device's health status is marked as "Unknown". DRA drivers can configure this timeout on a per-device basis by setting the `health_check_timeout_seconds` field in the `DeviceHealth` gRPC message. If not specified, the kubelet uses a default timeout of 30 seconds. This allows different hardware types (for example, GPUs, FPGAs, or storage devices) to use appropriate timeout values based on their health-reporting characteristics.
 
 This provides crucial visibility for users and controllers to react to hardware failures.
 For a Pod that is failing, you can inspect this status to determine if the failure was related to an unhealthy device.
 
+{{< note >}}
+Device health status is not updated in the Pod status after a Pod has terminated (for example, in Failed state).
+{{< /note >}}
+
 ## Pre-scheduled Pods
 
 When you - or another API client - create a Pod with `spec.nodeName` already set, the scheduler gets bypassed.
--- a/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md
+++ b/content/en/docs/reference/command-line-tools-reference/feature-gates/ResourceHealthStatus.md
@@ -6,12 +6,20 @@ _build:
   render: false
 
 stages:
-  - stage: alpha
+  - stage: alpha
     defaultValue: false
     fromVersion: "1.31"
+    toVersion: "1.35"
+  - stage: beta
+    defaultValue: true
+    fromVersion: "1.36"
 ---
 Enable the `allocatedResourcesStatus` field within the `.status` for a Pod. The field
 reports additional details for each container in the Pod,
 with the health information for each device assigned to the Pod.
 
+Starting in v1.36 (beta), the health report includes an optional `message` field that
+provides additional human-readable context about the health status, such as error details
+or failure reasons.
+
 This feature applies to devices managed by both [Device Plugins](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) and [Dynamic Resource Allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-health-monitoring). See [Device plugin and unhealthy devices](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-and-unhealthy-devices) for more details.
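
As a rough sketch of how a client might read the `allocatedResourcesStatus` field described above, the Go program below decodes a Pod status from JSON and lists every device that is not healthy, including the optional `message`. The struct types are a local, simplified mirror written for this example rather than the real `k8s.io/api` types, the JSON field names are assumed to match the Pod API, and the sample values (`example.com/gpu`, `gpu-0`, `gpu-1`, the message text) are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local, simplified mirrors of the relevant Pod status fields (assumed names).
type resourceHealth struct {
	ResourceID string `json:"resourceID"`
	Health     string `json:"health"`
	Message    string `json:"message,omitempty"` // optional human-readable context
}

type resourceStatus struct {
	Name      string           `json:"name"`
	Resources []resourceHealth `json:"resources"`
}

type containerStatus struct {
	Name                     string           `json:"name"`
	AllocatedResourcesStatus []resourceStatus `json:"allocatedResourcesStatus"`
}

type podStatus struct {
	ContainerStatuses []containerStatus `json:"containerStatuses"`
}

// unhealthyDevices returns one line per device whose health is not "Healthy".
func unhealthyDevices(raw []byte) ([]string, error) {
	var st podStatus
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	var out []string
	for _, c := range st.ContainerStatuses {
		for _, r := range c.AllocatedResourcesStatus {
			for _, d := range r.Resources {
				if d.Health == "Healthy" {
					continue
				}
				line := fmt.Sprintf("%s: %s/%s is %s", c.Name, r.Name, d.ResourceID, d.Health)
				if d.Message != "" {
					line += " (" + d.Message + ")"
				}
				out = append(out, line)
			}
		}
	}
	return out, nil
}

// sample is made-up data shaped like the `.status` of a Pod.
const sample = `{
  "containerStatuses": [{
    "name": "app",
    "allocatedResourcesStatus": [{
      "name": "example.com/gpu",
      "resources": [
        {"resourceID": "gpu-0", "health": "Healthy"},
        {"resourceID": "gpu-1", "health": "Unhealthy", "message": "ECC error"}
      ]
    }]
  }]
}`

func main() {
	lines, err := unhealthyDevices([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In practice the input could come from the `.status` object of a Pod fetched as JSON; a production client would more likely use the typed Kubernetes client libraries instead of hand-written structs.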