KEP-5283: DRA: ResourceSlice Status for Device Health Tracking #5469

Open
wants to merge 6 commits into base: master

Conversation


@nojnhuh nojnhuh commented Aug 7, 2025

  • One-line PR description: Add KEP to enable DRA drivers to store device health and other device status in the ResourceSlice
  • Other comments:

/cc @johnbelamaric

nojnhuh added 2 commits August 6, 2025 15:28
KEP-5283: DRA: ResourceSlice Status for Device Health Tracking
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 7, 2025
@nojnhuh
Author

nojnhuh commented Aug 7, 2025

This is far from complete, but I'd like some feedback on the Summary and Motivation sections to make sure the problem is scoped appropriately. I'd also like some help figuring out if the high-level ideas in the "Design Details" section are worth pursuing further or if one of the Alternatives seems like a better place to start.

Comment on lines +257 to +260
#### Enabling Automated Remediation

As a cluster administrator, I want to determine how to remediate unhealthy
devices when different failure modes require different methods of remediation.
Member
@bg-chun bg-chun Aug 9, 2025

I’m wondering how DeviceUnhealthy would be consumed for this user story.
Don’t we need some kind of unhealthy reason or device conditions for this?

Author

Yes, definitely. I was hoping to get some consensus on which high-level approach to start with before defining the entire API though.


Before one can determine how to remediate an issue, one needs to know what the issues are.

I think both of those are rather device-specific, but it may be possible to come up with some broad categories, and how those might be remediated (a rough sketch of how such categories could be encoded follows the list):

  • Device running too hot / fan not working
    • Taint device to reduce its load
    • Notify admin to check cooling
  • Device memory (ECC) errors
    • If recurring non-recoverable ones, taint device and notify admin to check/replace memory
  • Device power delivery issues
    • Taint device to reduce its load
    • Notify admin to check PSU
  • Device power usage throttling
    • If frequent, taint device to reduce its load and notify admin to check device FW power limits
  • Overuse of shared device or workload OOMs
    • Taint device to reduce its load
    • If recurring frequently, notify admin to check workload resource requests
  • Device link quality / stability issues
    • Prefer devices with better link quality => resource request should specify required link BW
    • If severe enough, ban multi-device workloads and notify admin to investigate
  • Specific workload hangs / increased user-space driver error counters
    • Stop scheduling that workload (it may use a buggy user-space driver or use it incorrectly)
    • Alert admin / dev to investigate that workload
  • Old / buggy device FW
    • If some workloads work correctly with that firmware and others do not, use taints
    • Schedule FW upgrade, and taint device during upgrade
  • Device hangs / increased kernel driver / FW / HW error counters
    • Reset specific device part (e.g. compute)
    • Drain device and reset it
    • With too many device resets / error increases, taint device and alert admin
    • Drain all devices on same bus and reset bus
    • Drain whole node and reset it
    • Schedule device firmware update
    • Schedule device replacement
    • (First ones can be done by (kernel) driver automatically, last ones require admin)
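To make this concrete, here is a minimal sketch of how such broad categories could be enumerated as machine-readable reasons, along the lines of the "unhealthy reason" asked about earlier in this thread. All of these names are illustrative assumptions, not part of the KEP or of kubernetes/kubernetes#130606:

type DeviceHealthReason string

const (
	// ReasonThermal covers overheating and fan failures.
	ReasonThermal DeviceHealthReason = "Thermal"
	// ReasonMemoryErrors covers recurring non-recoverable ECC errors.
	ReasonMemoryErrors DeviceHealthReason = "MemoryErrors"
	// ReasonPower covers power delivery issues and power throttling.
	ReasonPower DeviceHealthReason = "Power"
	// ReasonLink covers link quality / stability issues.
	ReasonLink DeviceHealthReason = "Link"
	// ReasonDriverOrFirmware covers hangs, error counters, and old or buggy FW.
	ReasonDriverOrFirmware DeviceHealthReason = "DriverOrFirmware"
)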

Comment on lines 339 to 345
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
// State is the overall health status of a device.
//
// +required
State DeviceHealthState `json:"state"`
}
Contributor

How about introducing a bit more info? I think we can borrow several fields from PodCondition. For example:

Suggested change
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
// State is the overall health status of a device.
//
// +required
State DeviceHealthState `json:"state"`
}
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
// State is the overall health status of a device.
//
// +required
State DeviceHealthState `json:"state"`
// Reason is the reason for the current device health state. It is especially helpful when the state is "Unhealthy".
// +optional
Reason string `json:"reason"`
// LastTransitionTime is the last time the device health transitioned from one state to another.
// +required
LastTransitionTime string `json:"lastTransitionTime"`
// LastReportedTime is the last reported time for the device health from the driver.
// +required
LastReportedTime string `json:"lastReportedTime"`
}
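If the timestamp fields survive into the real API, Kubernetes API conventions would normally use metav1.Time rather than string. A sketch of that variant, offered only as an assumption about where this could land, not as the KEP's actual API:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DeviceHealthStatus, same shape as the suggestion above, with timestamps
// typed per the usual Kubernetes API conventions instead of string.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	// +required
	State DeviceHealthState `json:"state"`
	// Reason briefly explains the current state; most useful when Unhealthy.
	// +optional
	Reason string `json:"reason,omitempty"`
	// LastTransitionTime is when the device last changed state.
	// +required
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
	// LastReportedTime is when the driver last reported this state.
	// +required
	LastReportedTime metav1.Time `json:"lastReportedTime"`
}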

Author

Yes, more info like this is necessary for this to be useful. I was hoping to get some feedback on whether this high-level approach is worth pursuing or whether one of the alternatives listed below is a better place to start getting into more of the details of the API.

Add TODO to fill out API
}

// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
Contributor

The dra-health/v1alpha1 gRPC service implemented in #130606 already provides a stream of health updates from the DRA plugin to the Kubelet. This same gRPC service could be leveraged as the source of truth for populating this new ResourceSlice.status field.
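If that route is taken, the flow could look roughly like the sketch below. Every type and field name here is an assumption for illustration only, not the dra-health API or the KEP's API:

// Hypothetical sketch: fold a stream of per-device health updates into a
// status map that a publisher could then write back into the device's
// ResourceSlice.
type healthUpdate struct {
	Device string // device name within the ResourceSlice
	State  string // "Healthy", "Unhealthy", or "Unknown"
}

type sliceHealth struct {
	Devices map[string]string // device name -> last reported state
}

func (s *sliceHealth) apply(updates <-chan healthUpdate) {
	for u := range updates {
		s.Devices[u.Device] = u.State
	}
	// After draining the channel (or periodically), the caller would patch
	// ResourceSlice.status with the contents of s.Devices.
}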

Contributor

++

Contributor

PR kubernetes/kubernetes#130606 introduced

type DeviceHealthStatus string

const (
	// DeviceHealthStatusHealthy represents a healthy device.
	DeviceHealthStatusHealthy DeviceHealthStatus = "Healthy"
	// DeviceHealthStatusUnhealthy represents an unhealthy device.
	DeviceHealthStatusUnhealthy DeviceHealthStatus = "Unhealthy"
	// DeviceHealthStatusUnknown represents a device with unknown health status.
	DeviceHealthStatusUnknown DeviceHealthStatus = "Unknown"
)

Author

@Jpsassine Does that status only surface for devices that are currently allocated to a Pod? If an unallocated device becomes unhealthy, is that visible anywhere in the Kubernetes API?

Contributor

@nojnhuh Yes, the health status from my PR is surfaced only for devices that are currently allocated to a Pod, reported via the new pod.status.containerStatuses.allocatedResourcesStatus field.

However, it seems KEP-5283 could address exactly this visibility gap by adding the health status of all devices to the ResourceSlice.

Regardless of what we surface today, DRA plugins that implement the new DRAResourceHealth gRPC service will be streaming the health of all devices associated with them.

Author

@Jpsassine How might we expose the health of devices that are accessible from multiple Nodes, like network attached devices? Does the kubelet on each Node compute the health of the device separately? Is it possible that two Nodes might have differing opinions on the health of the same device? I'm wondering if this KEP would need to define a way to express the health of a device with respect to each Node that could attach it.

Contributor

I believe the DRA driver is the source of truth for device health here, not the Kubelet. In the architecture I implemented for KEP-4680, the kubelet acts as a client that consumes health status streamed from the node-local DRA plugin via the DRAResourceHealth gRPC service. This design inherently handles the possibility of differing health perspectives between nodes (although I don't see how there could be a legitimate discrepancy in the same device's health between nodes). Since a ResourceSlice is published by the DRA driver running on a specific node, the health status it contains would naturally reflect the device's condition from that node's perspective.

Example, assuming the device health data is used to populate ResourceSlice device statuses:

  • If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as Unhealthy in the ResourceSlice for that node.
  • Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as Healthy in its ResourceSlice.

Although I think this would be odd, it shows that the current model should account for the scenario where one node reports the same device as healthy and another as unhealthy.

@SergeyKanzhelev, please correct me if I am wrong, but to the best of my understanding this is how the device health works with DRA now.
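One way to picture that, purely as an illustrative sketch (these names are assumptions, not proposed API), is a per-node entry for each shared device:

// NodeDeviceHealth is a hypothetical per-node view of a shared (e.g.
// network-attached) device's health.
type NodeDeviceHealth struct {
	// NodeName is the node whose driver observed this state.
	NodeName string `json:"nodeName"`
	// State is that node's view of the device: Healthy, Unhealthy, or Unknown.
	State string `json:"state"`
}

// SharedDeviceHealth carries one entry per node that can attach the device,
// so Node A can report Unhealthy while Node B reports Healthy.
type SharedDeviceHealth struct {
	// Device is the device's name within the ResourceSlice.
	Device string `json:"device"`
	// PerNode holds each attaching node's view of the device.
	PerNode []NodeDeviceHealth `json:"perNode"`
}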

Comment on lines +184 to +185
some action to restore the device to a healthy state. This KEP defines a
standard way to determine whether or not a device is considered healthy.
Contributor

This sounds like a bigger scope than the KEP title.

Suggested change
some action to restore the device to a healthy state. This KEP defines a
standard way to determine whether or not a device is considered healthy.
Some action is required to restore the device to a healthy state. This KEP proposes
a new entry in the ResourceSlice object to allow DRA drivers to report whether or not a device is considered healthy.

Author

I think I would prefer to change the title then if this statement is too far off from it.

@johnbelamaric Would it be appropriate to retitle #5283 something like "DRA: Device Health Status" that doesn't imply any particular solution like adding a new status field in ResourceSlice but specific enough to differentiate it from #4680?

Comment on lines +227 to +228
- Define what constitutes a "healthy" or "unhealthy" device. That distinction is
made by each DRA driver.
Contributor

This non-goal collides with what you said in the summary section, line 184.

Author

This non-goal is only saying that Kubernetes doesn't care about the underlying characteristics of a device that cause a driver to consider it healthy or not. The summary says cluster administrators are interested in identifying and remediating unhealthy devices. Are those at odds with each other?

IMHO those clearly conflict. To remediate, one needs to know what the specific issues and their root causes are.

Author

This KEP describes where health information can be found and its general structure. DRA drivers populate that health information in the API. Cluster admins use that to help identify and remediate issues.

I don't see where any of that conflicts?

@eero-t eero-t Aug 18, 2025

So you're saying that in addition to the non-standard information the admin requires to actually do something about the health issue, there would be a standard health flag, which the admin would monitor to see whether there's a need to look further?

This immediately raises the question of who then decides and configures which conditions trigger such a flag.

Because if the flag is raised on things that are irrelevant to the admin, or not raised on things the admin cares about, it's not really helping; the admin would need to follow the non-standard info anyway.

Add alternative for vendor-provided metrics
@nojnhuh
Author

nojnhuh commented Aug 13, 2025

This is still technically "in-progress" in that it's not ready to merge right now, but I'm ready for early feedback on what's there now to help me fill out the rest of the KEP.

Removing WIP:
/retitle KEP-5283: DRA: ResourceSlice Status for Device Health Tracking

@k8s-ci-robot k8s-ci-robot changed the title WIP: KEP-5283: DRA: ResourceSlice Status for Device Health Tracking KEP-5283: DRA: ResourceSlice Status for Device Health Tracking Aug 13, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2025
Comment on lines +206 to +209
[taint](https://kep.k8s.io/5055) the devices. A standard representation of
device health in the ResourceSlice API is needed to express the state of the
devices in cases where the side effects (`NoSchedule`/`NoExecute`) of taints are
not desired.

This should state what is needed other than NoSchedule/NoExecute: something that has no effect and is just informational?

If yes, that's what cluster device telemetry and metric alerts are for...

Author

This section only mentions taints and tolerations as the mechanism most closely related to describing health at the moment. Proposing modifications to taints is a better fit for the Proposal or Alternatives sections than the Summary section.


No, what I meant is that this (or the new "No Side-Effects By Default" section) does not explain the motivation for this KEP, i.e. why a no-effect health flag is desirable (especially one with no details of what the actual health issues are).

Comment on lines +255 to +256
As a cluster administrator, I want to determine the overall health of the
DRA-exposed devices available throughout my cluster.

This could be done with device taints and a tool listing them (e.g. listing all device taints, or ones matching a given pattern for nodes, devices, or taints), or, if alerts are generated for those taints, by viewing them in the Alertmanager GUI.

=> User story needs to be something where taints are not enough.

Author

I don't think we need to change this story for the purpose of disqualifying taints if taints are otherwise an acceptable solution. The Motivation section does describe the need for a solution without side effects though, so I'll add a new story for that to avoid any single story describing too many things at once.


My comment was because the Motivation section states the need for a purely informational health flag, but does not state why it's needed, by whom, or what the use case for such no-effect health information is.

(If the device is still supposed to be used by all potential workloads, it doesn't seem particularly unhealthy?)

Comment on lines +1026 to +1031
The main cost of that flexibility is the lack of standardization, where cluster
administrators have to track down from each vendor how to determine if a given
device is in a healthy state as opposed to inspecting a well-defined area of a
vendor-agnostic API like ResourceSlice. This lack of standardization also makes
integrations like generic controllers that automatically taint unhealthy devices
less straightforward to implement.

There is an OpenTelemetry standard for the metrics: https://opentelemetry.io/docs/specs/semconv/

(One of the goals of that standardization is providing e.g. drill-down support from whole-node power usage to the power usage of individual components inside it.)

Admittedly it's still rather WIP in regard to health-related device metrics: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

See my list above, and e.g.:

Device telemetry stacks provided by the vendors most likely haven't adopted it yet either...

Author

Even with a standard way to determine certain values like fan speed or battery level, vendors need to document what those mean with respect to how healthy a device is. I think that's an acceptable way to consider implementing this KEP, but it is a step down in some ways from including an overall "healthy"/"unhealthy" signal that could be identical for every kind of device.

@eero-t eero-t Aug 18, 2025

Some information / metrics can be rather self-evident (e.g. a failed fan). As to the rest of the metrics, you may have a somewhat optimistic view of how much vendors (k8s driver developers) know about their health impact.

How a given set of (less obvious) metrics maps to the long-term health of a given device, and at what probability over what time interval, is information that's more likely to be in the possession of large cluster operators and their admins.

(HW vendors do not constantly run production workloads in huge clusters and collect metrics & health statistics about how they behave; their customers do that, and I suspect they're unlikely to share that info with anybody, even their HW vendor, except to fix specific issues, and maybe only with a specific team / persons.)

Add user story for purely informational status
Simplify wording in metrics alternative