KEP-5283: DRA: ResourceSlice Status for Device Health Tracking #5469

Open
wants to merge 6 commits into base: master

Conversation


@nojnhuh nojnhuh commented Aug 7, 2025

  • One-line PR description: Add KEP to enable DRA drivers to store device health and other device status in the ResourceSlice
  • Other comments:

/cc @johnbelamaric

nojnhuh added 2 commits August 6, 2025 15:28
KEP-5283: DRA: ResourceSlice Status for Device Health Tracking
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Aug 7, 2025
@nojnhuh
Author

nojnhuh commented Aug 7, 2025

This is far from complete, but I'd like some feedback on the Summary and Motivation sections to make sure the problem is scoped appropriately. I'd also like some help figuring out if the high-level ideas in the "Design Details" section are worth pursuing further or if one of the Alternatives seems like a better place to start.

Comment on lines +257 to +260
#### Enabling Automated Remediation

As a cluster administrator, I want to determine how to remediate unhealthy
devices when different failure modes require different methods of remediation.
Member
@bg-chun bg-chun Aug 9, 2025

I’m wondering how DeviceUnhealthy would be consumed for this user story.
Don’t we need some kind of unhealthy reason or device conditions for this?

Author

Yes, definitely. I was hoping to get some consensus on which high-level approach to start with before defining the entire API though.


Before one can determine how to remediate an issue, one needs to know what the issues are.

I think both of those are rather device-specific, but it may be possible to come up with some broad categories, and how those might be remediated (a rough sketch of how such categories could be encoded follows the list):

  • Device running too hot / fan not working
    • Taint device to reduce its load
    • Notify admin to check cooling
  • Device memory (ECC) errors
    • If recurring non-recoverable ones, taint device and notify admin to check/replace memory
  • Device power delivery issues
    • Taint device to reduce its load
    • Notify admin to check PSU
  • Device power usage throttling
    • If frequent, taint device to reduce its load and notify admin to check device FW power limits
  • Overuse of shared device or workload OOMs
    • Taint device to reduce its load
    • If recurring frequently, notify admin to check workload resource requests
  • Device link quality / stability issues
    • Prefer devices with better link quality => resource request should specify required link BW
    • If severe enough, ban multi-device workloads and notify admin to investigate
  • Specific workload hangs / increased user-space driver error counters
    • Stop scheduling that workload (it may use a buggy user-space driver or use it incorrectly)
    • Alert admin / dev to investigate that workload
  • Old / buggy device FW
    • If some workloads work correctly with that firmware and others do not, use taints
    • Schedule FW upgrade, and taint device during upgrade
  • Device hangs / increased kernel driver / FW / HW error counters
    • Reset specific device part (e.g. compute)
    • Drain device and reset it
    • With too many device resets / error increases, taint device and alert admin
    • Drain all devices on same bus and reset bus
    • Drain whole node and reset it
    • Schedule device firmware update
    • Schedule device replacement
    • (First ones can be done by (kernel) driver automatically, last ones require admin)
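To make this concrete, here is a minimal sketch of how such broad categories could be enumerated as machine-readable reasons, along the lines of the "unhealthy reason" asked about earlier in this thread. All of these names are illustrative assumptions, not part of the KEP or of kubernetes/kubernetes#130606:

type DeviceHealthReason string

const (
	// ReasonThermal covers overheating and fan failures.
	ReasonThermal DeviceHealthReason = "Thermal"
	// ReasonMemoryErrors covers recurring non-recoverable ECC errors.
	ReasonMemoryErrors DeviceHealthReason = "MemoryErrors"
	// ReasonPower covers power delivery issues and power throttling.
	ReasonPower DeviceHealthReason = "Power"
	// ReasonLink covers link quality / stability issues.
	ReasonLink DeviceHealthReason = "Link"
	// ReasonDriverOrFirmware covers hangs, error counters, and old or buggy FW.
	ReasonDriverOrFirmware DeviceHealthReason = "DriverOrFirmware"
)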

Comment on lines 339 to 345
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
// State is the overall health status of a device.
//
// +required
State DeviceHealthState `json:"state"`
}
Contributor

How about introducing a bit more info? I think we can borrow several fields from PodCondition. For example:

Suggested change
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
// State is the overall health status of a device.
//
// +required
State DeviceHealthState `json:"state"`
}
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
// State is the overall health status of a device.
//
// +required
State DeviceHealthState `json:"state"`
// Reason is the reason for the current device health state. It is especially helpful when the state is "Unhealthy".
// +optional
Reason string `json:"reason"`
// LastTransitionTime is the last time the device health transitioned from one state to another.
// +required
LastTransitionTime string `json:"lastTransitionTime"`
// LastReportedTime is the last reported time for the device health from the driver.
// +required
LastReportedTime string `json:"lastReportedTime"`
}
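If the timestamp fields survive into the real API, Kubernetes API conventions would normally use metav1.Time rather than string. A sketch of that variant, offered only as an assumption about where this could land, not as the KEP's actual API:

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DeviceHealthStatus, same shape as the suggestion above, with timestamps
// typed per the usual Kubernetes API conventions instead of string.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	// +required
	State DeviceHealthState `json:"state"`
	// Reason briefly explains the current state; most useful when Unhealthy.
	// +optional
	Reason string `json:"reason,omitempty"`
	// LastTransitionTime is when the device last changed state.
	// +required
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`
	// LastReportedTime is when the driver last reported this state.
	// +required
	LastReportedTime metav1.Time `json:"lastReportedTime"`
}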

Author

Yes, more info like this is necessary for this to be useful. I was hoping to get some feedback on whether this high-level approach is worth pursuing or whether one of the alternatives listed below is a better place to start getting into more of the details of the API.

Add TODO to fill out API
}

// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
Contributor

The dra-health/v1alpha1 gRPC service implemented in #130606 already provides a stream of health updates from the DRA plugin to the Kubelet. This same gRPC service could be leveraged as the source of truth for populating this new ResourceSlice.status field.
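If that route is taken, the flow could look roughly like the sketch below. Every type and field name here is an assumption for illustration only, not the dra-health API or the KEP's API:

// Hypothetical sketch: fold a stream of per-device health updates into a
// status map that a publisher could then write back into the device's
// ResourceSlice.
type healthUpdate struct {
	Device string // device name within the ResourceSlice
	State  string // "Healthy", "Unhealthy", or "Unknown"
}

type sliceHealth struct {
	Devices map[string]string // device name -> last reported state
}

func (s *sliceHealth) apply(updates <-chan healthUpdate) {
	for u := range updates {
		s.Devices[u.Device] = u.State
	}
	// After draining the channel (or periodically), the caller would patch
	// ResourceSlice.status with the contents of s.Devices.
}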

Contributor

++

Contributor

PR kubernetes/kubernetes#130606 introduced

type DeviceHealthStatus string

const (
	// DeviceHealthStatusHealthy represents a healthy device.
	DeviceHealthStatusHealthy DeviceHealthStatus = "Healthy"
	// DeviceHealthStatusUnhealthy represents an unhealthy device.
	DeviceHealthStatusUnhealthy DeviceHealthStatus = "Unhealthy"
	// DeviceHealthStatusUnknown represents a device with unknown health status.
	DeviceHealthStatusUnknown DeviceHealthStatus = "Unknown"
)

Author

@Jpsassine Does that status only surface for devices that are currently allocated to a Pod? If an unallocated device becomes unhealthy, is that visible anywhere in the Kubernetes API?

Contributor

@nojnhuh Yes, the health status from my PR is surfaced only for devices that are currently allocated to a Pod, reported via the new pod.status.containerStatuses.allocatedResourcesStatus field.

However, it seems KEP-5283 could address exactly this visibility gap by adding the health status of all devices to the ResourceSlice.

Regardless of what we surface today, DRA plugins that implement the new DRAResourceHealth gRPC service will be streaming the health of all devices associated with them.

Author

@Jpsassine How might we expose the health of devices that are accessible from multiple Nodes, like network attached devices? Does the kubelet on each Node compute the health of the device separately? Is it possible that two Nodes might have differing opinions on the health of the same device? I'm wondering if this KEP would need to define a way to express the health of a device with respect to each Node that could attach it.

Contributor

I believe the DRA driver is the source of truth for device health here, not the Kubelet. In the architecture I implemented for KEP-4680, the kubelet acts as a client that consumes health status streamed from the node-local DRA plugin via the DRAResourceHealth gRPC service. This design inherently handles the possibility of differing health perspectives between nodes (although I don't see how there could be a legitimate discrepancy in the same device's health between nodes). Since a ResourceSlice is published by the DRA driver running on a specific node, the health status it contains would naturally reflect the device's condition from that node's perspective.

Example, assuming the device health data is used to populate ResourceSlice device statuses:

  • If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as Unhealthy in the ResourceSlice for that node.
  • Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as Healthy in its ResourceSlice.

Although I think this would be odd, it shows that the current model should account for the scenario where one node reports the same device as healthy and another as unhealthy.

@SergeyKanzhelev, please correct me if I am wrong, but to the best of my understanding this is how the device health works with DRA now.
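One way to picture that, purely as an illustrative sketch (these names are assumptions, not proposed API), is a per-node entry for each shared device:

// NodeDeviceHealth is a hypothetical per-node view of a shared (e.g.
// network-attached) device's health.
type NodeDeviceHealth struct {
	// NodeName is the node whose driver observed this state.
	NodeName string `json:"nodeName"`
	// State is that node's view of the device: Healthy, Unhealthy, or Unknown.
	State string `json:"state"`
}

// SharedDeviceHealth carries one entry per node that can attach the device,
// so Node A can report Unhealthy while Node B reports Healthy.
type SharedDeviceHealth struct {
	// Device is the device's name within the ResourceSlice.
	Device string `json:"device"`
	// PerNode holds each attaching node's view of the device.
	PerNode []NodeDeviceHealth `json:"perNode"`
}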

Comment on lines +184 to +185
some action to restore the device to a healthy state. This KEP defines a
standard way to determine whether or not a device is considered healthy.
Contributor

This sounds like a bigger scope than the KEP title.

Suggested change
some action to restore the device to a healthy state. This KEP defines a
standard way to determine whether or not a device is considered healthy.
Some action is required to restore the device to a healthy state. This KEP proposes
a new entry in the ResourceSlice object to allow DRA drivers to report whether or not a device is considered healthy.

Author

I think I would prefer to change the title then if this statement is too far off from it.

@johnbelamaric Would it be appropriate to retitle #5283 something like "DRA: Device Health Status" that doesn't imply any particular solution like adding a new status field in ResourceSlice but specific enough to differentiate it from #4680?

Comment on lines +227 to +228
- Define what constitutes a "healthy" or "unhealthy" device. That distinction is
made by each DRA driver.
Contributor

This non-goal collides with what you said in the summary section, line 184.

Author

This non-goal is only saying that Kubernetes doesn't care about the underlying characteristics of a device that cause a driver to consider it healthy or not. The summary says cluster administrators are interested in identifying and remediating unhealthy devices. Are those at odds with each other?

IMHO those clearly conflict. To remediate, one needs to know what the specific issues and their root causes are.

Author

This KEP describes where health information can be found and its general structure. DRA drivers populate that health information in the API. Cluster admins use that to help identify and remediate issues.

I don't see where any of that conflicts?

@eero-t eero-t Aug 18, 2025

So you're saying that in addition to the non-standard information the admin requires to actually do something about the health issue, there would be a standard health flag, which the admin would monitor to see whether there's a need to look further?

This immediately raises the question of who then decides and configures which conditions trigger such a flag.

Because if the flag is raised on things that are irrelevant to the admin, or not raised on things the admin cares about, it's not really helping; the admin would need to follow the non-standard info anyway.

Add alternative for vendor-provided metrics
@nojnhuh
Author

nojnhuh commented Aug 13, 2025

This is still technically "in-progress" in that it's not ready to merge right now, but I'm ready for early feedback on what's there now to help me fill out the rest of the KEP.

Removing WIP:
/retitle KEP-5283: DRA: ResourceSlice Status for Device Health Tracking

@k8s-ci-robot k8s-ci-robot changed the title WIP: KEP-5283: DRA: ResourceSlice Status for Device Health Tracking KEP-5283: DRA: ResourceSlice Status for Device Health Tracking Aug 13, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2025
Comment on lines +206 to +209
[taint](https://kep.k8s.io/5055) the devices. A standard representation of
device health in the ResourceSlice API is needed to express the state of the
devices in cases where the side effects (`NoSchedule`/`NoExecute`) of taints are
not desired.

This should state what is needed other than NoSchedule/NoExecute: something that has no effect and is just informational?

If yes, that's what cluster device telemetry and metric alerts are for...

Author

This section only mentions taints and tolerations as the mechanism most closely related to describing health at the moment. Proposing modifications to taints is a better fit for the Proposal or Alternatives sections than the Summary section.


No, what I meant is that this (or the new "No Side-Effects By Default" section) does not explain the motivation for this KEP, i.e. why a no-effect health flag is desirable (especially one with no details of what the actual health issues are).

Comment on lines +255 to +256
As a cluster administrator, I want to determine the overall health of the
DRA-exposed devices available throughout my cluster.

This could be done with device taints and a tool listing them (e.g. listing all device taints, or ones matching a given pattern for nodes, devices, or taints), or, if alerts are generated for those taints, by viewing them in the Alertmanager GUI.

=> User story needs to be something where taints are not enough.

Author

I don't think we need to change this story for the purpose of disqualifying taints if taints are otherwise an acceptable solution. The Motivation section does describe the need for a solution without side effects though, so I'll add a new story for that to avoid any single story describing too many things at once.


My comment was because the Motivation section states the need for a purely informational health flag, but does not state why it's needed, by whom, or what the use case for such no-effect health information is.

(If the device is still supposed to be used by all potential workloads, it doesn't seem particularly unhealthy?)

Comment on lines +1026 to +1031
The main cost of that flexibility is the lack of standardization, where cluster
administrators have to track down from each vendor how to determine if a given
device is in a healthy state as opposed to inspecting a well-defined area of a
vendor-agnostic API like ResourceSlice. This lack of standardization also makes
integrations like generic controllers that automatically taint unhealthy devices
less straightforward to implement.

There is an OpenTelemetry standard for the metrics: https://opentelemetry.io/docs/specs/semconv/

(One of the goals of that standardization is providing e.g. drill-down support from whole-node power usage to the power usage of individual components inside it.)

Admittedly it's still rather WIP in regard to health-related device metrics: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

See my list above, and e.g.:

Device telemetry stacks provided by the vendors most likely haven't adopted it yet either...

Author

Even with a standard way to determine certain values like fan speed or battery level, vendors need to document what those mean with respect to how healthy a device is. I think that's an acceptable way to consider implementing this KEP, but it is a step down in some ways from including an overall "healthy"/"unhealthy" signal that could be identical for every kind of device.

@eero-t eero-t Aug 18, 2025

Some information / metrics can be rather self-evident (e.g. a failed fan). As to the rest of the metrics, you may have a somewhat optimistic view of how much vendors (k8s driver developers) know about their health impact.

How a given set of (less obvious) metrics maps to the long-term health of a given device, and at what probability over what time interval, is information that's more likely to be in the possession of large cluster operators and their admins.

(HW vendors do not constantly run production workloads in huge clusters and collect metrics & health statistics about how they behave; their customers do that, and I suspect they're unlikely to share that info with anybody, even their HW vendor, except to fix specific issues, and maybe only with a specific team / persons.)

Add user story for purely informational status
Simplify wording in metrics alternative