KEP-5283: DRA: ResourceSlice Status for Device Health Tracking #5469
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
This is far from complete, but I'd like some feedback on the Summary and Motivation sections to make sure the problem is scoped appropriately. I'd also like some help figuring out if the high-level ideas in the "Design Details" section are worth pursuing further or if one of the Alternatives seems like a better place to start.
> #### Enabling Automated Remediation
>
> As a cluster administrator, I want to determine how to remediate unhealthy
> devices when different failure modes require different methods of remediation.
I’m wondering how DeviceUnhealthy would be consumed for this user story.
Don’t we need some kind of unhealthy reason or device conditions for this?
Yes, definitely. I was hoping to get some consensus on which high-level approach to start with before defining the entire API though.
Before one can determine how to remediate an issue, one needs to know what are the issues.
I think both of those are rather device specific, but it may be possible to come up with some broad categories, and how those might be remediated:
- Device running too hot / fan not working
- Taint device to reduce its load
- Notify admin to check cooling
- Device memory (ECC) errors
- If recurring non-recoverable ones, taint device and notify admin to check/replace memory
- Device power delivery issues
- Taint device to reduce its load
- Notify admin to check PSU
- Device power usage throttling
- If frequent, taint device to reduce its load and notify admin to check device FW power limits
- Overuse of shared device or workload OOMs
- Taint device to reduce its load
- If recurring frequently, notify admin to check workload resource requests
- Device link quality / stability issues
- Prefer devices with better link quality => resource request should specify required link BW
- If severe enough, ban multi-device workloads and notify admin to investigate
- Specific workload hangs / increased user-space driver error counters
- Stop scheduling that workload (it may use a buggy user-space driver or use it incorrectly)
- Alert admin / dev to investigate that workload
- Old / buggy device FW
- If there's a set of workloads that work, and do not work correctly with that, use taints
- Schedule FW upgrade, and taint device during upgrade
- Device hangs / increased kernel driver / FW / HW error counters
- Reset specific device part (e.g. compute)
- Drain device and reset it
- With too many device resets / error increases, taint device and alert admin
- Drain all devices on same bus and reset bus
- Drain whole node and reset it
- Schedule device firmware update
- Schedule device replacement
- (First ones can be done by (kernel) driver automatically, last ones require admin)
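For illustration, a coarse reason taxonomy along the lines of the categories above could look like the Go sketch below. All of these names are made up for this comment; nothing like this is defined in the KEP or in any Kubernetes API yet.

```go
// DeviceHealthReason is a hypothetical, coarse-grained reason taxonomy sketched
// from the failure categories listed above. Illustrative only.
type DeviceHealthReason string

const (
	DeviceHealthReasonThermal      DeviceHealthReason = "Thermal"      // running too hot, fan not working
	DeviceHealthReasonMemoryErrors DeviceHealthReason = "MemoryErrors" // recurring non-recoverable ECC errors
	DeviceHealthReasonPower        DeviceHealthReason = "Power"        // power delivery issues or throttling
	DeviceHealthReasonLink         DeviceHealthReason = "Link"         // link quality / stability issues
	DeviceHealthReasonFirmware     DeviceHealthReason = "Firmware"     // old or buggy device firmware
	DeviceHealthReasonHang         DeviceHealthReason = "Hang"         // device hangs, rising driver/FW/HW error counters
)
```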
```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	//
	// +required
	State DeviceHealthState `json:"state"`
}
```
How about introducing a bit more info? I think we can borrow several fields from PodCondition. For example:
Suggested change:

```go
// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	//
	// +required
	State DeviceHealthState `json:"state"`

	// Reason is the reason of this device health. It could be helpful especially when the state is "Unhealthy".
	// +optional
	Reason string `json:"reason"`

	// LastTransitionTime is the last time the device health transitioned from one state to another.
	// +required
	LastTransitionTime string `json:"lastTransitionTime"`

	// LastReportedTime is the last reported time for the device health from the driver.
	// +required
	LastReportedTime string `json:"lastReportedTime"`
}
```
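If we wanted to align with standard condition handling, another variant could lean on apimachinery types. This is only a sketch to compare shapes, assuming the usual `metav1` import; none of these field names are settled in the KEP.

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Sketch of an alternative DeviceHealthStatus shape using apimachinery types.
// Purely illustrative; not part of the KEP.
type DeviceHealthStatus struct {
	// State is the overall health status of a device.
	// +required
	State DeviceHealthState `json:"state"`

	// Reason is a machine-readable, CamelCase reason for the current State.
	// +optional
	Reason string `json:"reason,omitempty"`

	// LastTransitionTime is the last time State changed.
	// +required
	LastTransitionTime metav1.Time `json:"lastTransitionTime"`

	// Conditions can carry finer-grained, driver-specific signals.
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```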
Yes, more info like this is necessary for this to be useful. I was hoping to get some feedback on if this high-level approach is worth pursuing or if one of the alternatives listed below is a better place to start getting into more of the details of the API.
Add TODO to fill out API
```go
}

// DeviceHealthStatus represents the health of a device as observed by the driver.
type DeviceHealthStatus struct {
```
The dra-health/v1alpha1 gRPC service implemented in #130606 already provides a stream of health updates from the DRA plugin to the Kubelet. This same gRPC service could be leveraged as the source of truth for populating this new `ResourceSlice.status` field.
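As a rough sketch of how that plumbing could look on the kubelet (or driver) side: the update type, the function, and the publish hook below are all hypothetical stand-ins, since the real message shapes come from the dra-health/v1alpha1 proto and the ResourceSlice status field does not exist yet.

```go
// healthUpdate is a stand-in for whatever the dra-health/v1alpha1 stream delivers
// per device; the real fields come from the gRPC proto.
type healthUpdate struct {
	Pool, Device string
	State        string // "Healthy", "Unhealthy", "Unknown"
}

// syncSliceStatus folds streamed health updates into a per-pool view that a
// publisher could then write into ResourceSlice.status. Hypothetical sketch only.
func syncSliceStatus(updates <-chan healthUpdate, publish func(pool string, health map[string]string)) {
	byPool := map[string]map[string]string{}
	for u := range updates {
		devices, ok := byPool[u.Pool]
		if !ok {
			devices = map[string]string{}
			byPool[u.Pool] = devices
		}
		if devices[u.Device] == u.State {
			continue // unchanged, skip the API write
		}
		devices[u.Device] = u.State
		publish(u.Pool, devices)
	}
}
```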
++
PR kubernetes/kubernetes#130606 introduced
```go
type DeviceHealthStatus string

const (
	// DeviceHealthStatusHealthy represents a healthy device.
	DeviceHealthStatusHealthy DeviceHealthStatus = "Healthy"
	// DeviceHealthStatusUnhealthy represents an unhealthy device.
	DeviceHealthStatusUnhealthy DeviceHealthStatus = "Unhealthy"
	// DeviceHealthStatusUnknown represents a device with unknown health status.
	DeviceHealthStatusUnknown DeviceHealthStatus = "Unknown"
)
```
@Jpsassine Does that status only surface for devices that are currently allocated to a Pod? If an unallocated device becomes unhealthy, is that visible anywhere in the Kubernetes API?
@nojnhuh Yes, the health status from my PR surfaces health only for devices that are currently allocated to a Pod, which is reported via the new `pod.status.containerStatuses.allocatedResourcesStatus` field.

However, it seems this KEP-5283 could address exactly that visibility gap by adding the health status of all devices to the ResourceSlice.

Regardless of what we surface today, the DRA plugins that implement the new `DRAResourceHealth` gRPC service will be streaming the health of all devices associated with them.
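For context, consuming that per-pod signal today looks roughly like the snippet below. It assumes the core/v1 types added for KEP-4680 (`AllocatedResourcesStatus`, `ResourceHealth`); double-check the exact field names against the current API before relying on them.

```go
import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// printUnhealthyDevices lists allocated devices a pod reports as unhealthy via
// pod.status.containerStatuses[].allocatedResourcesStatus (KEP-4680).
func printUnhealthyDevices(pod *v1.Pod) {
	for _, cs := range pod.Status.ContainerStatuses {
		for _, rs := range cs.AllocatedResourcesStatus {
			for _, rh := range rs.Resources {
				if rh.Health != v1.ResourceHealthStatusHealthy {
					fmt.Printf("container %s: %s device %s is %s\n", cs.Name, rs.Name, rh.ResourceID, rh.Health)
				}
			}
		}
	}
}
```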
@Jpsassine How might we expose the health of devices that are accessible from multiple Nodes, like network attached devices? Does the kubelet on each Node compute the health of the device separately? Is it possible that two Nodes might have differing opinions on the health of the same device? I'm wondering if this KEP would need to define a way to express the health of a device with respect to each Node that could attach it.
I believe the DRA driver is the source of truth for device health here, not the kubelet. In the architecture I implemented for KEP-4680, the kubelet acts as a client that consumes health status streamed from the node-local DRA plugin via the `DRAResourceHealth` gRPC service. This design inherently handles the possibility of differing health perspectives between nodes (although I don't see how there could be a legitimate discrepancy in the same device's health between nodes). Since a `ResourceSlice` is published by the DRA driver running on a specific node, the health status it contains would naturally reflect the device's condition from that node's perspective.

Example, assuming the device health reports are used to populate ResourceSlice device statuses:

- If a network-attached device is experiencing issues from Node A, the DRA driver on Node A would report it as `Unhealthy` in the `ResourceSlice` for that node.
- Simultaneously, if the same device is accessible from Node B, the driver on Node B would report it as `Healthy` in its `ResourceSlice`.

Although I think this would be odd, it shows that the current model should account for this scenario where one node reports the same device as healthy and another as unhealthy.

@SergeyKanzhelev, please correct me if I am wrong, but to the best of my understanding this is how device health works with DRA now.
> some action to restore the device to a healthy state. This KEP defines a
> standard way to determine whether or not a device is considered healthy.
This sounds like a bigger scope than the KEP title.
Suggested change:

> Some action is required to restore the device to a healthy state. This KEP proposes
> a new entry to the ResourceSlice object to allow DRA drivers to report whether or not a device is considered healthy.
I think I would prefer to change the title then if this statement is too far off from it.

@johnbelamaric Would it be appropriate to retitle #5283 to something like "DRA: Device Health Status" that doesn't imply any particular solution like adding a new `status` field in ResourceSlice, but is specific enough to differentiate it from #4680?
> - Define what constitutes a "healthy" or "unhealthy" device. That distinction is
>   made by each DRA driver.
This non-goal collides with what you said in the Summary section, line 184.
This non-goal is only saying that Kubernetes doesn't care about the underlying characteristics of a device that cause a driver to consider it healthy or not. The summary says cluster administrators are interested in identifying and remediating unhealthy devices. Are those at odds with each other?
IMHO those clearly conflict. To remediate, one needs to know what are the specific issues and their root causes.
This KEP describes where health information can be found and its general structure. DRA drivers populate that health information in the API. Cluster admins use that to help identify and remediate issues.
I don't see where any of that conflicts?
So you're saying that, in addition to the non-standard information the admin requires to actually do something about a health issue, there would be a standard health flag, which the admin would monitor to see whether there's a need to look further?

This immediately raises the question of who then decides and configures which conditions trigger such a flag.

Because if the flag is raised on things that are irrelevant to the admin, or it's not raised on things that the admin cares about, it's not really helping; the admin would need to follow the non-standard info anyway.
Add alternative for vendor-provided metrics
This is still technically "in-progress" in that it's not ready to merge right now, but I'm ready for early feedback on what's there now to help me fill out the rest of the KEP. Removing WIP.
> [taint](https://kep.k8s.io/5055) the devices. A standard representation of
> device health in the ResourceSlice API is needed to express the state of the
> devices in cases where the side effects (`NoSchedule`/`NoExecute`) of taints are
> not desired.
This should state what is needed other than `NoSchedule`/`NoExecute`. Something that has no effect and is just informational? If yes, that's what cluster device telemetry and metric alerts are for...
This section only mentions taints and tolerations as the most closely related mechanism at the moment to describe health. Proposing modifications to taints is a better fit for the Proposal or Alternatives sections than the Summary section.
No, what I meant is that this (or the new "No Side-Effects By Default" section) does not explain the motivation for this KEP, i.e. why a no-effect health flag is desirable (especially one with no details of what the actual health issues are).
> As a cluster administrator, I want to determine the overall health of the
> DRA-exposed devices available throughout my cluster.
This could be done with device taints and a tool listing them (e.g. listing all device taints, or ones matching a given pattern for nodes, devices, or taints). Or, if alerts are generated for those taints, by viewing them in the Alertmanager GUI.

=> The user story needs to be something where taints are not enough.
I don't think we need to change this story for the purpose of disqualifying taints if taints are otherwise an acceptable solution. The Motivation section does describe the need for a solution without side effects though, so I'll add a new story for that to avoid any single story describing too many things at once.
My comment was because the Motivation section states the need for a purely informational health flag, but does not state why it's needed, by whom, or what the use case is for such no-effect health information.

(If the device is still supposed to be used by all potential workloads, it doesn't seem particularly unhealthy?)
> The main cost of that flexibility is the lack of standardization, where cluster
> administrators have to track down from each vendor how to determine if a given
> device is in a healthy state as opposed to inspecting a well-defined area of a
> vendor-agnostic API like ResourceSlice. This lack of standardization also makes
> integrations like generic controllers that automatically taint unhealthy devices
> less straightforward to implement.
There is an OpenTelemetry standard for the metrics: https://opentelemetry.io/docs/specs/semconv/

(One of the goals of that standardization is providing e.g. drill-down support from whole-node power usage to the power usage of individual components inside that.)

Admittedly it's still rather WIP with regard to health-related device metrics: https://opentelemetry.io/docs/specs/semconv/system/hardware-metrics/

See my list above and e.g.:
- hw.host.power/energy versus hw.power/energy metrics open-telemetry/semantic-conventions#1055
- Issues with Hardware Metrics semantic conventions open-telemetry/semantic-conventions#940

Device telemetry stacks provided by the vendors most likely haven't adopted it yet either...
Even with a standard way to determine certain values like fan speed or battery level, vendors need to document what those mean w.r.t. how healthy a device is. I think that's an acceptable way to consider implementing this KEP, but it is a step down in some ways from including an overall "healthy"/"unhealthy" signal that could be identical for every kind of device.
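To illustrate the value of a uniform signal, a generic remediation controller keyed only on that signal might look roughly like the sketch below. All of the types and helpers here are hypothetical, since neither a ResourceSlice status field nor its integration with device taints is settled yet.

```go
// deviceHealth is a hypothetical, vendor-agnostic view of one device's health
// as a generic controller might read it from a future ResourceSlice status.
type deviceHealth struct {
	Driver, Pool, Device string
	State                string // "Healthy", "Unhealthy", "Unknown"
}

// reconcile taints every device reported Unhealthy, regardless of vendor, using
// injected helpers so no vendor-specific knowledge is needed. Sketch only.
func reconcile(list func() []deviceHealth, taint func(d deviceHealth) error) error {
	for _, d := range list() {
		if d.State != "Unhealthy" {
			continue
		}
		if err := taint(d); err != nil {
			return err
		}
	}
	return nil
}
```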
Some information / metrics can be rather self-evident (e.g. fan failed). As to the rest of the metrics, you may have a somewhat optimistic view of how much vendors (k8s driver developers) know of their health impact.

How a given set of (less obvious) metrics maps to the long-term health of a given device, and at what probability over what time interval, is information that's more likely to be in the possession of large cluster operators and their admins.

(HW vendors do not constantly run production workloads in huge clusters and collect metrics & health statistics of how they behave; their customers do that, and I suspect they're unlikely to share that info with anybody, even their HW vendor, except to fix specific issues, and maybe only with a specific team / persons.)
Add user story for purely informational status
Simplify wording in metrics alternative
/cc @johnbelamaric