Commit 4b35dea

Merge pull request #5302 from Jpsassine/patch-1
KEP 4680: Update README.md for DRA
2 parents afc3073 + b1dc962 commit 4b35dea

File tree

1 file changed

+121
-61
lines changed
  • keps/sig-node/4680-add-resource-health-to-pod-status

keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 121 additions & 61 deletions
Original file line number | Diff line number | Diff line change
@@ -15,6 +15,8 @@
1515
- [Design Details](#design-details)
1616
- [Device Plugin implementation details](#device-plugin-implementation-details)
1717
- [DRA implementation details](#dra-implementation-details)
18+
- [High-Level Architectural Approach for DRA Health](#high-level-architectural-approach-for-dra-health)
19+
- [gRPC API for DRA Device Health](#grpc-api-for-dra-device-health)
1820
- [Test Plan](#test-plan)
1921
- [Prerequisite testing updates](#prerequisite-testing-updates)
2022
- [Unit tests](#unit-tests)
@@ -285,54 +287,92 @@ We should consider introducing another field to the Status that will be a free f
285287

286288
### DRA implementation details
287289

288-
Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
289-
type `NamedResourcesInstance` (from [pkg/apis/resource/namedresources.go](https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/apis/resource/namedresources.go#L29-L37)) to include the Health field the same way it is done in
290-
the Device Plugin as well as a device ID.
291-
292-
In `1.30` we had a `ListAndWatch()` API similar to the one in DevicePlugin, from which we could have inferred something very analogous to the above. However, we are removing this in `1.31`, so we will need to provide something different.
293-
294-
An optional gRPC interface will be created, so DRA drivers can opt into this by implementing it. The interface will allow a plugin to stream health status information in the form of deviceIDs (of the form `<driver name>/<pool name>/<device name>`) along with extra metadata indicating its health status. Just as before, a device completely disappearing would still need to trigger some state change, but now more detailed information could be attached in the form of metadata when a device isn't necessarily gone, but also isn't operating as it should be.
295-
296-
The API will be limited to "prepared" devices and include the claim `name/namespace/UID`. That should be enough information for kubelet to correlate with the pods for which the claim was prepared and then post that information for those pods.
297-
298-
Kubelet will react to this field the same way as we propose to do it for the Device Plugin.
299-
300-
The new method will be added to the same gRPC server that serves the [Node service](https://github.com/kubernetes/kubernetes/blob/04bba3c222bb2c5b1b1565713de4bf334ee7fbe4/staging/src/k8s.io/kubelet/pkg/apis/dra/v1alpha4/api.proto#L34) interface (Node service exposes
301-
`NodePrepareResources` and `NodeUnprepareResources`). The new interface will have a `Device` structure similar to Node Service's device, with the added `health` field:
302-
303-
``` proto
290+
Today DRA does not return the health of the device back to kubelet. In `1.30`
291+
we had a `ListAndWatch()` API similar to DevicePlugin, from which we could
292+
have inferred device health. However, this API is being removed in `1.31`,
293+
necessitating a new approach for health monitoring.
294+
295+
The following design outlines how Kubelet will obtain health information
296+
from DRA plugins and use it to update the PodStatus. This design focuses on an
297+
optional, proactive health reporting mechanism from DRA plugins.
298+
299+
#### High-Level Architectural Approach for DRA Health
300+
301+
1. **Optional gRPC Stream:** A new, optional gRPC service for health monitoring
302+
will be defined. DRA plugins can implement this service to proactively send
303+
health updates for their managed devices to Kubelet. It will expose a
304+
server-streaming RPC that allows the plugin to send a complete list of
305+
device health states whenever a change occurs. If a plugin does not
306+
implement this service, the health of its devices will be reported as "Unknown".
307+
308+
2. **Health Information Cache:** Kubelet's DRA Manager will maintain a
309+
persistent cache of device health information. This cache will store the
310+
latest known health status (e.g., Healthy, Unhealthy, Unknown) and a
311+
timestamp for each device, keyed by driver and device identifiers. The cache
312+
will be responsible for reconciling the state reported by the plugin, handling
313+
timeouts for stale data (marking devices as "Unknown" if not updated
314+
within a certain period), and persisting this information across Kubelet restarts.
315+
316+
3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
317+
Upon plugin registration, it will attempt to initiate the health monitoring
318+
stream. If successful, it will consume the health updates, update its
319+
internal health cache, and identify which Pods are affected by any
320+
reported health changes. For seamless plugin upgrades, where multiple
321+
instances of a plugin might run concurrently, the Kubelet will always
322+
watch the most recently registered plugin for health updates.
323+
324+
4. **PodStatus Update:** When health changes for a device are detected, the DRA manager
325+
will trigger an update for the affected Pods. Kubelet's main pod synchronization
326+
logic will then read the current health status for the Pod's allocated DRA devices
327+
from the health cache and populate the `AllocatedResourcesStatus` field in the
328+
PodStatus with the correct health information.
329+
330+
*Note: Kubelet will only use this health information to update the Pod
331+
Status. The DRA plugin remains responsible for other actions, such as tainting
332+
ResourceSlices to prevent scheduling on unhealthy resources.*
333+
334+
#### gRPC API for DRA Device Health
335+
336+
A new gRPC service, `NodeHealth`, will be introduced in a new API group (e.g., `dra-health/v1alpha1`) to keep it separate from the core DRA API and signify its optionality.
337+
338+
The service will define a `WatchResources` RPC:
339+
340+
```proto
304341
service NodeHealth {
305-
...
306-
307-
// WatchDevicesStatus returns a stream of List of Devices
308-
// Whenever a Device state change or a Device disappears, WatchDevicesStatus
309-
// returns the new list.
310-
// This method is optional and may not be implemented.
311-
rpc WatchDevicesStatus(Empty) returns (stream DevicesStatusResponse) {}
342+
// WatchResources allows a DRA plugin to stream health updates for its devices to Kubelet.
343+
// Kubelet calls this method, and the plugin streams responses.
344+
// This method is optional; if not implemented by a plugin, Kubelet will assume
345+
// devices managed by that plugin have an "Unknown" health status.
346+
rpc WatchResources(WatchResourcesRequest) returns (stream WatchResourcesResponse) {}
312347
}
313348
314-
// ListAndWatch returns a stream of List of Devices
315-
// Whenever a Device state change or a Device disappears, ListAndWatch
316-
// returns the new list
317-
message DevicesStatusResponse {
318-
repeated Device devices = 1;
349+
message WatchResourcesRequest {
350+
// Reserved for future use, e.g., filtering or options.
319351
}
320352
321-
message Device {
322-
... existing fields ...
323-
// The device itself. Required.
324-
string device_name = 3;
325-
... existing fields ...
353+
message WatchResourcesResponse {
354+
// A list of all devices managed by the plugin for which health is being reported.
355+
// This should be a complete list for the driver; Kubelet will reconcile this state.
356+
repeated DeviceHealth devices = 1;
357+
}
326358
327-
// Health of the device, can be Healthy or Unhealthy.
328-
string Health = 5;
359+
message DeviceHealth {
360+
// The name of the resource pool this device belongs to.
361+
// Required.
362+
string pool_name = 1;
363+
// The unique name of the device within the pool.
364+
// Required.
365+
string device_name = 2;
366+
// Health status of the device.
367+
// Expected values: "Healthy", "Unhealthy", "Unknown".
368+
// Required.
369+
string health_status = 3;
370+
// Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
371+
// Required.
372+
int64 last_updated_timestamp = 4;
329373
}
330374
```
331375

332-
Implementation will ignore the `Unimplemented` error when the DRA plugin doesn't have this interface implemented.
333-
334-
Note, the gRPC details are still subject to change and will go through API review during the implementation.
335-
336376
### Test Plan
337377

338378
[X] I/we understand the owners of the involved components may require updates to
@@ -341,38 +381,58 @@ to implement this enhancement.
341381

342382
##### Prerequisite testing updates
343383

344-
Device Plugin and DRA are relatively new features and have a reasonable test coverage.
384+
The existing test coverage for Device Manager and DRA will be used as a baseline. New code introduced by this KEP will include thorough unit tests to maintain or improve coverage.
345385

346386
##### Unit tests
347387

348-
- `k8s.io/kubernetes/pkg/kubelet/cm/devicemanager`: `5/31/2024` - `84.1`
349-
- `k8s.io/kubernetes/pkg/kubelet/cm/dra`: `5/31/2024` - `59.2`
350-
- `k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin`: `5/31/2024` - `34`
351-
- `k8s.io/kubernetes/pkg/kubelet/cm/dra/state`: `5/31/2024` - `98`
388+
Current coverage for the relevant packages (as of June 2025):
389+
- `k8s.io/kubernetes/pkg/kubelet/cm/devicemanager`: `84.8%`
390+
- `k8s.io/kubernetes/pkg/kubelet/cm/dra`: `79.8%`
391+
- `k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin`: `84.0%`
392+
- `k8s.io/kubernetes/pkg/kubelet/cm/dra/state`: `46.2%`
393+
394+
The new DRA health monitoring logic will have thorough unit test coverage, including:
395+
396+
- **Health Information Cache Logic:**
397+
- Cache initialization from scratch and from a checkpoint file.
398+
- State reconciliation of device health based on plugin reports.
399+
- Correct handling of `LastUpdated` timestamps.
400+
- Marking devices as "Unknown" after a timeout period.
401+
- Correctly identifying which devices have changed health status.
402+
- Accurate retrieval of health status for existing, timed-out, and non-existent devices.
403+
- Proper cleanup of a driver's health data upon its deregistration.
404+
- Persistence logic for saving to and loading from the checkpoint file.
405+
- **Plugin Registration and gRPC Stream Handling:**
406+
- Verification of successful health stream startup and background processing.
407+
- Graceful handling of plugins that do not implement the health monitoring service (`Unimplemented` error).
408+
- Correct cancellation of the health stream when a plugin is replaced or deregistered.
409+
- Error handling during stream initiation and message reception.
410+
- **DRA Manager Logic:**
411+
- Correct processing of health update messages from the gRPC stream.
412+
- Accurate identification of Pods affected by a health change.
413+
- Properly sending update notifications for affected Pods.
414+
- Correct population of the `AllocatedResourcesStatus` field in the Pod's status object.
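
As an illustration of the "identification of Pods affected by a health change" case listed above, the following Go sketch shows one way such a lookup could work. The `affectedPods` helper, the `DeviceKey` type, and the allocation map keyed by Pod UID are assumptions made for this example and are not taken from the kubelet code.

```go
package main

import (
	"fmt"
	"sort"
)

// DeviceKey identifies a device by driver, pool, and device name (hypothetical type).
type DeviceKey struct {
	Driver, Pool, Device string
}

// affectedPods returns the sorted UIDs of Pods that use at least one changed device.
// allocations maps a Pod UID to the devices prepared for that Pod.
func affectedPods(allocations map[string][]DeviceKey, changed []DeviceKey) []string {
	changedSet := make(map[DeviceKey]struct{}, len(changed))
	for _, k := range changed {
		changedSet[k] = struct{}{}
	}
	var pods []string
	for podUID, devices := range allocations {
		for _, d := range devices {
			if _, ok := changedSet[d]; ok {
				pods = append(pods, podUID)
				break // one matching device is enough for this Pod
			}
		}
	}
	sort.Strings(pods) // map iteration order is random; sort for stable output
	return pods
}

func main() {
	gpu0 := DeviceKey{Driver: "gpu.example.com", Pool: "pool-a", Device: "gpu-0"}
	gpu1 := DeviceKey{Driver: "gpu.example.com", Pool: "pool-a", Device: "gpu-1"}
	allocations := map[string][]DeviceKey{
		"pod-a": {gpu0},
		"pod-b": {gpu1},
		"pod-c": {gpu0, gpu1},
	}
	fmt.Println(affectedPods(allocations, []DeviceKey{gpu0})) // prints "[pod-a pod-c]"
}
```

A unit test for this behavior would likely be table-driven, covering no changed devices, a device shared by several Pods, and a changed device that no Pod uses.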
352415

353416
##### Integration tests
354417

355418
N/A
356419

357420
##### e2e tests
358421

359-
Planned tests:
360-
361-
- Device marked unhealthy - the state is reflected in pod status
362-
- Device marked unhealthy and back to healthy after some time - pod status was changed to unhealthy temporarily
363-
- Device marked as unhealthy and back to healthy in quick succession - pod status reflects the latest health status
364-
- Pod failed due to unhealthy device, earlier than device plugin detected it. Pod status is still updated.
365-
- Pod is in crash loop backoff due to unhealthy device - pod status is updated to unhealthy
366-
367-
For alpha rollout and rollback:
368-
369-
- Fields dropped on update when feature gate is disabled
370-
- Field is not populated after the feature gate is disabled
371-
- Field is populated again when the feature gate is enabled
372-
373-
Test coverage will be listed once tests are implemented.
374-
375-
- <test>: <link to test coverage>
422+
Planned tests will cover the user-visible behavior of the feature:
423+
424+
- **Basic Health Reporting:**
425+
- Verify that when a DRA plugin reports a device as unhealthy, the PodStatus is updated to reflect this.
426+
- Verify that when the device becomes healthy again, the PodStatus is correctly updated.
427+
- **State Transitions:**
428+
- Test rapid health state changes (e.g., unhealthy to healthy and back) to ensure the final PodStatus reflects the latest state.
429+
- **Failure Scenarios:**
430+
- Ensure that if a Pod fails *before* the plugin detects the unhealthy device, the PodStatus is still updated with the health information afterward.
431+
- Verify that a Pod in a `CrashLoopBackOff` state due to an unhealthy device correctly shows the device's unhealthy status.
432+
- **Feature Gate Behavior (for Alpha):**
433+
- When the feature gate is disabled, verify that the `AllocatedResourcesStatus` field is not populated by the DRA manager.
434+
- When the feature gate is disabled on an existing cluster, verify that existing health information is gracefully ignored or removed on the next Pod update.
435+
- When the feature gate is re-enabled, verify that health reporting resumes correctly.
376436

377437
### Graduation Criteria
378438
