@@ -285,54 +287,92 @@ We should consider introducing another field to the Status that will be a free f
285
287
286
288
### DRA implementation details
287
289
288
-
Today DRA does not return the health of the device back to kubelet. The proposal is to extend the
289
-
type `NamedResourcesInstance` (from [pkg/apis/resource/namedresources.go](https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/apis/resource/namedresources.go#L29-L37)) to include the Health field the same way it is done in
290
-
the Device Plugin as well as a device ID.
291
-
292
-
In `1.30` we had a similar `ListAndWatch()` API as in DevicePlugin, from which we could have inferred something very analogous to the above. However, we are removing this in `1.31`, so will need to provide something different.
293
-
294
-
An optional gRPC interface will be created, so DRA drivers can opt into this by implementing it. The interface will allow a plugin to stream health status information in the form of deviceIDs (of the form `<driver name>/<pool name>/<device name>`) along with extra metadata indicating its health status. Just as before, a device completely disappearing would still need to trigger some state change, but now more detailed information could be attached in the form of metadata when a device isn't necessarily gone, but also isn't operating as it should be.
295
-
296
-
The API will be limited to "prepared" devices and include the claim `name/namespace/UID`. That should be enough information for kubelet to correlate with the pods for which the claim was prepared and then post that information for those pods.
297
-
298
-
Kubelet will react on this field the same way as we propose to do it for the Device Plugin.
299
-
300
-
The new method will be added to the same gRPC server that serves the [Node service](https://github.com/kubernetes/kubernetes/blob/04bba3c222bb2c5b1b1565713de4bf334ee7fbe4/staging/src/k8s.io/kubelet/pkg/apis/dra/v1alpha4/api.proto#L34) interface (Node service exposes
301
-
`NodePrepareResources` and `NodeUnprepareResources`). The new interface will have a `Device` structure similar to Node Service's device, with the added `health` field:
302
-
303
-
```proto
290
+
Today DRA does not return the health of the device back to kubelet. In `1.30`
291
+
we had a `ListAndWatch()` API similar to DevicePlugin, from which we could
292
+
have inferred device health. However, this API is being removed in `1.31`,
293
+
necessitating a new approach for health monitoring.
294
+
295
+
The following design outlines how Kubelet will obtain health information
296
+
from DRA plugins and use it to update the PodStatus. This design focuses on an
297
+
optional, proactive health reporting mechanism from DRA plugins.
298
+
299
+
#### High-Level Architectural Approach for DRA Health
300
+
301
+
1. **Optional gRPC Stream:** A new, optional gRPC service for health monitoring
302
+
will be defined. DRA plugins can implement this service to proactively send
303
+
health updates for their managed devices to Kubelet. It will expose a
304
+
server-streaming RPC that allows the plugin to send a complete list of
305
+
device health states whenever a change occurs. If a plugin does not
306
+
implement this service, the health of its devices will be reported as "Unknown".
307
+
308
+
2. **Health Information Cache:** Kubelet's DRA Manager will maintain a
309
+
persistent cache of device health information. This cache will store the
310
+
latest known health status (e.g., Healthy, Unhealthy, Unknown) and a
311
+
timestamp for each device, keyed by driver and device identifiers. The cache
312
+
will be responsible for reconciling the state reported by the plugin, handling
313
+
timeouts for stale data (marking devices as "Unknown" if not updated
314
+
within a certain period), and persisting this information across Kubelet restarts.
315
+
316
+
3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
317
+
Upon plugin registration, it will attempt to initiate the health monitoring
318
+
stream. If successful, it will consume the health updates, update its
319
+
internal health cache, and identify which Pods are affected by any
320
+
reported health changes. For seamless plugin upgrades, where multiple
321
+
instances of a plugin might run concurrently, the Kubelet will always
322
+
watch the most recently registered plugin for health updates.
323
+
324
+
4. **PodStatus Update:** When health changes for a device are detected, the DRA manager
325
+
will trigger an update for the affected Pods. Kubelet's main pod synchronization
326
+
logic will then read the current health status for the Pod's allocated DRA devices
327
+
from the health cache and populate the `AllocatedResourcesStatus` field in the
328
+
PodStatus with the correct health information.
329
+
330
+
*Note: Kubelet will only use this health information to update the Pod
331
+
Status. The DRA plugin remains responsible for other actions, such as tainting
332
+
ResourceSlices to prevent scheduling on unhealthy resources.*
333
+
334
+
#### gRPC API for DRA Device Health
335
+
336
+
A new gRPC service, `NodeHealth`, will be introduced in a new API group (e.g., `dra-health/v1alpha1`) to keep it separate from the core DRA API and signify its optionality.
337
+
338
+
The service will define a `WatchResources` RPC:
339
+
340
+
```proto
304
341
service NodeHealth {
305
-
...
306
-
307
-
// WatchDevicesStatus returns a stream of List of Devices
308
-
// Whenever a Device state change or a Device disappears, WatchDevicesStatus
309
-
// returns the new list.
310
-
// This method is optional and may not be implemented.
// Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
371
+
// Required.
372
+
int64 last_updated_timestamp = 4;
329
373
}
330
374
```
331
375
332
-
Implementation will ignore the `Unimplemented` error when the DRA plugin doesn't have this interface implemented.
333
-
334
-
Note, the gRPC details are still a subject to change and will go thru API review during the implementation.
335
-
336
376
### Test Plan
337
377
338
378
[X] I/we understand the owners of the involved components may require updates to
@@ -341,38 +381,58 @@ to implement this enhancement.
341
381
342
382
##### Prerequisite testing updates
343
383
344
-
Device Plugin and DRA are relatively new features and have a reasonable test coverage.
384
+
The existing test coverage for Device Manager and DRA will be used as a baseline. New code introduced by this KEP will include thorough unit tests to maintain or improve coverage.
The new DRA health monitoring logic will have thorough unit test coverage, including:
395
+
396
+
- **Health Information Cache Logic:**
397
+
- Cache initialization from scratch and from a checkpoint file.
398
+
- State reconciliation of device health based on plugin reports.
399
+
- Correct handling of `LastUpdated` timestamps.
400
+
- Marking devices as "Unknown" after a timeout period.
401
+
- Correctly identifying which devices have changed health status.
402
+
- Accurate retrieval of health status for existing, timed-out, and non-existent devices.
403
+
- Proper cleanup of a driver's health data upon its deregistration.
404
+
- Persistence logic for saving to and loading from the checkpoint file.
405
+
- **Plugin Registration and gRPC Stream Handling:**
406
+
- Verification of successful health stream startup and background processing.
407
+
- Graceful handling of plugins that do not implement the health monitoring service (`Unimplemented` error).
408
+
- Correct cancellation of the health stream when a plugin is replaced or deregistered.
409
+
- Error handling during stream initiation and message reception.
410
+
- **DRA Manager Logic:**
411
+
- Correct processing of health update messages from the gRPC stream.
412
+
- Accurate identification of Pods affected by a health change.
413
+
- Properly sending update notifications for affected Pods.
414
+
- Correct population of the `AllocatedResourcesStatus` field in the Pod's status object.
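
As an illustration of the "identification of Pods affected by a health change" item, the following Go sketch maps a set of changed devices to the Pods that need a status update. It is a sketch under stated assumptions, not the DRA manager's code: `podIndex` and `affectedPods` are hypothetical names, and the index from device to Pod UIDs is assumed to be built elsewhere from prepared claims. Device IDs use the `<driver name>/<pool name>/<device name>` form.

```go
package main

import (
	"fmt"
	"sort"
)

// DeviceID has the form "<driver name>/<pool name>/<device name>".
type DeviceID string

// podIndex is a hypothetical lookup from a prepared device to the
// UIDs of the Pods whose claims reference it.
type podIndex map[DeviceID][]string

// affectedPods returns the de-duplicated, sorted UIDs of all Pods
// that use at least one of the changed devices.
func affectedPods(idx podIndex, changed []DeviceID) []string {
	seen := map[string]struct{}{}
	for _, d := range changed {
		for _, uid := range idx[d] {
			seen[uid] = struct{}{}
		}
	}
	out := make([]string, 0, len(seen))
	for uid := range seen {
		out = append(out, uid)
	}
	sort.Strings(out) // stable order simplifies assertions in tests
	return out
}

func main() {
	idx := podIndex{
		"gpu.example.com/node-1/gpu-0": {"pod-a", "pod-b"},
		"gpu.example.com/node-1/gpu-1": {"pod-b"},
	}
	changed := []DeviceID{
		"gpu.example.com/node-1/gpu-0",
		"gpu.example.com/node-1/gpu-1",
	}
	fmt.Println(affectedPods(idx, changed)) // [pod-a pod-b]
}
```

De-duplicating here means a Pod that uses several changed devices receives one update notification rather than one per device.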
352
415
353
416
##### Integration tests
354
417
355
418
N/A
356
419
357
420
##### e2e tests
358
421
359
-
Planned tests:
360
-
361
-
- Device marked unhealthy - the state is reflected in pod status
362
-
- Device marked unhealthy and back to healthy after some time - pod status was changed to unhealthy temporarily
363
-
- Device marked as unhealthy and back to healthy in quick succession - pod status reflects the latest health status
364
-
- Pod failed due to unhealthy device, earlier than device plugin detected it. Pod status is still updated.
365
-
- Pod is in crash loop backoff due to unhealthy device - pod status is updated to unhealthy
366
-
367
-
For alpha rollout and rollback:
368
-
369
-
- Fields dropped on update when feature gate is disabled
370
-
- Field is not populated after the feature gate is disabled
371
-
- Field is populated again when the feature gate is enabled
372
-
373
-
Test coverage will be listed once tests are implemented.
374
-
375
-
- <test>: <linktotestcoverage>
422
+
Planned tests will cover the user-visible behavior of the feature:
423
+
424
+
- **Basic Health Reporting:**
425
+
- Verify that when a DRA plugin reports a device as unhealthy, the PodStatus is updated to reflect this.
426
+
- Verify that when the device becomes healthy again, the PodStatus is correctly updated.
427
+
- **State Transitions:**
428
+
- Test rapid health state changes (e.g., unhealthy to healthy and back) to ensure the final PodStatus reflects the latest state.
429
+
- **Failure Scenarios:**
430
+
- Ensure that if a Pod fails *before* the plugin detects the unhealthy device, the PodStatus is still updated with the health information afterward.
431
+
- Verify that a Pod in a `CrashLoopBackOff` state due to an unhealthy device correctly shows the device's unhealthy status.
432
+
- **Feature Gate Behavior (for Alpha):**
433
+
- When the feature gate is disabled, verify that the `AllocatedResourcesStatus` field is not populated by the DRA manager.
434
+
- When the feature gate is disabled on an existing cluster, verify that existing health information is gracefully ignored or removed on the next Pod update.
435
+
- When the feature gate is re-enabled, verify that health reporting resumes correctly.