[KEP-4680] Add configurable health check timeout to DeviceHealth gRPC message
- Added health_check_timeout_seconds field to DeviceHealth message
- Updated documentation to reflect that timeout is now configurable per device
- Changed Beta graduation criteria from 'implement' to 'verify' since feature is now included in initial design
- Addresses PR feedback about DRA API for timeout configuration
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@@ -310,6 +311,13 @@ optional, proactive health reporting mechanism from DRA plugins.
 will be responsible for reconciling the state reported by the plugin, handling
 timeouts for stale data (marking devices as "Unknown" if not updated
 within a certain period), and persisting this information across Kubelet restarts.
+
+**Note:** The timeout for marking a device's health as "Unknown" can be
+configured per device via the `health_check_timeout_seconds` field in the
+`DeviceHealth` message. If not specified, Kubelet will use a default timeout
+of 30 seconds. This addresses [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118),
+allowing different hardware types (e.g., GPUs, FPGAs, TPUs, storage) to specify
+appropriate timeout values based on their health-reporting characteristics.
 
 3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
 Upon plugin registration, it will attempt to initiate the health monitoring
@@ -368,6 +376,10 @@ message DeviceHealth {
     // Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
     // Required.
     int64 last_updated_timestamp = 4;
+    // Health check timeout duration in seconds for this device.
+    // If not specified or zero, Kubelet will use a default timeout.
+    // Optional.
+    int64 health_check_timeout_seconds = 5;
 }
 ```
@@ -448,6 +460,7 @@ Planned tests will cover the user-visible behavior of the feature:
 #### Beta
 
 - Complete e2e tests coverage
+- Verify configurable device health check timeout implementation works correctly across different plugin vendors (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))