Commit 8279c43

[KEP-4680] Add configurable health check timeout to DeviceHealth gRPC message
- Added health_check_timeout_seconds field to DeviceHealth message
- Updated documentation to reflect that timeout is now configurable per device
- Changed Beta graduation criteria from 'implement' to 'verify' since feature is now included in initial design
- Addresses PR feedback about DRA API for timeout configuration

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1 parent 9177d4a commit 8279c43

1 file changed, +19 -6 lines changed
keps/sig-node/4680-add-resource-health-to-pod-status/README.md

Lines changed: 19 additions & 6 deletions
@@ -259,12 +259,13 @@ We may consider this as a future improvement.

 ### Notes/Constraints/Caveats (Optional)

-<!--
-What are the caveats to the proposal?
-What are some important details that didn't come across above?
-Go in to as much detail as necessary here.
-This might be a good place to talk about core concepts and how they relate.
--->
+- **DRA Device Health Timeout Configuration:** The timeout for marking a DRA device's health as "Unknown"
+when no updates are received can be configured per device through the `health_check_timeout_seconds` field
+in the `DeviceHealth` message. This allows different hardware types (e.g., GPUs, FPGAs, TPUs, storage devices)
+to specify appropriate timeout values based on their health-reporting characteristics. If not specified,
+Kubelet will use a default timeout of 30 seconds. This addresses
+[Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118) and the discussion in
+[PR #130606](https://github.com/kubernetes/kubernetes/pull/130606/files#r2221829511).

 ### Risks and Mitigations

@@ -310,6 +311,13 @@ optional, proactive health reporting mechanism from DRA plugins.
 will be responsible for reconciling the state reported by the plugin, handling
 timeouts for stale data (marking devices as "Unknown" if not updated
 within a certain period), and persisting this information across Kubelet restarts.
+
+**Note:** The timeout for marking a device's health as "Unknown" can be
+configured per device via the `health_check_timeout_seconds` field in the
+`DeviceHealth` message. If not specified, Kubelet will use a default timeout
+of 30 seconds. This addresses [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118),
+allowing different hardware types (e.g., GPUs, FPGAs, TPUs, storage) to specify
+appropriate timeout values based on their health-reporting characteristics.

 3. **Kubelet Integration:** The DRA Manager in Kubelet will act as the gRPC client.
 Upon plugin registration, it will attempt to initiate the health monitoring
@@ -368,6 +376,10 @@ message DeviceHealth {
 // Timestamp of when this health status was last determined by the plugin, as a Unix timestamp (seconds).
 // Required.
 int64 last_updated_timestamp = 4;
+// Health check timeout duration in seconds for this device.
+// If not specified or zero, Kubelet will use a default timeout.
+// Optional.
+int64 health_check_timeout_seconds = 5;
 }
 ```

@@ -448,6 +460,7 @@ Planned tests will cover the user-visible behavior of the feature:
 #### Beta

 - Complete e2e tests coverage
+- Verify configurable device health check timeout implementation works correctly across different plugin vendors (see [Issue #133118](https://github.com/kubernetes/kubernetes/issues/133118))

 #### GA
