Commit 2d6e1d4

Add mention about scheduler concerns and PreBind timeout
1 parent 726f0c9 commit 2d6e1d4

2 files changed: +19 -5 lines changed
keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

Lines changed: 14 additions & 3 deletions
@@ -92,6 +92,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [DRA Scheduler Plugin Design Overview](#dra-scheduler-plugin-design-overview)
 - [Device Attribute Additions](#device-attribute-additions)
 - [<code>AllocatedDeviceStatus</code> Additions](#allocateddevicestatus-additions)
+- [PreBind Phase Timeout](#prebind-phase-timeout)
 - [Handling ResourceSlices Upon Failure of Attachment](#handling-resourceslices-upon-failure-of-attachment)
 - [Composable Controller Design Overview](#composable-controller-design-overview)
 - [Test Plan](#test-plan)
@@ -354,13 +355,23 @@ const(
 )
 ```
 
-When `kubernetes.io/needs-attaching: true` is set, the scheduler is expected to do the following:
+When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following:
 
 1. Set `AllocatedDeviceStatus.NodeName`.
 2. Add an `AllocatedDeviceStatus` with a condition of `Type: kubernetes.io/needs-attaching` and `Status: True`.
 3. Wait for a condition with `Type: kubernetes.io/is-attached` and `Status: True` in `PreBind` before proceeding.
 4. Give up when seeing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
 
+Note: There is a concern that the in-flight events cache may grow too large when waiting in PreBind.
+This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
+
+#### PreBind Phase Timeout
+
+If the device attachment is successful, we expect it to take no longer than 5 minutes.
+Therefore, if we set a fixed timeout for the scheduler, we would like to set it to 10 minutes.
+
+Even if the conditions `Type: kubernetes.io/is-attached` or `Type: kubernetes.io/attach-failed` are not updated, setting a timeout will prevent the scheduler from waiting indefinitely in the PreBind phase.
+
 #### Handling ResourceSlices Upon Failure of Attachment
 
 During the scheduling cycle, the DRA plugin reserves a `ResourceSlice` for the `ResourceClaim`.
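
Reader's note on the hunk above: the PreBind wait it describes (proceed on `kubernetes.io/is-attached`, give up on `kubernetes.io/attach-failed`, stop after a fixed timeout) might look roughly like the following Go sketch. This is not the KEP's implementation. `Condition`, `evaluate`, `waitForAttachment`, and the polling loop are hypothetical and use only the standard library; a real plugin would more likely react to watch events on the `ResourceClaim` and would use the 10-minute value from the text rather than the millisecond values used in the demo.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Condition types quoted from the KEP text.
const (
	condIsAttached   = "kubernetes.io/is-attached"
	condAttachFailed = "kubernetes.io/attach-failed"
)

// Condition is a hypothetical stand-in for one condition entry
// on an AllocatedDeviceStatus.
type Condition struct {
	Type   string
	Status string
}

var (
	errAttachFailed   = errors.New("device attachment failed")
	errStoppedWaiting = errors.New("stopped waiting for device attachment")
)

// evaluate reports whether the wait can end and with which outcome.
// Assumption: a failed condition takes precedence over an attached one.
func evaluate(conds []Condition) (bool, error) {
	attached := false
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case condAttachFailed:
			return true, errAttachFailed
		case condIsAttached:
			attached = true
		}
	}
	return attached, nil
}

// waitForAttachment polls get until the device is attached, the attachment
// fails, or the timeout elapses (or the parent context is cancelled).
func waitForAttachment(ctx context.Context, get func() []Condition, poll, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		if done, err := evaluate(get()); done {
			return err
		}
		select {
		case <-ctx.Done():
			return errStoppedWaiting
		case <-ticker.C:
		}
	}
}

func main() {
	calls := 0
	attachOnThirdPoll := func() []Condition {
		calls++
		if calls < 3 {
			return nil
		}
		return []Condition{{Type: condIsAttached, Status: "True"}}
	}
	err := waitForAttachment(context.Background(), attachOnThirdPoll, 5*time.Millisecond, 200*time.Millisecond)
	fmt.Println("attached:", err == nil)

	neverUpdated := func() []Condition { return nil }
	err = waitForAttachment(context.Background(), neverUpdated, 5*time.Millisecond, 30*time.Millisecond)
	fmt.Println("stopped waiting:", errors.Is(err, errStoppedWaiting))
}
```

The second call in `main` covers the case the new section calls out: neither condition is ever updated, and the timeout alone ends the wait.
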
@@ -370,9 +381,9 @@ If a fabric device is selected, the scheduler waits for the device attachment du
 The composable controller performs the attachment operation by checking the flag of the `ResourceClaim`.
 If the attachment fails, the following steps are taken:
 
-1. **Update ResourceClaim**: The composable controller updates the `ResourceClaim` to indicate the failure of the attachment by setting a condition with `Type: kubernetes.io/is-attached` and `Status: False`.
+1. **Update ResourceClaim**: The composable controller updates the `AllocatedDeviceStatus` to indicate the failure of the attachment by setting a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
 2. **Fail the Binding Cycle**: The scheduler detects the failed attachment condition and fails the binding cycle. This prevents the pod from proceeding with an unattached device.
-3. **Unbind ResourceClaim and ResourceSlice**: The scheduler unbinds the `ResourceClaim` and `ResourceSlice`, clearing the allocation to prevent the fabric device from being used in the pool.
+3. **Unbind ResourceClaim and ResourceSlice**: The scheduler DRA plugin unbinds the `ResourceClaim` and `ResourceSlice` in `Unreserve`, clearing the allocation to prevent the fabric device from being used in the `ResourceClaim`.
 4. **Retry Scheduling**: In the next scheduling cycle, the scheduler attempts to bind the `ResourceClaim` again.
 
 ### Composable Controller Design Overview
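
Reader's note on the hunk above: the four failure-handling steps can be traced with a small, self-contained Go model. Everything in it is hypothetical (`Claim`, `markAttachFailed`, `unreserve`, `finishBindingCycle`, and the device name `fabric-gpu-0`); it does not use the real scheduler framework or DRA types. It also assumes that clearing the allocation drops the stale conditions, which the text does not state.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition types quoted from the KEP text.
const (
	condIsAttached   = "kubernetes.io/is-attached"
	condAttachFailed = "kubernetes.io/attach-failed"
)

// Condition is a hypothetical stand-in for one AllocatedDeviceStatus condition.
type Condition struct {
	Type   string
	Status string
}

// Claim is a hypothetical, heavily simplified stand-in for a ResourceClaim.
type Claim struct {
	AllocatedDevice string      // empty means no allocation
	Conditions      []Condition // conditions reported for the allocated device
}

var (
	errAttachFailed = errors.New("attachment failed; binding cycle failed")
	errNotReady     = errors.New("attachment outcome not reported yet")
)

func hasTrue(c *Claim, condType string) bool {
	for _, cond := range c.Conditions {
		if cond.Type == condType && cond.Status == "True" {
			return true
		}
	}
	return false
}

// markAttachFailed models step 1: the composable controller reports the failure.
func markAttachFailed(c *Claim) {
	c.Conditions = append(c.Conditions, Condition{Type: condAttachFailed, Status: "True"})
}

// unreserve models step 3: the allocation is cleared.
// Assumption: stale conditions are dropped together with the allocation.
func unreserve(c *Claim) {
	c.AllocatedDevice = ""
	c.Conditions = nil
}

// finishBindingCycle models steps 2 and 3: a failed attachment fails the
// cycle and triggers unreserve; a successful one lets binding proceed.
func finishBindingCycle(c *Claim) error {
	switch {
	case hasTrue(c, condAttachFailed):
		unreserve(c)
		return errAttachFailed
	case hasTrue(c, condIsAttached):
		return nil
	default:
		return errNotReady
	}
}

func main() {
	claim := &Claim{AllocatedDevice: "fabric-gpu-0"}
	markAttachFailed(claim)
	fmt.Println("first cycle:", finishBindingCycle(claim))
	fmt.Printf("allocation after unreserve: %q\n", claim.AllocatedDevice)

	// Step 4: the next scheduling cycle allocates again; this time the attachment succeeds.
	claim.AllocatedDevice = "fabric-gpu-0"
	claim.Conditions = []Condition{{Type: condIsAttached, Status: "True"}}
	fmt.Println("retry cycle error:", finishBindingCycle(claim))
}
```
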

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml

Lines changed: 5 additions & 2 deletions
@@ -8,11 +8,14 @@ owning-sig: sig-scheduling
 # - sig-bbb
 status: implementable
 #|implemented|deferred|rejected|withdrawn|replaced
-creation-date: 2025-02-03
+creation-date: 2025-02-04
 reviewers:
 - "@pohly"
+- "@dom4ha"
+- "@macsko"
+- "@sanposhiho"
 approvers:
-- TBD
+- "@alculquicondor"
 
 see-also:
 - "/keps/sig-node/4381-dra-structured-parameters"
