-When `kubernetes.io/needs-attaching: true` is set, the scheduler is expected to do the following:
+When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following:

1. Set `AllocatedDeviceStatus.NodeName`.
2. Add an `AllocatedDeviceStatus` with a condition of `Type: kubernetes.io/needs-attaching` and `Status: True`.
3. Wait for a condition with `Type: kubernetes.io/is-attached` and `Status: True` in `PreBind` before proceeding.
4. Give up when seeing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.

+Note: There is a concern that the in-flight events cache may grow too large when waiting in PreBind.
+This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
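
The four steps listed for the DRA plugin can be modeled as a small decision function over the device's conditions. The following is a minimal sketch in Python for illustration only; the real plugin is Go code inside kube-scheduler. The `Condition` and `AllocatedDeviceStatus` classes are simplified stand-ins, the helper names are hypothetical, and checking `attach-failed` before `is-attached` is an assumption, since the text does not say which condition wins if both are `True`.

```python
from dataclasses import dataclass, field

NEEDS_ATTACHING = "kubernetes.io/needs-attaching"
IS_ATTACHED = "kubernetes.io/is-attached"
ATTACH_FAILED = "kubernetes.io/attach-failed"


@dataclass
class Condition:
    type: str
    status: str  # "True" or "False"


@dataclass
class AllocatedDeviceStatus:
    # Simplified stand-in: only the fields mentioned in the text.
    node_name: str = ""
    conditions: list = field(default_factory=list)


def has_true_condition(status, cond_type):
    """Return True if a condition of the given type has Status "True"."""
    return any(c.type == cond_type and c.status == "True"
               for c in status.conditions)


def mark_needs_attaching(status, node_name):
    """Steps 1 and 2: record the node and request attachment."""
    status.node_name = node_name
    status.conditions.append(Condition(NEEDS_ATTACHING, "True"))


def prebind_decision(status):
    """Steps 3 and 4: decide whether PreBind may proceed.

    Assumption: a reported failure takes precedence over is-attached.
    """
    if has_true_condition(status, ATTACH_FAILED):
        return "fail"
    if has_true_condition(status, IS_ATTACHED):
        return "proceed"
    return "wait"
```

With this model, a status that only carries `needs-attaching` yields `"wait"`, and the result changes to `"proceed"` or `"fail"` once the corresponding condition is added.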
+
+#### PreBind Phase Timeout
+
+A successful device attachment is expected to take no longer than 5 minutes.
+Therefore, if a fixed timeout is set for the scheduler, we would like it to be 10 minutes, which leaves a margin over the expected attachment time.
+
+Even if neither the `Type: kubernetes.io/is-attached` nor the `Type: kubernetes.io/attach-failed` condition is ever updated, the timeout prevents the scheduler from waiting indefinitely in the `PreBind` phase.
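
The effect of a fixed timeout can be sketched as a deadline-bounded wait. This is a hedged illustration in Python, not the scheduler's implementation: `wait_for_attachment`, its `check` callback, and the polling interval are hypothetical, and an actual plugin may react to update events rather than poll. The clock and sleep functions are injectable so the behavior can be exercised without waiting in real time.

```python
import time

PREBIND_TIMEOUT_SECONDS = 10 * 60  # the 10-minute value proposed in the text


def wait_for_attachment(check, timeout=PREBIND_TIMEOUT_SECONDS, interval=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it settles or the deadline passes.

    check is a hypothetical callable returning "proceed", "fail", or "wait".
    Returns "proceed", "fail", or "timeout".
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result in ("proceed", "fail"):
            return result
        if clock() >= deadline:
            # Neither condition was updated in time: stop waiting.
            return "timeout"
        sleep(interval)
```

A `"timeout"` result would be treated like a failed attachment by the caller, so the pod does not stay in `PreBind` when the controller never reports a result.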
+
#### Handling ResourceSlices Upon Failure of Attachment

During the scheduling cycle, the DRA plugin reserves a `ResourceSlice` for the `ResourceClaim`.
@@ -370,9 +381,9 @@ If a fabric device is selected, the scheduler waits for the device attachment du
The composable controller performs the attachment operation by checking the flag of the `ResourceClaim`.
If the attachment fails, the following steps are taken:

-1. **Update ResourceClaim**: The composable controller updates the `ResourceClaim` to indicate the failure of the attachment by setting a condition with `Type: kubernetes.io/is-attached` and `Status: False`.
+1. **Update ResourceClaim**: The composable controller updates the `AllocatedDeviceStatus` to indicate the failure of the attachment by setting a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
2. **Fail the Binding Cycle**: The scheduler detects the failed attachment condition and fails the binding cycle. This prevents the pod from proceeding with an unattached device.
-3. **Unbind ResourceClaim and ResourceSlice**: The scheduler unbinds the `ResourceClaim` and `ResourceSlice`, clearing the allocation to prevent the fabric device from being used in the pool.
+3. **Unbind ResourceClaim and ResourceSlice**: The scheduler DRA plugin unbinds the `ResourceClaim` and `ResourceSlice` in `Unreserve`, clearing the allocation to prevent the fabric device from being used in the `ResourceClaim`.
4. **Retry Scheduling**: In the next scheduling cycle, the scheduler attempts to bind the `ResourceClaim` again.
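
The failure path in the revised steps can be walked through with a toy model. The sketch below is illustrative Python, not scheduler code: the `Claim` class, the function names, and the `pick_device` and `attach` callbacks are hypothetical. Clearing the conditions together with the allocation in `unreserve` is an assumption made so that a retry starts from a clean state.

```python
from dataclasses import dataclass, field
from typing import Optional

ATTACH_FAILED = "kubernetes.io/attach-failed"


@dataclass
class Claim:
    # Simplified stand-in for a ResourceClaim; field names are illustrative.
    allocated_device: Optional[str] = None
    conditions: dict = field(default_factory=dict)  # condition type -> status


def controller_report_failure(claim):
    """Step 1: the composable controller records the failed attachment."""
    claim.conditions[ATTACH_FAILED] = "True"


def prebind(claim):
    """Step 2: the binding cycle fails when attach-failed is True."""
    return claim.conditions.get(ATTACH_FAILED) != "True"


def unreserve(claim):
    """Step 3: clear the allocation (and, by assumption, its conditions)."""
    claim.allocated_device = None
    claim.conditions.clear()


def schedule_once(claim, pick_device, attach):
    """One scheduling attempt; returns True if binding can proceed."""
    claim.allocated_device = pick_device()
    if not attach(claim.allocated_device):
        controller_report_failure(claim)
    if prebind(claim):
        return True
    unreserve(claim)
    return False
```

Step 4 corresponds to calling `schedule_once` again: after a failed attempt the claim has no allocation left, so the next cycle can select a device and try the attachment from scratch.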