Commit f0a51d3

Update "Risks and Mitigations" and add "Alternative approach"
1 parent ff1ac0a commit f0a51d3

1 file changed: +81 -7 lines changed
  • keps/sig-scheduling/5007-device-attach-before-pod-scheduled

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

@@ -96,6 +96,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [PreBind Phase Timeout](#prebind-phase-timeout)
 - [Handling ResourceSlices Upon Failure of Attachment](#handling-resourceslices-upon-failure-of-attachment)
 - [Composable Controller Design Overview](#composable-controller-design-overview)
+- [Alternative approach](#alternative-approach)
 - [Test Plan](#test-plan)
 - [Prerequisite testing updates](#prerequisite-testing-updates)
 - [Unit tests](#unit-tests)
@@ -282,6 +283,24 @@ How will UX be reviewed, and by whom?
 Consider including folks who also work outside the SIG or subproject.
 -->
 
+**What if the scheduler restarts while the DRA plugin is waiting for the device(s) to be bound?**
+
+A scheduler restart should not pose an issue, because the decision to wait is based on the Conditions of the ResourceClaim.
+After a scheduler restart, if the device attachment is not yet complete, the scheduler will wait again at PreBind.
+If the attachment is complete, the pod will pass through PreBind.
+
+**Pods that are neither bound nor marked unschedulable in the API server are not visible to the Cluster Autoscaler, so there is a risk that the node will be scaled down.**
+
+Using node nomination can address this issue in coordination with the Cluster Autoscaler.
+This issue needs to be resolved before the beta is released.
+
+**The in-flight events cache may grow too large when waiting in PreBind.**
+
+To address this concern, the scheduling framework will be modified to flush the in-flight events cache before PreBind.
+This prevents issues in the scheduling queue caused by keeping pods at PreBind for an extended period.
+This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
+This issue needs to be resolved before the beta is released.
+
 ## Design Details
 
 ### DRA Scheduler Plugin Design Overview
@@ -317,10 +336,48 @@ type Device struct {
 
 // BasicDevice represents a basic device instance.
 type BasicDevice struct {
-    // Attributes contains additional attributes of the device.
+    // Attributes defines the set of attributes for this device.
+    // The name of each attribute must be unique in that set.
+    //
+    // The maximum number of attributes and capacities combined is 32.
     //
     // +optional
-    Attributes map[string]string
+    Attributes map[QualifiedName]DeviceAttribute
+    ...
+
+}
+
+...
+// DeviceAttribute must have exactly one field set.
+type DeviceAttribute struct {
+    // The Go field names below have a Value suffix to avoid a conflict between the
+    // field "String" and the corresponding method. That method is required.
+    // The Kubernetes API is defined without that suffix to keep it more natural.
+
+    // IntValue is a number.
+    //
+    // +optional
+    // +oneOf=ValueType
+    IntValue *int64
+
+    // BoolValue is a true/false value.
+    //
+    // +optional
+    // +oneOf=ValueType
+    BoolValue *bool
+
+    // StringValue is a string. Must not be longer than 64 characters.
+    //
+    // +optional
+    // +oneOf=ValueType
+    StringValue *string
+
+    // VersionValue is a semantic version according to semver.org spec 2.0.0.
+    // Must not be longer than 64 characters.
+    //
+    // +optional
+    // +oneOf=ValueType
+    VersionValue *string
 }
 ```
 
@@ -329,7 +386,7 @@ To indicate a fabric device, the following attribute will be added:
 ```yaml
 attributes:
   kubernetes.io/needs-attaching:
-    bool: "true"
+    boolValue: "true"
 ```
 
 #### `AllocatedDeviceStatus` Additions
@@ -358,14 +415,16 @@ const(
 ```
 
 #### Scheduler DRA plugin Additions
-When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following:
+When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following at `PreBind`:
 
 1. Set `AllocatedDeviceStatus.NodeName`.
 2. Add an `AllocatedDeviceStatus` with a condition of `Type: kubernetes.io/needs-attaching` and `Status: True`.
 3. Wait for a condition with `Type: kubernetes.io/is-attached` and `Status: True` in `PreBind` before proceeding.
-4. Give up when seeing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
+4. Reject the pod when observing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
 
 Note: There is a concern that the in-flight events cache may grow too large when waiting in PreBind.
+To address this concern, the scheduling framework will be modified to flush the in-flight events cache before PreBind.
+This prevents issues in the scheduling queue caused by keeping pods at PreBind for an extended period.
 This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
 
 #### PreBind Phase Timeout
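Review note on the hunk above: steps 3 and 4 of the plugin behaviour can be sketched as a pure function over the conditions, which also illustrates why a scheduler restart is harmless: the decision is recomputed from the stored conditions each time. This is a sketch under stated assumptions, not the plugin's implementation. The `Condition` type, the `PreBindDecision` values, and the `decide` function are hypothetical, and letting `attach-failed` win when both conditions are true is an assumption the KEP text does not spell out.

```go
package main

// ConditionStatus and Condition are simplified local stand-ins for the
// condition objects carried in AllocatedDeviceStatus.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type Condition struct {
	Type   string
	Status ConditionStatus
}

// Condition types named in the KEP text.
const (
	CondIsAttached   = "kubernetes.io/is-attached"
	CondAttachFailed = "kubernetes.io/attach-failed"
)

// PreBindDecision is a hypothetical result type for this sketch.
type PreBindDecision int

const (
	Wait    PreBindDecision = iota // keep waiting in PreBind
	Proceed                        // attachment done, continue binding
	Reject                         // attachment failed, reject the pod
)

// decide derives the PreBind outcome from the current conditions only,
// so it yields the same answer before and after a scheduler restart.
// Assumption: a true attach-failed condition takes precedence.
func decide(conds []Condition) PreBindDecision {
	attached := false
	for _, c := range conds {
		if c.Status != ConditionTrue {
			continue
		}
		switch c.Type {
		case CondAttachFailed:
			return Reject
		case CondIsAttached:
			attached = true
		}
	}
	if attached {
		return Proceed
	}
	return Wait
}
```

A caller would re-run `decide` whenever the ResourceClaim status changes and stop waiting once the result is `Proceed` or `Reject`, or when the PreBind timeout described in the next section expires.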
@@ -411,12 +470,12 @@ devices:
   attributes:
     ...
     kubernetes.io/needs-attaching:
-      bool: "true"
+      boolValue: "true"
 - name: device2
   attributes:
     ...
     kubernetes.io/needs-attaching:
-      bool: "true"
+      boolValue: "true"
 ```
 
 The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as `ResourceSlices`.
@@ -465,6 +524,21 @@ devices:
 ...
 ```
 
+### Alternative approach
+Instead of implementing the solution within the scheduler, we propose using the Cluster Autoscaler (CA) to manage the attachment and detachment of fabric devices.
+
+The key points and main process flow of this alternative proposal are as follows:
+
+The scheduler references only node-local ResourceSlices.
+If there are no available resources in the node-local ResourceSlices, the scheduler marks the Pod as unschedulable without waiting in the PreBind phase of the ResourceClaim.
+
+To handle fabric resources, we implement a Processor for composable systems within CA.
+This Processor identifies unschedulable Pods and determines whether attaching a fabric ResourceSlice device to an existing node would make scheduling possible.
+If so, the Processor requests the attachment, and the composable Operator performs the actual attachment.
+If attaching the fabric ResourceSlice does not make scheduling possible, the Processor determines whether to add a new node as usual.
+
+After the device is attached, the vendor DRA updates the node-local ResourceSlices.
+The vendor DRA needs a rescan function to update the Pool/ResourceSlice. The scheduler can then assign the node-local ResourceSlice devices to the unschedulable Pod; from this point on, the flow is the same as regular DRA.
 
 
 ### Test Plan
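Review note on the "Alternative approach" hunk above: the Processor's decision flow can be sketched roughly as follows. This is a heavily simplified illustration, not Cluster Autoscaler code. Devices are reduced to plain counts, and the `Pod`, `Node`, `Plan`, and `planFor` names are hypothetical. It only shows the ordering described in the text: try attaching fabric devices to an existing node first, and fall back to the usual scale-up path otherwise.

```go
package main

// Action is the hypothetical outcome chosen by the Processor sketch.
type Action int

const (
	AttachFabricDevice Action = iota // attach fabric devices to an existing node
	ScaleUpNode                      // fall back to the usual node scale-up path
)

// Pod and Node are simplified stand-ins: devices are modelled as counts.
type Pod struct {
	Name          string
	DevicesNeeded int
}

type Node struct {
	Name             string
	FreeLocalDevices int // free devices in node-local ResourceSlices
}

// Plan records what the Processor would request.
type Plan struct {
	Action   Action
	NodeName string // set only for AttachFabricDevice
	Devices  int    // number of fabric devices to attach
}

// planFor picks the first existing node whose device shortfall can be
// covered by the free fabric devices; otherwise it defers to scale-up.
func planFor(pod Pod, nodes []Node, freeFabricDevices int) Plan {
	for _, n := range nodes {
		missing := pod.DevicesNeeded - n.FreeLocalDevices
		if missing <= 0 {
			// Devices are not the blocker on this node, so attaching
			// more would not change the scheduling outcome here.
			continue
		}
		if missing <= freeFabricDevices {
			return Plan{Action: AttachFabricDevice, NodeName: n.Name, Devices: missing}
		}
	}
	return Plan{Action: ScaleUpNode}
}
```

In this model the actual attachment would be carried out by the composable Operator, after which the vendor DRA rescan publishes the new devices in the node-local ResourceSlices and the scheduler retries the Pod.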
