Commit ebcc862

Update: Use the BindingGates approach.
1 parent cd9e43f commit ebcc862

3 files changed

+79
-83
lines changed

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

Lines changed: 78 additions & 82 deletions
Original file line number | Diff line number | Diff line change
@@ -90,9 +90,9 @@ tags, and then generate with `hack/update-toc.sh`.
9090
- [Risks and Mitigations](#risks-and-mitigations)
9191
- [Design Details](#design-details)
9292
- [DRA Scheduler Plugin Design Overview](#dra-scheduler-plugin-design-overview)
93-
- [Device Attribute Additions](#device-attribute-additions)
94-
- [<code>AllocatedDeviceStatus</code> Additions](#allocateddevicestatus-additions)
95-
- [Scheduler DRA plugin Additions](#scheduler-dra-plugin-additions)
93+
- [BasicDevice Enhancements](#basicdevice-enhancements)
94+
- [AllocatedDeviceStatus Enhancements](#allocateddevicestatus-enhancements)
95+
- [Scheduler DRA Plugin Modifications](#scheduler-dra-plugin-modifications)
9696
- [PreBind Phase Timeout](#prebind-phase-timeout)
9797
- [Handling ResourceSlices Upon Failure of Attachment](#handling-resourceslices-upon-failure-of-attachment)
9898
- [Composable Controller Design Overview](#composable-controller-design-overview)
@@ -315,7 +315,7 @@ This issue needs to be resolved before the beta is released.
315315

316316
**The in-flight events cache may grow too large when waiting in PreBind.**
317317

318-
To address the PreBind concern, the solution is to modify the scheduling framework to flush the in-flight events cache before PreBind.
318+
To address the PreBind concern, the solution is to modify the scheduling framework to flush the in-flight events cache before PreBind.
319319
This prevents issues in the scheduling queue caused by keeping pods at PreBind for an extended period.
320320
This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
321321
This issue needs to be resolved before the beta is released.
@@ -330,11 +330,10 @@ The composable controller design is also discussed, emphasizing efficient utiliz
330330

331331
![proposal](proposal.jpg)
332332

333-
#### Device Attribute Additions
333+
#### BasicDevice Enhancements
334334

335-
To indicate whether a device is a fabric device, an attribute is added to the `Basic` within `Device`.
336-
This attribute will be used by the controller that exposes the `ResourceSlice` to notify whether the device is a fabric device.
337-
To avoid impacting existing DRA functionality, the default value of this attribute is set to `false`.
335+
To indicate whether a device is a fabric device, fields are added to the `Basic` within `Device`.
336+
The controller that exposes the `ResourceSlice` sets these fields to indicate whether the device is a fabric device.
338337

339338
```go
340339
// Device represents one individual hardware instance that can be selected based
@@ -362,87 +361,87 @@ type BasicDevice struct {
362361
//
363362
// +optional
364363
Attributes map[QualifiedName]DeviceAttribute
365-
...
366-
367-
}
368364

369-
...
370-
// DeviceAttribute must have exactly one field set.
371-
type DeviceAttribute struct {
372-
// The Go field names below have a Value suffix to avoid a conflict between the
373-
// field "String" and the corresponding method. That method is required.
374-
// The Kubernetes API is defined without that suffix to keep it more natural.
375-
376-
// IntValue is a number.
377-
//
378-
// +optional
379-
// +oneOf=ValueType
380-
IntValue *int64
381-
382-
// BoolValue is a true/false value.
383-
//
384-
// +optional
385-
// +oneOf=ValueType
386-
BoolValue *bool
387-
388-
// StringValue is a string. Must not be longer than 64 characters.
389-
//
390-
// +optional
391-
// +oneOf=ValueType
392-
StringValue *string
393-
394-
// VersionValue is a semantic version according to semver.org spec 2.0.0.
395-
// Must not be longer than 64 characters.
396-
//
397-
// +optional
398-
// +oneOf=ValueType
399-
VersionValue *string
400-
}
401-
```
365+
// BindingGates defines the gates for binding.
366+
//
367+
// +optional
368+
BindingGates []string
402369

403-
To indicate a fabric device, the following attribute will be added:
370+
// BindingFailureGates defines the gates for binding failure.
371+
//
372+
// +optional
373+
BindingFailureGates []string
404374

405-
```yaml
406-
attributes:
407-
kubernetes.io/needs-attaching:
408-
boolValue: "true"
375+
// UsageRestrictedToNode indicates if the usage of an allocation involving this device
376+
// has to be limited to exactly the node that was chosen when allocating the claim.
377+
//
378+
// +optional
379+
UsageRestrictedToNode bool
380+
}
409381
```
410382

411-
#### `AllocatedDeviceStatus` Additions
383+
#### AllocatedDeviceStatus Enhancements
412384

413-
The `Conditions` field within `AllocatedDeviceStatus` is used to indicate the status of the device attachment.
414-
This field will contain a list of conditions, each representing a specific state or event related to the device.
385+
The `BindingGates` and `BindingFailureGates` fields within `AllocatedDeviceStatus` are used to indicate the status of the device attachment.
386+
Each field maps a gate name to a boolean that represents a specific state or event related to the device.
415387

416-
For this feature, the NodeName and following `ConditionType` constants are added:
388+
For this feature, the following fields are added:
417389

418390
```go
419391
// AllocatedDeviceStatus contains the status of an allocated device, if the
420392
// driver chooses to report it. This may include driver-specific information.
421393
type AllocatedDeviceStatus struct {
422-
...
423-
// NodeName contains the name of the node where the device needs to be attached.
394+
...
395+
// BindingGates defines the gates for binding.
424396
//
425397
// +optional
426-
NodeName string
427-
}
398+
BindingGates map[string]bool
428399

429-
const(
430-
DRADeviceNeedAttachType = "kubernetes.io/needs-attaching"
431-
DRADeviceIsAttachType = "kubernetes.io/is-attached"
432-
DRADeviceAttachFailType = "kubernetes.io/attach-failed"
433-
)
400+
// BindingFailureGates defines the gates for binding failure.
401+
//
402+
// +optional
403+
BindingFailureGates map[string]bool
404+
}
434405
```
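
The gate fields are lists of names on `BasicDevice` but maps from name to boolean on `AllocatedDeviceStatus`, so the copy in the scheduler plugin involves a small conversion. The following is a minimal sketch of one way to do it. The trimmed-down types, the `toGateMap` and `copyGates` helpers, and the choice to initialise every gate to `false` are assumptions made for illustration and are not defined by this KEP.

```go
package main

import "fmt"

// BasicDevice is a trimmed-down stand-in for the API type; only the gate
// fields from this KEP are modelled.
type BasicDevice struct {
	BindingGates        []string
	BindingFailureGates []string
}

// AllocatedDeviceStatus is likewise reduced to the two gate maps.
type AllocatedDeviceStatus struct {
	BindingGates        map[string]bool
	BindingFailureGates map[string]bool
}

// toGateMap turns a list of gate names into a map with every gate unset.
// Assumption: gates start as false and are flipped later by a controller.
func toGateMap(names []string) map[string]bool {
	gates := make(map[string]bool, len(names))
	for _, name := range names {
		gates[name] = false
	}
	return gates
}

// copyGates is a hypothetical helper, not an existing scheduler function.
func copyGates(dev BasicDevice) AllocatedDeviceStatus {
	return AllocatedDeviceStatus{
		BindingGates:        toGateMap(dev.BindingGates),
		BindingFailureGates: toGateMap(dev.BindingFailureGates),
	}
}

func main() {
	dev := BasicDevice{
		BindingGates:        []string{"kubernetes.io/is-prepared"},
		BindingFailureGates: []string{"kubernetes.io/preparing-failed"},
	}
	status := copyGates(dev)
	fmt.Println(status.BindingGates, status.BindingFailureGates)
}
```

Keeping the conversion in one helper means both gate fields are treated identically, which matters if the initial value ever changes.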
435406

436-
#### Scheduler DRA plugin Additions
437-
When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following at `PreBind`:
407+
#### Scheduler DRA Plugin Modifications
408+
409+
When `UsageRestrictedToNode: true` is set, the scheduler DRA plugin will perform the following steps:
410+
411+
1. **Set NodeSelector**: Before the `PreBind` phase, add the `NodeName` to the `ResourceClaim`'s `NodeSelector`.
412+
413+
If gates are present, the scheduler DRA plugin will perform the following steps during the `PreBind` phase:
438414

439-
1. Set `AllocatedDeviceStatus.NodeName`.
440-
2. Add an `AllocatedDeviceStatus` with a condition of `Type: kubernetes.io/needs-attaching` and `Status: True`.
441-
3. Wait for a condition with `Type: kubernetes.io/is-attached` and `Status: True` in `PreBind` before proceeding.
442-
4. Reject the pod when observing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
415+
2. **Copy Gates**: Copy `BindingGates` and `BindingFailureGates` from `ResourceSlice.Device.Basic` to `AllocatedDeviceStatus`.
416+
3. **Wait for Conditions**: Wait for the following conditions:
417+
- If `NeedToPreparing` is `True`, wait until `IsPrepared` is `True` before proceeding to Bind.
418+
- If `PreparingFailed` is `True`, clear the allocation in the `ResourceClaim` and reschedule the Pod.
419+
- If the preparation takes longer than the `PreparingTimeout` period, clear the allocation in the `ResourceClaim` and reschedule the Pod.
420+
421+
To support these steps, the following keys are defined:
422+
423+
```go
424+
const (
425+
// NeedToPreparing indicates that this device needs some preparation.
426+
// If this flag is True, the scheduler waits in PreBind.
427+
NeedToPreparing = "kubernetes.io/need-to-preparing"
428+
429+
// IsPrepared indicates the device ready state.
430+
// If NeedToPreparing is True and IsPrepared is True, the scheduler proceeds to Bind.
431+
IsPrepared = "kubernetes.io/is-prepared"
432+
433+
// PreparingFailed indicates the device preparation failed state.
434+
// If PreparingFailed is True, the scheduler will clear the allocation in the ResourceClaim and reschedule the Pod.
435+
PreparingFailed = "kubernetes.io/preparing-failed"
436+
437+
// PreparingTimeout indicates the prepare timeout period.
438+
// If the timeout period is exceeded, the scheduler clears the allocation in the ResourceClaim and reschedules the Pod.
439+
PreparingTimeout = "kubernetes.io/preparing-timeout"
440+
)
441+
```
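
The wait conditions in step 3 can be read as a small decision function over the gate maps. The sketch below is illustrative only: the `Decision` type and the `decide` function are hypothetical, and the assumption that `NeedToPreparing` and `IsPrepared` live in `BindingGates` while `PreparingFailed` lives in `BindingFailureGates` is not stated by the KEP. Timeout handling is left out.

```go
package main

import "fmt"

// Gate keys as defined in the KEP text.
const (
	NeedToPreparing = "kubernetes.io/need-to-preparing"
	IsPrepared      = "kubernetes.io/is-prepared"
	PreparingFailed = "kubernetes.io/preparing-failed"
)

// Decision is a hypothetical result type for the PreBind check.
type Decision int

const (
	Proceed    Decision = iota // continue to Bind
	Wait                       // keep waiting in PreBind
	Reschedule                 // clear the allocation and reschedule the Pod
)

func (d Decision) String() string {
	switch d {
	case Proceed:
		return "proceed"
	case Wait:
		return "wait"
	default:
		return "reschedule"
	}
}

// decide evaluates the gates once. A failure gate takes priority over the
// readiness gates. Which map holds which key is an assumption of this sketch.
func decide(bindingGates, failureGates map[string]bool) Decision {
	if failureGates[PreparingFailed] {
		return Reschedule
	}
	if bindingGates[NeedToPreparing] && !bindingGates[IsPrepared] {
		return Wait
	}
	return Proceed
}

func main() {
	gates := map[string]bool{NeedToPreparing: true, IsPrepared: false}
	fmt.Println(decide(gates, nil))

	gates[IsPrepared] = true
	fmt.Println(decide(gates, nil))

	fmt.Println(decide(gates, map[string]bool{PreparingFailed: true}))
}
```

Checking the failure gate first means a device that reports both prepared and failed is treated as failed, which is the more conservative reading.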
443442

444443
Note: There is a concern that the in-flight events cache may grow too large when waiting in PreBind.
445-
To address the PreBind concern, the solution is to modify the scheduling framework to flush the in-flight events cache before PreBind.
444+
To address this, the scheduling framework will be modified to flush the in-flight events cache before PreBind.
446445
This prevents issues in the scheduling queue caused by keeping pods at PreBind for an extended period.
447446
This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
448447

@@ -451,7 +450,7 @@ This issue will be addressed separately as outlined in kubernetes/kubernetes#129
451450
If the device attachment is successful, we expect it to take no longer than 5 minutes.
452451
However, to account for potential update lags, we would like to set a fixed scheduler timeout of 10 minutes.
453452

454-
Even if the conditions `Type: kubernetes.io/is-attached` or `Type: kubernetes.io/attach-failed` are not updated, setting a timeout will prevent the scheduler from waiting indefinitely in the PreBind phase.
453+
Even if the conditions indicating that the device is attached or that the attachment failed are not updated, setting a timeout will prevent the scheduler from waiting indefinitely in the PreBind phase.
455454

456455
#### Handling ResourceSlices Upon Failure of Attachment
457456

@@ -486,15 +485,11 @@ driver: gpu.nvidia.com
486485
nodeSelector: fabric1
487486
devices:
488487
- name: device1
489-
attributes:
490-
...
491-
kubernetes.io/needs-attaching:
492-
boolValue: "true"
488+
UsageRestrictedToNode: true
489+
...
493490
- name: device2
494-
attributes:
495-
...
496-
kubernetes.io/needs-attaching:
497-
boolValue: "true"
491+
UsageRestrictedToNode: true
492+
...
498493
```
499494

500495
The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as `ResourceSlices`.
@@ -544,11 +539,12 @@ devices:
544539
```
545540

546541
The composable DRA controller exposes the list of free devices on the fabric that are not yet connected to a node as a ResourceSlice.
547-
The controller refreshes the ResourceSlice periodically (every 10 seconds).
542+
The controller refreshes the ResourceSlice periodically (every 10 seconds).
548543
This means that it reflects the latest list of devices on the fabric.
549544
It does not detect attachment to or detachment from nodes and update the ResourceSlice immediately (for example, in event handlers).
550545
This is because it is difficult for a Composable DRA running on K8s to cover all cases where a ResourceSlice needs to be updated, such as when a new device is physically added to the fabric.
551-
We also expect vendor DRAs to periodically update the list of devices connected to the node as a ResourceSlice. This requires the rescan function to be run periodically.
546+
We also expect vendor DRAs to periodically update the list of devices connected to the node as a ResourceSlice.
547+
This requires the rescan function to be run periodically.
552548

553549
Each device in a composable ResourceSlice has a unique device name.
554550
However, the device name is not an identifying name (for example, a UUID).
@@ -571,7 +567,7 @@ And then, device autoscaler tries to attach new devices.
571567
It also tries to detach devices if they have not been used for a period of time.
572568
This is similar to the concept of CA.
573569

574-
However, if CA and the device autoscaler are running independently, CA may add a node with a device at the same time as the device autoscaler attaches the device.
570+
However, if CA and the device autoscaler are running independently, CA may add a node with a device at the same time as the device autoscaler attaches the device.
575571
This is a wasted resource addition.
576572
Therefore, the idea is to put the device-scaling functionality in CA.
577573

@@ -581,7 +577,7 @@ If so, the Processor instructs the attachment of the resource, using the composa
581577
If attaching the fabric ResourceSlice does not make scheduling possible, the Processor determines whether to add a new node as usual.
582578

583579
After the device is attached, the vendor DRA updates the node-local ResourceSlices.
584-
The vendor DRA needs a rescan function to update the Pool/ResourceSlice.
580+
The vendor DRA needs a rescan function to update the Pool/ResourceSlice.
585581
The scheduler can then assign the node-local ResourceSlice devices to the unschedulable Pod, operating the same as the usual DRA from this point.
586582

587583
### Test Plan

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/kep.yaml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ owning-sig: sig-scheduling
88
# - sig-bbb
99
status: implementable
1010
#|implemented|deferred|rejected|withdrawn|replaced
11-
creation-date: 2025-02-04
11+
creation-date: 2025-02-10
1212
reviewers:
1313
- "@pohly"
1414
- "@dom4ha"