@@ -282,6 +283,24 @@ How will UX be reviewed, and by whom?
282
283
Consider including folks who also work outside the SIG or subproject.
283
284
-->
284
285
286
+
**What if the scheduler restarts while the DRA plugin is waiting for the device(s) to be bound?**
287
+
288
+
A scheduler restart should not cause a problem, because the decision to wait is based on the Conditions of the ResourceClaim rather than on in-memory scheduler state.
289
+
After a scheduler restart, if the device attachment is not yet complete, the scheduler will wait again at PreBind.
290
+
If the attachment is already complete, the pod passes through PreBind without waiting.
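
The restart behavior can be illustrated with a minimal sketch. It assumes the decision is a pure function of the claim's conditions, so a restarted scheduler reaches the same result without any saved state. `Condition`, `Decision`, and `decideAtPreBind` are hypothetical names for illustration, not the plugin's actual code; only the condition type strings come from this KEP.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition; only the fields
// needed for this decision are modeled (assumption).
type Condition struct {
	Type   string
	Status string
}

// Decision is a hypothetical result type used only in this sketch.
type Decision string

const (
	Proceed Decision = "proceed"
	Wait    Decision = "wait"
	Reject  Decision = "reject"
)

// decideAtPreBind derives the outcome purely from the conditions it is given.
// It keeps no state, so calling it again after a restart gives the same answer.
func decideAtPreBind(conds []Condition) Decision {
	attached := false
	for _, c := range conds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case "kubernetes.io/attach-failed":
			return Reject // a failure takes precedence over everything else
		case "kubernetes.io/is-attached":
			attached = true
		}
	}
	if attached {
		return Proceed
	}
	return Wait
}

func main() {
	pending := []Condition{{Type: "kubernetes.io/needs-attaching", Status: "True"}}
	done := []Condition{
		{Type: "kubernetes.io/needs-attaching", Status: "True"},
		{Type: "kubernetes.io/is-attached", Status: "True"},
	}
	fmt.Println(decideAtPreBind(pending)) // wait
	fmt.Println(decideAtPreBind(done))    // proceed
}
```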
291
+
292
+
**Pods that are neither bound nor marked unschedulable in the API server are not visible to the Cluster Autoscaler, so there is a risk that the node will be scaled down.**
293
+
294
+
Using node nomination to coordinate with the Cluster Autoscaler can address this issue.
295
+
This issue needs to be resolved before the beta is released.
296
+
297
+
**The in-flight events cache may grow too large when waiting in PreBind.**
298
+
299
+
The proposed solution is to modify the scheduling framework so that it flushes the in-flight events cache before PreBind.
300
+
This prevents issues in the scheduling queue caused by keeping pods at PreBind for an extended period.
301
+
This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
302
+
This issue needs to be resolved before the beta is released.
303
+
285
304
## Design Details
286
305
287
306
### DRA Scheduler Plugin Design Overview
@@ -317,10 +336,48 @@ type Device struct {
317
336
318
337
// BasicDevice represents a basic device instance.
319
338
type BasicDevice struct {
320
-
// Attributes contains additional attributes of the device.
339
+
// Attributes defines the set of attributes for this device.
340
+
// The name of each attribute must be unique in that set.
341
+
//
342
+
// The maximum number of attributes and capacities combined is 32.
321
343
//
322
344
// +optional
323
-
Attributes map[string]string
345
+
Attributes map[QualifiedName]DeviceAttribute
346
+
...
347
+
348
+
}
349
+
350
+
...
351
+
// DeviceAttribute must have exactly one field set.
352
+
type DeviceAttribute struct {
353
+
// The Go field names below have a Value suffix to avoid a conflict between the
354
+
// field "String" and the corresponding method. That method is required.
355
+
// The Kubernetes API is defined without that suffix to keep it more natural.
356
+
357
+
// IntValue is a number.
358
+
//
359
+
// +optional
360
+
// +oneOf=ValueType
361
+
IntValue *int64
362
+
363
+
// BoolValue is a true/false value.
364
+
//
365
+
// +optional
366
+
// +oneOf=ValueType
367
+
BoolValue *bool
368
+
369
+
// StringValue is a string. Must not be longer than 64 characters.
370
+
//
371
+
// +optional
372
+
// +oneOf=ValueType
373
+
StringValue *string
374
+
375
+
// VersionValue is a semantic version according to semver.org spec 2.0.0.
376
+
// Must not be longer than 64 characters.
377
+
//
378
+
// +optional
379
+
// +oneOf=ValueType
380
+
VersionValue *string
324
381
}
325
382
```
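
The "exactly one field set" rule and the 64-character limits described in the comments above can be checked as in the following sketch. It redeclares a local copy of `DeviceAttribute` to stay self-contained; `validateDeviceAttribute` is a hypothetical helper, not the actual API validation code. It measures length in bytes and does not check that `VersionValue` is valid semver.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceAttribute mirrors the struct shown above (local copy for this sketch).
type DeviceAttribute struct {
	IntValue     *int64
	BoolValue    *bool
	StringValue  *string
	VersionValue *string
}

// maxValueLen is the 64-character limit stated for string and version values.
const maxValueLen = 64

// validateDeviceAttribute is a hypothetical helper: it enforces that exactly
// one field is set and that string-like values respect the length limit.
func validateDeviceAttribute(a DeviceAttribute) error {
	set := 0
	if a.IntValue != nil {
		set++
	}
	if a.BoolValue != nil {
		set++
	}
	if a.StringValue != nil {
		set++
	}
	if a.VersionValue != nil {
		set++
	}
	if set != 1 {
		return fmt.Errorf("exactly one value field must be set, got %d", set)
	}
	// Assumption: length is measured in bytes here for simplicity.
	if a.StringValue != nil && len(*a.StringValue) > maxValueLen {
		return errors.New("StringValue is longer than 64 characters")
	}
	if a.VersionValue != nil && len(*a.VersionValue) > maxValueLen {
		return errors.New("VersionValue is longer than 64 characters")
	}
	return nil
}

func main() {
	yes := true
	n := int64(8)
	fmt.Println(validateDeviceAttribute(DeviceAttribute{BoolValue: &yes}))
	fmt.Println(validateDeviceAttribute(DeviceAttribute{BoolValue: &yes, IntValue: &n}))
}
```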
326
383
@@ -329,7 +386,7 @@ To indicate a fabric device, the following attribute will be added:
329
386
```yaml
330
387
attributes:
331
388
kubernetes.io/needs-attaching:
332
-
bool: "true"
389
+
boolValue: "true"
333
390
```
334
391
335
392
#### `AllocatedDeviceStatus` Additions
@@ -358,14 +415,16 @@ const(
358
415
```
359
416
360
417
#### Scheduler DRA plugin Additions
361
-
When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following:
418
+
When `kubernetes.io/needs-attaching: true` is set, the scheduler DRA plugin is expected to do the following at `PreBind`:
362
419
363
420
1. Set `AllocatedDeviceStatus.NodeName`.
364
421
2. Add an `AllocatedDeviceStatus` with a condition of `Type: kubernetes.io/needs-attaching` and `Status: True`.
365
422
3. Wait for a condition with `Type: kubernetes.io/is-attached` and `Status: True` in `PreBind` before proceeding.
366
-
4. Give up when seeing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
423
+
4. Reject the pod when observing a condition with `Type: kubernetes.io/attach-failed` and `Status: True`.
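
The four steps above can be sketched as a polling loop bounded by a context deadline. This is an illustration under stated assumptions, not the plugin's implementation: `Condition` and `AllocatedDeviceStatus` are simplified local types, and `preBind`, `hasTrue`, the `refresh` callback, and `simulate` are hypothetical names. A real plugin would more likely watch the ResourceClaim than poll it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Simplified local stand-ins for the API types (assumption).
type Condition struct {
	Type   string
	Status string
}

type AllocatedDeviceStatus struct {
	NodeName   string
	Conditions []Condition
}

var errAttachFailed = errors.New("device attachment failed")

// hasTrue reports whether a condition of the given type has Status "True".
func hasTrue(s *AllocatedDeviceStatus, condType string) bool {
	for _, c := range s.Conditions {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

// preBind follows steps 1 to 4. refresh is a hypothetical callback that
// re-reads the device status from the API server.
func preBind(ctx context.Context, s *AllocatedDeviceStatus, node string,
	interval time.Duration, refresh func(*AllocatedDeviceStatus)) error {
	s.NodeName = node // step 1
	s.Conditions = append(s.Conditions, // step 2
		Condition{Type: "kubernetes.io/needs-attaching", Status: "True"})

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		refresh(s)
		if hasTrue(s, "kubernetes.io/attach-failed") {
			return errAttachFailed // step 4: reject the pod
		}
		if hasTrue(s, "kubernetes.io/is-attached") {
			return nil // step 3: attachment done, proceed with binding
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline reached while waiting
		case <-ticker.C:
		}
	}
}

// simulate runs preBind against a fake controller that reports resultType
// on the given refresh call.
func simulate(resultType string, onCall int) (*AllocatedDeviceStatus, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	s := &AllocatedDeviceStatus{}
	calls := 0
	refresh := func(st *AllocatedDeviceStatus) {
		calls++
		if calls == onCall {
			st.Conditions = append(st.Conditions, Condition{Type: resultType, Status: "True"})
		}
	}
	err := preBind(ctx, s, "node-a", time.Millisecond, refresh)
	return s, err
}

func main() {
	_, err := simulate("kubernetes.io/is-attached", 3)
	fmt.Println("attached:", err)
	_, err = simulate("kubernetes.io/attach-failed", 2)
	fmt.Println("failed:", err)
}
```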
367
424
368
425
Note: There is a concern that the in-flight events cache may grow too large when waiting in PreBind.
426
+
The proposed solution is to modify the scheduling framework so that it flushes the in-flight events cache before PreBind.
427
+
This prevents issues in the scheduling queue caused by keeping pods at PreBind for an extended period.
369
428
This issue will be addressed separately as outlined in kubernetes/kubernetes#129967.
370
429
371
430
#### PreBind Phase Timeout
@@ -411,12 +470,12 @@ devices:
411
470
attributes:
412
471
...
413
472
kubernetes.io/needs-attaching:
414
-
bool: "true"
473
+
boolValue: "true"
415
474
- name: device2
416
475
attributes:
417
476
...
418
477
kubernetes.io/needs-attaching:
419
-
bool: "true"
478
+
boolValue: "true"
420
479
```
421
480
422
481
The vendor's DRA kubelet plugin will also publish the devices managed by the vendor as `ResourceSlices`.
@@ -465,6 +524,21 @@ devices:
465
524
...
466
525
```
467
526
527
+
### Alternative approach
528
+
Instead of implementing the solution within the scheduler, this alternative uses the Cluster Autoscaler (CA) to manage the attachment and detachment of fabric devices.
529
+
530
+
The key points and main process flow of this alternative proposal are as follows:
531
+
532
+
The scheduler references only node-local ResourceSlices.
533
+
If there are no available resources in the node-local ResourceSlices, the scheduler marks the Pod as unschedulable without waiting in the PreBind phase of the ResourceClaim.
534
+
535
+
To handle fabric resources, a Processor for composable systems is implemented within the Cluster Autoscaler.
536
+
This Processor identifies unschedulable Pods and determines whether attaching a fabric ResourceSlice device to an existing node would make scheduling possible.
537
+
If so, the Processor requests the attachment, and the composable Operator performs the actual attachment.
538
+
If attaching the fabric ResourceSlice does not make scheduling possible, the Processor determines whether to add a new node as usual.
539
+
540
+
After the device is attached, the vendor's DRA driver updates the node-local ResourceSlices.
541
+
For this, the vendor's DRA driver needs a rescan function that updates the Pool/ResourceSlice. The scheduler can then assign the node-local ResourceSlice devices to the previously unschedulable Pod, and from this point on everything works as in regular DRA.
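
The Processor's choice between attaching a fabric device and adding a node can be sketched as follows. This is a simplified illustration, assuming device counts are the only constraint; the `Node` and `Plan` types and `planForPod` are hypothetical and not part of the Cluster Autoscaler API. A real Processor would also have to consider CPU, memory, and other scheduling constraints.

```go
package main

import "fmt"

// Node is a hypothetical view of a node for this sketch.
type Node struct {
	Name        string
	FreeDevices int // free devices in the node-local ResourceSlices
}

// Plan is the hypothetical outcome of the Processor's decision.
type Plan struct {
	Action string // "attach" or "scale-up"
	Node   string // target node when Action is "attach"
	Count  int    // number of fabric devices to attach
}

// planForPod picks the first node whose device shortfall can be covered by
// free fabric devices; otherwise it falls back to the usual scale-up path.
func planForPod(needed int, nodes []Node, fabricFree int) Plan {
	for _, n := range nodes {
		shortfall := needed - n.FreeDevices
		if shortfall > 0 && shortfall <= fabricFree {
			return Plan{Action: "attach", Node: n.Name, Count: shortfall}
		}
	}
	return Plan{Action: "scale-up"}
}

func main() {
	nodes := []Node{{Name: "node-a", FreeDevices: 1}}
	fmt.Println(planForPod(2, nodes, 4)) // fabric pool can cover the shortfall
	fmt.Println(planForPod(2, nodes, 0)) // no fabric devices left
}
```

Once the attach plan is carried out and the driver has rescanned, the Pod becomes schedulable against node-local ResourceSlices only, so the scheduler itself needs no PreBind wait in this alternative.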