keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md
41 additions & 45 deletions
@@ -289,6 +289,21 @@ The scheduler's restart should not pose an issue, as the decision to wait is bas
289
289
After a scheduler restart, if the device attachment is not yet complete, the scheduler will wait again at PreBind.
290
290
If the attachment is complete, it will pass through PreBind.
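The restart behaviour described above can be sketched in Go. This is a minimal illustration, not the plugin's actual code: `AttachState`, `PreBindAction`, and `decidePreBind` are hypothetical names, and the sketch assumes the attachment state is read back from persisted objects rather than from scheduler memory.

```go
package main

import "fmt"

// AttachState is a hypothetical summary of the attachment status the
// scheduler can read back for one claimed fabric device.
type AttachState int

const (
	AttachPending AttachState = iota // attachment requested, not finished
	AttachDone                       // device is attached to the node
)

// PreBindAction is the hypothetical outcome of the PreBind check.
type PreBindAction string

const (
	Wait    PreBindAction = "wait"
	Proceed PreBindAction = "proceed"
)

// decidePreBind derives the action only from the observed device states,
// so it gives the same answer before and after a scheduler restart.
func decidePreBind(states []AttachState) PreBindAction {
	for _, s := range states {
		if s != AttachDone {
			return Wait
		}
	}
	return Proceed
}

func main() {
	fmt.Println(decidePreBind([]AttachState{AttachDone, AttachPending}))
	fmt.Println(decidePreBind([]AttachState{AttachDone, AttachDone}))
}
```

Because the function keeps no state of its own, re-running it after a restart simply repeats the wait or lets the pod through.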
291
291
292
+
**The scheduler is not guaranteed to pick the same node for the Pod after a restart**
293
+
294
+
In principle, the scheduler should select the same node; however, we need to consider the following scenarios:
295
+
- In case of a failure, we might want to try a different node.
296
+
- During rescheduling, if another pod is deployed on that node and consumes its resources, the rescheduled pod might no longer fit on that node.
297
+
Therefore, we need logic to prioritize the rescheduled pod on that node.
298
+
299
+
Node nomination would solve this.
300
+
If node nomination is not available, the processing flow is as follows.
301
+
If the pod is assigned to another node after the scheduler restarts, an additional device will be attached to that node.
302
+
If the device attached to the original node is not used, the user can manually detach it.
303
+
(Of course, we can leave it attached to that node for future use by the Pod.)
304
+
305
+
This issue needs to be resolved before the beta is released.
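One way to read "prioritize the rescheduled pod on that node" is as a simple preference rule. The following Go sketch is only an illustration under assumptions: `pickNode` is a hypothetical helper, and it assumes the earlier choice is recorded somewhere that survives the restart (for example via node nomination).

```go
package main

import "fmt"

// pickNode is a hypothetical helper: it prefers the node chosen before the
// restart when that node is still feasible, and otherwise falls back to the
// first feasible node. It returns "" when no node is feasible.
func pickNode(feasible []string, nominated string) string {
	for _, n := range feasible {
		if n == nominated {
			return nominated
		}
	}
	if len(feasible) > 0 {
		return feasible[0]
	}
	return ""
}

func main() {
	// The earlier node is still feasible, so it is kept.
	fmt.Println(pickNode([]string{"node-a", "node-b"}, "node-b"))
	// The earlier node is gone or full, so another node is used and an
	// additional device would be attached there.
	fmt.Println(pickNode([]string{"node-a"}, "node-b"))
}
```

The fallback branch corresponds to the case in which a device may be left attached to the original node.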
306
+
292
307
**Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible to the cluster autoscaler, so there is a risk that the node will be turned down.**
293
308
294
309
Regarding collaboration with the Cluster Autoscaler, using node nomination can address the issue.
@@ -455,7 +470,7 @@ Therefore, we prefer to allocate devices directly connected to the node over att
455
470
456
471
This design aims to efficiently utilize fabric devices, prioritizing node-local devices to improve performance.
457
472
The composable controller manages fabric devices that can be attached and detached.
458
-
Therefore, it publishes a list of fabric devices as `ResourceSlices`.
473
+
Therefore, it publishes a list of free fabric devices as `ResourceSlices`.
459
474
460
475
The structure we are considering is as follows:
461
476
@@ -524,6 +539,24 @@ devices:
524
539
...
525
540
```
526
541
542
+
The Composable DRA controller exposes the list of free devices on the fabric, i.e., devices that are not yet attached to any node, as a ResourceSlice.
543
+
The controller refreshes the ResourceSlice periodically (every 10 seconds).
544
+
This means that the ResourceSlice reflects the latest list of devices on the fabric.
545
+
It does not detect attach or detach operations on nodes and update the ResourceSlice immediately, for example in event handlers.
546
+
This is because it is difficult for a Composable DRA controller running on Kubernetes to cover every case in which a ResourceSlice needs to be updated, such as when a new device is physically added to the fabric.
547
+
We also expect vendor DRA drivers to periodically update the list of devices connected to the node as a ResourceSlice. This requires the rescan function to be run periodically.
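The polling approach described above can be sketched as a ticker loop in Go. This is a minimal sketch, not the controller's implementation: `Refresher`, `List`, `Publish`, and `demoRefresh` are hypothetical names, and publishing is reduced to a callback instead of a real ResourceSlice update.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Refresher re-lists the free fabric devices on a fixed interval and
// publishes the full list each time, instead of reacting to events.
// List and Publish are hypothetical hooks standing in for the fabric
// query and the ResourceSlice update.
type Refresher struct {
	Interval time.Duration // must be > 0; the text above uses 10 seconds
	List     func() []string
	Publish  func(devices []string)
}

// Run publishes once immediately and then once per tick until ctx is done.
func (r Refresher) Run(ctx context.Context) {
	ticker := time.NewTicker(r.Interval)
	defer ticker.Stop()
	for ctx.Err() == nil {
		r.Publish(r.List())
		select {
		case <-ctx.Done():
		case <-ticker.C:
		}
	}
}

// demoRefresh runs the loop with a short interval and stops it after the
// given number of publishes (rounds must be >= 1). It returns the count.
func demoRefresh(rounds int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	published := 0
	r := Refresher{
		Interval: time.Millisecond,
		List:     func() []string { return []string{"model-a-0"} },
		Publish: func([]string) {
			published++
			if published == rounds {
				cancel()
			}
		},
	}
	r.Run(ctx)
	return published
}

func main() {
	fmt.Println(demoRefresh(3))
}
```

Re-listing everything on each tick also picks up changes that no in-cluster event would report, such as a device physically added to the fabric.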
548
+
549
+
Each device in a composable ResourceSlice has a unique device name.
550
+
However, that device name does not identify a specific physical device (unlike, for example, a UUID).
551
+
In a composable system, users attach devices by specifying the model name and the number of devices they need.
552
+
Until a device is actually attached to the node, the user does not know which individual device will be attached.
553
+
554
+
For this reason, the Composable DRA controller exposes a separate Pool and ResourceSlice for each model.
555
+
The devices in the Pool for each model have unique device names, but they essentially only indicate how many devices of that model are in the ResourceSlice.
556
+
The Composable DRA controller also adds the model name and similar information to the attributes of each device.
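The per-model layout can be illustrated with a small Go sketch. The `Device` and `Pool` types here are simplified stand-ins, not the `resource.k8s.io` API types, and the `<model>-<index>` naming scheme and the `model` attribute key are assumptions made only for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Device is a simplified stand-in for a ResourceSlice device entry.
type Device struct {
	Name       string            // unique within the pool, but not a hardware ID
	Attributes map[string]string // carries the model name, among other things
}

// Pool is a simplified stand-in for one per-model pool.
type Pool struct {
	Name    string
	Devices []Device
}

// buildPools turns "model -> number of free devices" into one pool per
// model. The device names are placeholders that only encode a count; the
// "<model>-<index>" scheme is an assumption of this sketch.
func buildPools(freeByModel map[string]int) []Pool {
	models := make([]string, 0, len(freeByModel))
	for m := range freeByModel {
		models = append(models, m)
	}
	sort.Strings(models) // stable output order

	pools := make([]Pool, 0, len(models))
	for _, m := range models {
		p := Pool{Name: m}
		for i := 0; i < freeByModel[m]; i++ {
			p.Devices = append(p.Devices, Device{
				Name:       fmt.Sprintf("%s-%d", m, i),
				Attributes: map[string]string{"model": m},
			})
		}
		pools = append(pools, p)
	}
	return pools
}

func main() {
	for _, p := range buildPools(map[string]int{"model-b": 1, "model-a": 2}) {
		fmt.Println(p.Name, len(p.Devices))
	}
}
```

A request for two devices of `model-a` can then be matched against the pool's entries without naming any particular physical device.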
Instead of implementing the solution within the scheduler, we propose using the Cluster Autoscaler to manage the attachment and detachment of fabric devices.
529
562
@@ -623,62 +656,25 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
623
656
624
657
### Graduation Criteria
625
658
626
-
<!--
627
-
**Note:** *Not required until targeted at a release.*
628
-
629
-
Define graduation milestones.
630
-
631
-
These may be defined in terms of API maturity, [feature gate] graduations, or as
632
-
something else. The KEP should keep this high-level with a focus on what
633
-
signals will be looked at to determine graduation.
634
-
635
-
Consider the following in developing the graduation criteria for this enhancement:
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
652
-
653
659
#### Alpha
654
660
655
-
- Feature implemented behind a feature flag
656
-
- Initial e2e tests completed and enabled
661
+
- Initial implementation is completed and enabled
657
662
658
663
#### Beta
659
664
660
665
- Gather feedback from developers and surveys
661
-
- Complete features A, B, C
666
+
- Resolve the following issues
667
+
- The scheduler is not guaranteed to pick the same node for the Pod after a restart
668
+
- Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible to the cluster autoscaler, so there is a risk that the node will be turned down
669
+
- The in-flight events cache may grow too large when waiting in PreBind
662
670
- Additional tests are in Testgrid and linked in KEP
663
671
664
672
#### GA
665
673
666
-
- N examples of real-world usage
667
-
- N installs
668
-
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
669
-
- Allowing time for feedback
670
-
671
-
**Note:** Generally we also wait at least two releases between beta and
672
-
GA/stable, because there's no opportunity for user feedback, or even bug reports,
673
-
in back-to-back releases.
674
-
675
-
**For non-optional features moving to GA, the graduation criteria must include