Commit 519667a

authored
Update README.md
1 parent ca2ef0a commit 519667a

File tree

1 file changed

+41
-45
lines changed
  • keps/sig-scheduling/5007-device-attach-before-pod-scheduled

keps/sig-scheduling/5007-device-attach-before-pod-scheduled/README.md

Lines changed: 41 additions & 45 deletions
Original file line numberDiff line numberDiff line change
@@ -289,6 +289,21 @@ The scheduler's restart should not pose an issue, as the decision to wait is bas
289289
After a scheduler restart, if the device attachment is not yet complete, the scheduler will wait again at PreBind.
290290
If the attachment is complete, it will pass through PreBind.
291291

292+
**Scheduler does not guarantee to pick up the same node for the Pod after the restart**
293+
294+
In principle, the scheduler should select the same node; however, we need to consider the following scenarios:
295+
- In case of a failure, we might want to try a different node.
296+
- During rescheduling, if another pod is deployed on that node and consumes its resources, the rescheduled pod might no longer fit on that node.
297+
Therefore, we need logic to prioritize the rescheduled pod on that node.
298+
299+
Node nomination would solve this.
300+
If node nomination is not available, the processing flow is as follows.
301+
If the pod is assigned to another node after the scheduler restarts, an additional device will be attached to that node.
302+
If the device attached to the original node is not used, the user can manually detach it.
303+
(Of course, we can leave it attached to that node for future use by the Pod.)
304+
305+
This issue needs to be resolved before the beta is released.
306+
292307
**Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down.**
293308

294309
Regarding collaboration with the Cluster Autoscaler, using node nomination can address the issue.
@@ -455,7 +470,7 @@ Therefore, we prefer to allocate devices directly connected to the node over att
455470

456471
This design aims to efficiently utilize fabric devices, prioritizing node-local devices to improve performance.
457472
The composable controller manages fabric devices that can be attached and detached.
458-
Therefore, it publishes a list of fabric devices as `ResourceSlices`.
473+
Therefore, it publishes a list of free fabric devices as `ResourceSlices`.
459474

460475
The structure we are considering is as follows:
461476

@@ -524,6 +539,24 @@ devices:
524539
...
525540
```
526541

542+
The Composable DRA controller exposes the list of free fabric devices, that is, devices not yet attached to a node, as a ResourceSlice.
543+
The controller refreshes the ResourceSlice periodically (every 10 seconds).
544+
This means that it reflects the latest list of devices on the fabric.
545+
It does not detect attachments to or detachments from nodes and update the ResourceSlice immediately (for example, in event handlers).
546+
This is because it is difficult for a Composable DRA controller running on Kubernetes to cover all cases where a ResourceSlice needs to be updated, such as when a new device is physically added to the fabric.
547+
We also expect vendor DRAs to periodically update the list of devices connected to the node as a ResourceSlice. This requires the rescan function to be run periodically.
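
The periodic refresh described above can be illustrated with a minimal polling sketch. This is not the controller's actual implementation: `PeriodicPublisher`, `build_free_device_slice`, the `attachedNode` key, and the callback signatures are all hypothetical names chosen for illustration. Only the 10-second interval and the "free devices only" rule come from the text.

```python
import threading
from typing import Callable, Dict, List

# Hypothetical device record: {"name": ..., "model": ..., "attachedNode": ...}
Device = Dict[str, str]


def build_free_device_slice(fabric_devices: List[Device]) -> List[Device]:
    """Keep only devices that are not attached to any node."""
    return [d for d in fabric_devices if not d.get("attachedNode")]


class PeriodicPublisher:
    """Re-lists the fabric on a fixed interval instead of reacting to events."""

    def __init__(
        self,
        list_fabric_devices: Callable[[], List[Device]],
        publish: Callable[[List[Device]], None],
        interval_seconds: float = 10.0,  # interval stated in the text
    ) -> None:
        self._list = list_fabric_devices
        self._publish = publish
        self._interval = interval_seconds

    def refresh_once(self) -> List[Device]:
        """Take a fresh snapshot of the fabric and publish the free devices."""
        free = build_free_device_slice(self._list())
        self._publish(free)
        return free

    def run(self, stop: threading.Event) -> None:
        """Refresh until `stop` is set; wait() doubles as an interruptible sleep."""
        while not stop.is_set():
            self.refresh_once()
            stop.wait(self._interval)
```

Because every cycle rebuilds the list from a full snapshot, a device that was physically added to the fabric between two cycles appears in the next publication without any dedicated event handler, at the cost of up to one interval of staleness.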
548+
549+
Each device in a composable ResourceSlice has a unique device name.
550+
However, the device name does not identify a specific physical device (unlike, for example, a UUID).
551+
In a composable system, users attach devices by specifying the model name and the number of devices they need.
552+
Until the device is actually attached to the node, the user does not know which specific physical device will be attached.
553+
554+
Because of this concept, the Pool and ResourceSlice exposed by the Composable DRA controller are separate for each model.
555+
The devices in the Pool for each model have unique device names, but they essentially only indicate how many devices of that model are in the ResourceSlice.
556+
The Composable DRA controller also adds the model name and other information to the attributes of each device.
557+
558+
![composable-resourceslice](composable-resourceslice.png)
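
As a rough illustration of the per-model layout, the sketch below groups free devices into one pool per model and gives each entry a unique placeholder name plus a `model` attribute. The function name `build_model_pools`, the `<model>-<index>` naming scheme, and the dictionary layout are assumptions made for this example; they are not taken from the KEP or from the ResourceSlice API.

```python
from collections import Counter
from typing import Dict, List


def build_model_pools(free_models: List[str]) -> Dict[str, List[dict]]:
    """Build one pool per model from the models of the free fabric devices.

    Each entry gets a name that is unique within its pool but is only a
    placeholder: it says "one more device of this model is free", not which
    physical device will eventually be attached.
    """
    counts = Counter(free_models)
    pools: Dict[str, List[dict]] = {}
    for model in sorted(counts):
        pools[model] = [
            # Hypothetical naming scheme: "<model>-<index>"
            {"name": f"{model}-{i}", "attributes": {"model": model}}
            for i in range(counts[model])
        ]
    return pools
```

With this shape, the number of entries in a pool is the only information a consumer can rely on besides the attributes, which matches the idea that users request devices by model and count.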
559+
527560
### Alternative approach
528561
Instead of implementing the solution within the scheduler, we propose using the Cluster Autoscaler to manage the attachment and detachment of fabric devices.
529562

@@ -623,62 +656,25 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
623656

624657
### Graduation Criteria
625658

626-
<!--
627-
**Note:** *Not required until targeted at a release.*
628-
629-
Define graduation milestones.
630-
631-
These may be defined in terms of API maturity, [feature gate] graduations, or as
632-
something else. The KEP should keep this high-level with a focus on what
633-
signals will be looked at to determine graduation.
634-
635-
Consider the following in developing the graduation criteria for this enhancement:
636-
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
637-
- [Feature gate][feature gate] lifecycle
638-
- [Deprecation policy][deprecation-policy]
639-
640-
Clearly define what graduation means by either linking to the [API doc
641-
definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning)
642-
or by redefining what graduation means.
643-
644-
In general we try to use the same stages (alpha, beta, GA), regardless of how the
645-
functionality is accessed.
646-
647-
[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
648-
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
649-
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
650-
651-
Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
652-
653659
#### Alpha
654660

655-
- Feature implemented behind a feature flag
656-
- Initial e2e tests completed and enabled
661+
- Initial implementation is completed and enabled
657662

658663
#### Beta
659664

660665
- Gather feedback from developers and surveys
661-
- Complete features A, B, C
666+
- Resolve the following issues
667+
  - Scheduler does not guarantee to pick up the same node for the Pod after the restart
668+
  - Pods which are not bound yet (in api-server) and not unschedulable (in api-server) are not visible by cluster autoscaler, so there is a risk that the node will be turned down
669+
  - The in-flight events cache may grow too large when waiting in PreBind
662670
- Additional tests are in Testgrid and linked in KEP
663671

664672
#### GA
665673

666-
- N examples of real-world usage
667-
- N installs
668-
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
669-
- Allowing time for feedback
670-
671-
**Note:** Generally we also wait at least two releases between beta and
672-
GA/stable, because there's no opportunity for user feedback, or even bug reports,
673-
in back-to-back releases.
674-
675-
**For non-optional features moving to GA, the graduation criteria must include
676-
[conformance tests].**
677-
678-
[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
674+
TBD
679675

680676
#### Deprecation
681-
677+
<!--
682678
- Announce deprecation and support policy of the existing flag
683679
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
684680
- Address feedback on usage/changed behavior, provided on GitHub issues
