Commit 7c58919

Addressed more comments
1 parent 2c98d9d commit 7c58919

1 file changed: +50 -30

keps/sig-scheduling/5194-reserved-for-workloads/README.md

@@ -336,7 +336,7 @@ references a `ResourceClaim` with an empty `status.ReservedFor` list, it knows t
 is the first pod that will be consuming the claim.

 If the `spec.ReservedFor` field in the ResourceClaim is not set, the scheduler will handle
-the `ResourceClaim` in the same was as now, and will add the `Pod` to the `ReservedFor` list
+the `ResourceClaim` in the same way as now, and will add the `Pod` to the `ReservedFor` list
 if devices could be allocated for the claim. Any additional pods that reference the `ResourceClaim`
 will also be added to the list.
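To make the reservation rule in the hunk above concrete, here is a minimal sketch in Go. All types here, including the `SpecReservedFor` field, are simplified stand-ins for what this KEP proposes, not the actual `k8s.io/api/resource` types.

```go
package main

import "fmt"

// ConsumerReference loosely mirrors ResourceClaimConsumerReference; it and
// every other type here is a simplified stand-in for the API in this KEP.
type ConsumerReference struct {
	Resource string // e.g. "pods" or a workload resource such as "statefulsets"
	Name     string
}

// ResourceClaim models only the two fields this KEP talks about.
type ResourceClaim struct {
	SpecReservedFor   *ConsumerReference  // proposed spec.ReservedFor field
	StatusReservedFor []ConsumerReference // existing status.ReservedFor list
}

// reserve applies the rule from the hunk above: if spec.ReservedFor is set,
// the single workload reference is placed in status.ReservedFor once;
// otherwise the scheduler behaves as today and appends each pod.
func reserve(claim *ResourceClaim, pod ConsumerReference) {
	if claim.SpecReservedFor != nil {
		if len(claim.StatusReservedFor) == 0 {
			claim.StatusReservedFor = []ConsumerReference{*claim.SpecReservedFor}
		}
		return // individual pods are intentionally not tracked
	}
	claim.StatusReservedFor = append(claim.StatusReservedFor, pod)
}

func main() {
	claim := &ResourceClaim{SpecReservedFor: &ConsumerReference{Resource: "statefulsets", Name: "db"}}
	reserve(claim, ConsumerReference{Resource: "pods", Name: "db-0"})
	reserve(claim, ConsumerReference{Resource: "pods", Name: "db-1"})
	fmt.Println(claim.StatusReservedFor) // prints [{statefulsets db}]: one entry regardless of pod count
}
```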

@@ -348,24 +348,24 @@ list, it will not add a reference to the pod.
 
 ##### Deallocation
 The resourceclaim controller will remove Pod references from the `ReservedFor` list just
-like it does now using the same logic. For non-Pod references, the controller will recognize
-a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
-remove the reference from the list when those resources are removed. For other types,
-it will be the responsibility of the workload controller/user that created the `ResourceClaim`
-to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
+like it does now using the same logic. But for non-Pod references, it
+will be the responsibility of the controller/user that created the `ResourceClaim` to
+remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
 are consuming the `ResourceClaim` and no new pods will be created that reference
 the `ResourceClaim`.
 
 The resourceclaim controller will then discover that the `ReservedFor` list is empty
 and therefore know that it is safe to deallocate the `ResourceClaim`.
 
-This requires that the resourceclaim controller watches the workload types that will
-be supported. For other types of workloads, there will be a requirement that the workload
-controller has permissions to update the status subresource of the `ResourceClaim`. The
-resourceclaim controller will also try to detect if an unknown resource referenced in the
-`ReservedFor` list has been deleted from the cluster, but that requires that the controller
-has permissions to get or list resources of the type. If the resourceclaim controller is
-not able to check, it will just wait until the reference in the `ReservedFor` list is removed.
+This requires that the controller/user has permissions to update the status
+subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
+the resource referenced in the `ReservedFor` list has been deleted from the cluster, but
+that requires that the controller has permissions to get or list resources of the type. If the
+resourceclaim controller is not able to check, it will just wait until the reference in
+the `ReservedFor` list is removed. The resourceclaim controller will not have a watch
+on the workload resource, so there is no guarantee that the controller will realize that
+the resource has been deleted. This is an extra check since it is the responsibility of
+the workload controller to update the claim.
 
 ##### Finding pods using a ResourceClaim
 If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer
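Before the next hunk: a rough sketch of the deallocation flow described above, reusing the illustrative `ResourceClaim` and `ConsumerReference` types from the earlier snippet. `workloadExists` and `deallocate` are hypothetical stand-ins for the controller's best-effort lookup and for clearing the allocation.

```go
// maybeDeallocate mirrors the flow above: optionally prune a reference whose
// workload is provably gone, then deallocate once ReservedFor is empty.
func maybeDeallocate(claim *ResourceClaim) {
	kept := claim.StatusReservedFor[:0]
	for _, ref := range claim.StatusReservedFor {
		// Best effort: drop the reference only if we had permission to look
		// the resource up and it is definitely gone. On error (e.g. RBAC
		// forbids get/list), keep it and wait for the workload controller.
		if exists, err := workloadExists(ref); err == nil && !exists {
			continue
		}
		kept = append(kept, ref)
	}
	claim.StatusReservedFor = kept
	if len(claim.StatusReservedFor) == 0 {
		deallocate(claim) // safe: no consumers remain
	}
}

// Hypothetical stand-ins so the sketch is self-contained.
func workloadExists(ref ConsumerReference) (bool, error) { return true, nil }
func deallocate(claim *ResourceClaim)                     {}
```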
@@ -386,7 +386,7 @@ but haven't yet been scheduled. This distinction is important for some of the us
 1. If the DRA scheduler plugin is trying to find candidates for deallocation in
 the `PostFilter` function and sees a `ResourceClaim` with a non-Pod reference, it will not
 attempt to deallocate. The plugin has no way to know how many pods are actually consuming
-the `ResourceClaim` without the explit list in the `ReservedFor` list and therefore it will
+the `ResourceClaim` without the explicit list in the `ReservedFor` list and therefore it will
 not be safe to deallocate.
 
 1. The device_taint_eviction controller will use the list of pods referencing the `ResourceClaim`
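A small illustration of the `PostFilter` guard in the first item above, again reusing the sketch types; `isPodRef` is an assumed helper, not a real API.

```go
// isPodRef is an illustrative helper; a real check would compare the
// reference's group/resource against the Pod type.
func isPodRef(ref ConsumerReference) bool { return ref.Resource == "pods" }

// deallocationCandidate mirrors the guard described above: a claim reserved
// for a non-Pod resource is never a candidate, because the plugin cannot
// know how many pods are actually consuming it.
func deallocationCandidate(claim *ResourceClaim) bool {
	for _, ref := range claim.StatusReservedFor {
		if !isPodRef(ref) {
			return false
		}
	}
	return true
}
```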
@@ -455,7 +455,10 @@ Additional e2e tests will be added to verify the behavior added in this KEP.
 - All security enforcement completed
 - All monitoring requirements completed
 - All testing requirements completed
-- All known pre-release issues and gaps resolved
+- All known pre-release issues and gaps resolved
+- Revisit whether the responsibility of removing the workload resource reference from
+the `ReservedFor` list should be with the workload controller (as proposed in this design)
+or be handled by the resourceclaim controller.
 
 #### GA
 
@@ -473,6 +476,19 @@ The API server will no longer accept the new fields and the other components wil
 not know what to do with them. So the result is that the `ReservedFor` list will only
 have references to pod resources like today.
 
+Any ResourceClaims that have already been allocated when the feature was active will
+have non-pod references in the `ReservedFor` list after a downgrade, but the controllers
+will not know how to handle them. There are two problems that will arise as a result of
+this:
+- The workload controller will also have been downgraded if it is in-tree, meaning that
+it will not remove the reference to the workload resource from the `ReservedFor` list, thus
+leading to a situation where the claim will never be deallocated.
+- For new pods that get scheduled, the scheduler will add pod references to the
+`ReservedFor` list, despite a non-pod reference already being present. So it ends up with
+both pod and non-pod references in the list. We need to make sure the system can
+handle this, as it might also happen as a result of disablement and re-enablement
+of the feature.
+
 ### Version Skew Strategy
 
 If the kubelet is on a version that doesn't support the feature but the rest of the
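Backing up to the downgrade hunk above: a minimal sketch of what handling such a mixed list could look like, reusing the earlier sketch types and `isPodRef`. `podIsGone` is a hypothetical stand-in for today's pod liveness check.

```go
// pruneReservedFor tolerates a mixed ReservedFor list: pod references are
// pruned with today's logic, while non-pod references are left for the
// workload controller (or user) to remove.
func pruneReservedFor(claim *ResourceClaim, podIsGone func(ConsumerReference) bool) {
	kept := claim.StatusReservedFor[:0]
	for _, ref := range claim.StatusReservedFor {
		if isPodRef(ref) && podIsGone(ref) {
			continue // pod terminated or was deleted: drop its reference
		}
		kept = append(kept, ref)
	}
	claim.StatusReservedFor = kept
}
```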
@@ -482,10 +498,9 @@ since it will still check whether the `Pod` is references in the `ReservedFor` l
 If the API server is on a version that supports the feature, but the scheduler
 is not, the scheduler will not know about the new fields added, so it will
 put the reference to the `Pod` in the `ReservedFor` list rather than the reference
-in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
-it will be subject to the 256 limit on the size of the `ReservedFor` list and the
-controller creating the `ResourceClaim` will not find the reference it expects
-in the `ReservedFor` list when it tries to remove it.
+in the `spec.ReservedFor` list. It will do this even if there is already a non-pod
+reference in the `spec.ReservedFor` list. This leads to the challenge described
+in the previous section.
 
 ## Production Readiness Review Questionnaire
 
@@ -543,18 +558,21 @@ No
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
-Applications that were already running will continue to run and the allocated
-devices will remain so.
-For the resource types supported directly, the resource claim controller will not remove the
-reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload
-controller is responsible for removing the reference, deallocation will work as long as the
-feature isn't also disabled in the controllers. If they are, deallocation will not happen in this
-situation either.
+Applications that were already running will continue to run. But if a pod has to be
+re-admitted by a kubelet where the feature has been disabled, it will not be admitted, since
+the kubelet will not find a reference to the pod in the `ReservedFor` list.
+
+The feature will also be disabled for in-tree workload controllers, meaning that they will
+not remove the reference to the workload resource from the `ReservedFor` list. This means the list will never
+be empty and the resourceclaim controller will never deallocate the claim.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
 It will take effect again and will impact how the `ReservedFor` field is used during allocation
-and deallocation.
+and deallocation. Since this scenario allows a ResourceClaim with the `spec.ReservedFor` field
+to be set and then have the scheduler populate the `ReservedFor` list with pods when the feature
+is disabled, we will end up in a situation where the `ReservedFor` list can contain both non-pod
+and pod references. We need to make sure all components can handle that.
 
 ###### Are there any tests for feature enablement/disablement?
 
@@ -723,7 +741,10 @@ No
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
-No
+Yes and no. We are adding two new fields to the ResourceClaim type, but neither is of a collection type,
+so they should have limited impact on the total size of the objects. However, this feature means that
+we no longer need to keep a complete list of all pods using a ResourceClaim, which can significantly
+reduce the size of ResourceClaim objects shared by many pods.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -732,8 +753,7 @@ No
 
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 It might require some additional memory usage in the resourceclaim controller since it will need to keep an index
-of ResourceClaim to Pods. The resourceclaim controller will also have watches for Deployments, StatefulSets, and
-Jobs which might also cause a slight increase in memory usage.
+of ResourceClaim to Pods.
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
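For the ResourceClaim-to-Pods index mentioned in the hunk above, a minimal sketch of the data structure; `PodInfo` and the keying scheme are assumptions, and a real controller would maintain this from a pod informer.

```go
// ClaimIndex maps a claim's "namespace/name" key to the names of pods that
// reference it. Illustrative only; not an existing controller type.
type ClaimIndex map[string][]string

// PodInfo is a hypothetical, pared-down view of a pod.
type PodInfo struct {
	Namespace, Name string
	ClaimNames      []string // claim names resolved from pod.spec.resourceClaims
}

// addPod indexes every claim the pod references.
func (ix ClaimIndex) addPod(p PodInfo) {
	for _, claim := range p.ClaimNames {
		key := p.Namespace + "/" + claim
		ix[key] = append(ix[key], p.Name)
	}
}
```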
