
Commit 2c98d9d

Addressed comments
1 parent 61ec86f commit 2c98d9d

File tree

2 files changed: +35 -17 lines changed

  • keps
Lines changed: 1 addition & 1 deletion

@@ -1,3 +1,3 @@
 kep-number: 5194
 alpha:
-  approver: "@johnbelamaric"
+  approver: "@wojtek-t"

keps/sig-scheduling/5194-reserved-for-workloads/README.md

Lines changed: 34 additions & 16 deletions
@@ -177,7 +177,7 @@ KEP, the hard limit on the number of pods will be removed.
 Training workloads that uses TPUs can be very large, requiring over 9,000
 TPUs for a single training job. The number of TPUs for each node is usually 4, meaning
 that the job will run across more than 2,000 nodes. Due to topology constraints, TPU slices
-are usually modelled in DRA as multi-host devices, meaning that a single DRA device
+are usually modeled in DRA as multi-host devices, meaning that a single DRA device
 can represent thousands of TPUs. As a result, all pods running the workload will
 therefore share a single ResourceClaim. The current limit of 256 pods sharing a
 ResourceClaim is therefore too low.
@@ -237,7 +237,7 @@ provide guidance as to what is a safe number.
 The `ReservedFor` field on the `ResourceClaimStatus` is currently used for two purposes:
 
 #### Deallocation
-Devices are allocated to `ResourceClaim`s when the first pod referencing the claim is
+Devices are allocated to a `ResourceClaim` when the first pod referencing the claim is
 scheduled. Other pods can also share the `ResourceClaim` in which case they share the
 devices. Once no pods are consuming the claim, the devices should be deallocated to they
 can be allocted to other claims. The `ReservedFor` list is used to keep track of pods
@@ -256,7 +256,7 @@ controller to find pods that are using the ResourceClaim:
 1. The DRA scheduler plugin uses the list to find claims that have zero or only
 a single pod using it, and is therefore a candidate for deallocation in the `PostFilter` function.
 
-1. The device_taint_eviction controller uses the `ReservedFor` list to find the pods that needs to be evicted
+1. The device_taint_eviction controller uses the `ReservedFor` list to find the pods that need to be evicted
 when one or more of the devices allocated to a ResourceClaim is tainted (and the ResourceClaim
 does not have a toleration).
 
@@ -284,6 +284,9 @@ type ResourceClaimSpec struct {
 // ResourceClaim to remove the reference from the status.ReservedFor list when there
 // are no longer any pods consuming the claim.
 //
+// Most user-created ResourceClaims should not set this field. It is more typically
+// used by ResourceClaims created and managed by controllers.
+//
 // +featureGate=DRAReservedForWorkloads
 // +optional
 ReservedFor *ResourceClaimConsumerReference
@@ -303,7 +306,7 @@ type AllocationResult struct {
 }
 ```
 
-The `ResourceClaimConsumerReference type already exists:
+The `ResourceClaimConsumerReference` type already exists:
 
 ```go
 // ResourceClaimConsumerReference contains enough information to let you
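
As context for the two types in the hunks above, here is a minimal sketch of how a workload controller might populate the proposed `spec.ReservedFor` field when it creates a ResourceClaim that all pods of a Job share. The types below are local stand-ins mirroring only the fields visible in the diff, not imports of the real `k8s.io/api` package, and the Job name and UID are placeholders.

```go
// Illustrative only: local stand-ins for the API types shown above.
package main

import "fmt"

// ResourceClaimConsumerReference mirrors the existing consumer reference type.
type ResourceClaimConsumerReference struct {
	APIGroup string
	Resource string
	Name     string
	UID      string
}

// ResourceClaimSpec mirrors the spec with the proposed ReservedFor field.
type ResourceClaimSpec struct {
	ReservedFor *ResourceClaimConsumerReference
}

func main() {
	// A workload controller creating a claim shared by all pods of a Job could
	// record the Job itself as the consumer (name and UID are placeholders).
	claim := ResourceClaimSpec{
		ReservedFor: &ResourceClaimConsumerReference{
			APIGroup: "batch",
			Resource: "jobs",
			Name:     "tpu-training",
			UID:      "0000-placeholder",
		},
	}
	fmt.Printf("claim reserved for %s/%s %q\n",
		claim.ReservedFor.APIGroup, claim.ReservedFor.Resource, claim.ReservedFor.Name)
}
```

With the consumer recorded once in the spec, the `status.ReservedFor` list no longer needs one entry per pod, which is what lifts the 256-pod ceiling discussed earlier in the KEP.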
@@ -345,21 +348,24 @@ list, it will not add a reference to the pod.
 
 ##### Deallocation
 The resourceclaim controller will remove Pod references from the `ReservedFor` list just
-like it does now, but it will never remove references to non-Pod resources. Instead, it
-will be the responsibility of the controller/user that created the `ResourceClaim` to
-remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
+like it does now using the same logic. For non-Pod references, the controller will recognize
+a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
+remove the reference from the list when those resources are removed. For other types,
+it will be the responsibility of the workload controller/user that created the `ResourceClaim`
+to remove the reference to the non-Pod resource from the `ReservedFor` list when no pods
 are consuming the `ResourceClaim` and no new pods will be created that references
 the `ResourceClaim`.
 
 The resourceclaim controller will then discover that the `ReservedFor` list is empty
 and therefore know that it is safe to deallocate the `ResourceClaim`.
 
-This requires that the controller/user has permissions to update the status
-subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
-the resource referenced in the `ReservedFor` list has been deleted from the cluster, but
-that requires that the controller has permissions to get or list resources of the type. If the
-resourceclaim controller is not able to check, it will just wait until the reference in
-the `ReservedFor` list is removed.
+This requires that the resourceclaim controller watches the workload types that will
+be supported. For other types of workloads, there will be a requirement that the workload
+controller has permissions to update the status subresource of the `ResourceClaim`. The
+resourceclaim controller will also try to detect if an unknown resource referenced in the
+`ReservedFor` list has been deleted from the cluster, but that requires that the controller
+has permissions to get or list resources of the type. If the resourceclaim controller is
+not able to check, it will just wait until the reference in the `ReservedFor` list is removed.
 
 ##### Finding pods using a ResourceClaim
 If the reference in the `ReservedFor` list is to a non-Pod resource, controllers can no longer
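
As a rough illustration of the responsibility split described in the hunk above, the sketch below checks whether a non-Pod `ReservedFor` reference points at one of the built-in workload types the resourceclaim controller would clean up itself (Deployment, StatefulSet, Job); anything else is left to the workload controller/user that set the reference. The helper name, local types, and example references are hypothetical, not actual controller code.

```go
// Hedged sketch of the cleanup decision: built-in workload types are handled
// by the resourceclaim controller, everything else by whoever set the reference.
package main

import "fmt"

type ConsumerRef struct {
	APIGroup string
	Resource string
	Name     string
}

// builtinWorkloads lists the types the controller watches and cleans up itself.
var builtinWorkloads = map[[2]string]bool{
	{"apps", "deployments"}:  true,
	{"apps", "statefulsets"}: true,
	{"batch", "jobs"}:        true,
}

// handledByResourceClaimController reports whether the controller itself will
// drop the reference once the workload is gone; otherwise the workload
// controller/user that created the claim is expected to remove it.
func handledByResourceClaimController(ref ConsumerRef) bool {
	return builtinWorkloads[[2]string{ref.APIGroup, ref.Resource}]
}

func main() {
	refs := []ConsumerRef{
		{"batch", "jobs", "tpu-training"},
		{"example.com", "trainingjobs", "run-42"}, // hypothetical custom workload
	}
	for _, r := range refs {
		fmt.Printf("%s/%s %q handled by resourceclaim controller: %v\n",
			r.APIGroup, r.Resource, r.Name, handledByResourceClaimController(r))
	}
}
```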
@@ -444,11 +450,17 @@ Additional e2e tests will be added to verify the behavior added in this KEP.
 - Gather feedback from developers and surveys
 - Additional tests are in Testgrid and linked in KEP
 - Performance impact of the feature has been measured and found to be acceptable
+- More rigorous forms of testing—e.g., downgrade tests and scalability tests
+- All functionality completed
+- All security enforcement completed
+- All monitoring requirements completed
+- All testing requirements completed
+- All known pre-release issues and gaps resolved
 
 #### GA
 
-- 3 examples of real-world usage
 - Allowing time for feedback
+- All issues and gaps identified as feedback during beta are resolved
 - [conformance tests]
 
 [conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
@@ -531,8 +543,13 @@ No
 
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
 
-Yes. Applications that were already running will continue to run and the allocated
+Applications that were already running will continue to run and the allocated
 devices will remain so.
+For the resource types supported directly, the resource claim controller will not remove the
+reference in the `ReservedFor` list, meaning the devices will not be deallocated. If the workload
+controller is responsible for removing the reference, deallocation will work as long as the
+feature isn't also disabled in the controllers. If they are, deallocation will not happen in this
+situation either.
 
 ###### What happens if we reenable the feature if it was previously rolled back?
 
@@ -715,7 +732,8 @@ No
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 It might require some additional memory usage in the resourceclaim controller since it will need to keep an index
-of ResourceClaim to Pods.
+of ResourceClaim to Pods. The resourceclaim controller will also have watches for Deployments, StatefulSets, and
+Jobs which might also cause a slight increase in memory usage.
 
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
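
The extra watches mentioned in the hunk above would typically be shared informers for Deployments, StatefulSets, and Jobs; each informer keeps a local cache of the watched objects, which is the memory cost being discussed. The sketch below uses standard client-go; the in-cluster config and resync period are illustrative choices, not taken from the actual kube-controller-manager wiring.

```go
// Rough sketch of adding watches for Deployments, StatefulSets, and Jobs
// using client-go shared informers.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster configuration; the real controller wires this up differently.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)

	// Each informer maintains a cache of the watched objects.
	deploymentInformer := factory.Apps().V1().Deployments().Informer()
	statefulSetInformer := factory.Apps().V1().StatefulSets().Informer()
	jobInformer := factory.Batch().V1().Jobs().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	fmt.Println("synced:",
		deploymentInformer.HasSynced(),
		statefulSetInformer.HasSynced(),
		jobInformer.HasSynced())
}
```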
