Commit ab72f61

Merge pull request #4709 from pohly/dra-update-1.31
KEP-4381: DRA update for 1.31
2 parents d673b64 + c010dff commit ab72f61

2 files changed: +1038 -973 lines changed

keps/sig-node/3063-dynamic-resource-allocation/README.md

Lines changed: 77 additions & 140 deletions
@@ -79,10 +79,10 @@ SIG Architecture for cross-cutting KEPs).
7979
- [Combined setup of different hardware functions](#combined-setup-of-different-hardware-functions)
8080
- [Notes/Constraints/Caveats](#notesconstraintscaveats)
8181
- [Design Details](#design-details)
82-
- [ResourceClass extension](#resourceclass-extension)
8382
- [ResourceClaim extension](#resourceclaim-extension)
8483
- [ResourceClaimStatus extension](#resourceclaimstatus-extension)
85-
- [ResourceHandle extensions](#resourcehandle-extensions)
84+
- [DeviceClass extensions](#deviceclass-extensions)
85+
- [Custom parameters and results](#custom-parameters-and-results)
8686
- [PodSchedulingContext](#podschedulingcontext)
8787
- [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler)
8888
- [Resource allocation and usage flow](#resource-allocation-and-usage-flow)
@@ -247,110 +247,29 @@ so this is not a major concern.
247247

248248
## Design Details
249249

250-
### ResourceClass extension
250+
### ResourceClaim extension
251251

252-
An optional field in ResourceClass enables using the DRA driver's control
253-
plane controller:
252+
When allocation through a DRA driver is required, users have to ask for it by
253+
specifying the name of the driver which should handle the allocation.
254254

255-
```go
256-
type ResourceClass struct {
255+
```
256+
type ResourceClaimSpec struct {
257257
...
258258
259-
// ControllerName defines the name of the dynamic resource driver that is
260-
// used for allocation of a ResourceClaim that uses this class. If empty,
261-
// structured parameters are used for allocating claims using this class.
259+
// ControllerName defines the name of the DRA driver that is meant
260+
// to handle allocation of this claim. If empty, allocation is handled
261+
// by the scheduler while scheduling a pod.
262262
//
263-
// Resource drivers have a unique name in forward domain order
264-
// (acme.example.com).
263+
// Must be a DNS subdomain and should end with a DNS domain owned by the
264+
// vendor of the driver.
265265
//
266266
// This is an alpha field and requires enabling the DRAControlPlaneController
267267
// feature gate.
268268
//
269269
// +optional
270270
ControllerName string
271-
}
272-
```
273-
274-
### ResourceClaim extension
275-
276-
With structured parameters, allocation always happens only when a pod needs a
277-
ResourceClaim ("delayed allocation"). With allocation through the driver, it
278-
may also make sense to allocate a ResourceClaim as soon as it gets created
279-
("immediate allocation").
280-
281-
Immediate allocation is useful when allocating a resource is expensive (for
282-
example, programming an FPGA) and the resource therefore is meant to be used by
283-
multiple different Pods, either in parallel or one after the other. Another use
284-
case is managing resource allocation in a third-party component which fully
285-
understands optimal placement of everything that needs to run on a certain
286-
cluster.
287-
288-
The downside is that Pod resource requirements cannot be considered when choosing
289-
where to allocate. If a resource was allocated so that it is only available on
290-
one node and the Pod cannot run there because other resources like RAM or CPU
291-
are exhausted on that node, then the Pod cannot run elsewhere. The same applies
292-
to resources that are available on a certain subset of the nodes and those
293-
nodes are busy.
294-
295-
Different lifecycles of a ResourceClaim can be combined with different allocation modes
296-
arbitrarily. Some combinations are more useful than others:
297-
298-
```
299-
+-----------+----------------------------------------------------------------------+
300-
| | allocation mode |
301-
| lifecycle | immediate | delayed |
302-
+-----------+------------------------------------+---------------------------------+
303-
| regular | starts the potentially | avoids wasting resources |
304-
| claim | slow allocation as soon | while they are not needed yet |
305-
| | as possible | |
306-
+-----------+------------------------------------+---------------------------------+
307-
| claim | same benefit as above, | resource allocated when needed, |
308-
| template | but ignores other pod constraints | allocation coordinated by |
309-
| | during allocation | scheduler |
310-
+-----------+------------------------------------+---------------------------------+
311-
```
312271
313-
```
314-
type ResourceClaimSpec struct {
315272
...
316-
317-
// Allocation can start immediately or when a Pod wants to use the
318-
// resource. "WaitForFirstConsumer" is the default.
319-
// +optional
320-
//
321-
// This is an alpha field and requires enabling the DRAControlPlaneController
322-
// feature gate.
323-
AllocationMode AllocationMode
324-
}
325-
326-
// AllocationMode describes whether a ResourceClaim gets allocated immediately
327-
// when it gets created (AllocationModeImmediate) or whether allocation is
328-
// delayed until it is needed for a Pod
329-
// (AllocationModeWaitForFirstConsumer). Other modes might get added in the
330-
// future.
331-
type AllocationMode string
332-
333-
const (
334-
// When a ResourceClaim has AllocationModeWaitForFirstConsumer, allocation is
335-
// delayed until a Pod gets scheduled that needs the ResourceClaim. The
336-
// scheduler will consider all resource requirements of that Pod and
337-
// trigger allocation for a node that fits the Pod.
338-
//
339-
// The ResourceClaim gets deallocated as soon as it is not in use anymore.
340-
AllocationModeWaitForFirstConsumer AllocationMode = "WaitForFirstConsumer"
341-
342-
// When a ResourceClaim has AllocationModeImmediate and the ResourceClass
343-
// uses a control plane controller, allocation starts
344-
// as soon as the ResourceClaim gets created. This is done without
345-
// considering the needs of Pods that will use the ResourceClaim
346-
// because those Pods are not known yet.
347-
//
348-
// When structured parameters are used, nothing special is done for
349-
// allocation and thus allocation happens when the scheduler handles
350-
// first Pod which needs the ResourceClaim, as with "WaitForFirstConsumer".
351-
//
352-
// In both cases, claims remain allocated even when not in use.
353-
AllocationModeImmediate AllocationMode = "Immediate"
354273
)
355274
```
356275

@@ -359,30 +278,45 @@ const (
359278
```
360279
type ResourceClaimStatus struct {
361280
...
362-
// ControllerName is a copy of the driver name from the ResourceClass at
363-
// the time when allocation started. It is empty when the claim was
364-
// allocated through structured parameters,
281+
282+
Allocation *AllocationResult // same as in #4381
283+
284+
// Indicates that a claim is to be deallocated. While this is set,
285+
// no new consumers may be added to ReservedFor.
286+
//
287+
// This is only used if the claim needs to be deallocated by a DRA driver.
288+
// That driver then must deallocate this claim and reset the field
289+
// together with clearing the Allocation field.
365290
//
366291
// This is an alpha field and requires enabling the DRAControlPlaneController
367292
// feature gate.
368293
//
369294
// +optional
370-
ControllerName string
295+
DeallocationRequested bool
371296
372-
// DeallocationRequested indicates that a ResourceClaim is to be
373-
// deallocated.
374-
//
375-
// The driver then must deallocate this claim and reset the field
376-
// together with clearing the Allocation field.
297+
...
298+
}
299+
300+
type AllocationResult struct {
301+
...
302+
303+
// ControllerName is the name of the DRA driver which handled the
304+
// allocation. That driver is also responsible for deallocating the
305+
// claim. It is empty when the claim can be deallocated without
306+
// involving a driver.
377307
//
378-
// While DeallocationRequested is set, no new consumers may be added to
379-
// ReservedFor.
308+
// A driver may allocate devices provided by other drivers, so this
309+
// driver name here can be different from the driver names listed for
310+
// the results.
380311
//
381312
// This is an alpha field and requires enabling the DRAControlPlaneController
382313
// feature gate.
383314
//
384315
// +optional
385-
DeallocationRequested bool
316+
ControllerName string
317+
318+
...
319+
}
386320
```
387321

388322
DeallocationRequested gets set by the scheduler when it detects
@@ -393,51 +327,53 @@ cannot be allocated because that node ran out of resources for those.
393327
It also gets set by kube-controller-manager when it detects that
394328
a claim is no longer in use.
395329

396-
### ResourceHandle extensions
330+
### DeviceClass extensions
397331

398-
Resource drivers can use each `ResourceHandle` to store data directly or
399-
cross-reference some other place where information is stored.
400-
This data is guaranteed to be available when a Pod is about
401-
to run on a node, in contrast to the ResourceClass which
402-
may have been deleted in the meantime. It's also protected from
403-
modification by a user, in contrast to an annotation.
332+
In cases where a driver manages resources only on a small subset of the nodes
333+
in the cluster it is useful to inform the scheduler about that up-front because
334+
it helps narrow down the search for suitable nodes. This information can be
335+
placed in a DeviceClass when the admin deploys the DRA driver. This is an optional
336+
optimization.
404337

405-
```
406-
// ResourceHandle holds opaque resource data for processing by a specific kubelet plugin.
407-
type ResourceHandle struct {
338+
```go
339+
type DeviceClass struct {
408340
...
409341

410-
// Data contains the opaque data associated with this ResourceHandle. It is
411-
// set by the controller component of the resource driver whose name
412-
// matches the DriverName set in the ResourceClaimStatus this
413-
// ResourceHandle is embedded in. It is set at allocation time and is
414-
// intended for processing by the kubelet plugin whose name matches
415-
// the DriverName set in this ResourceHandle.
342+
// Only Nodes matching the selector will be considered by the scheduler
343+
// when trying to find a Node that fits a Pod when that Pod uses
344+
// a claim that has not been allocated yet *and* that claim
345+
// gets allocated through a control plane controller. It is ignored
346+
// when the claim does not use a control plane controller
347+
// for allocation.
416348
//
417-
// The maximum size of this field is 16KiB. This may get increased in the
418-
// future, but not reduced.
349+
// Setting this field is optional. If unset, all Nodes are candidates.
419350
//
420-
// This is an alpha field and requires enabling the DRAControlPlaneController feature gate.
351+
// This is an alpha field and requires enabling the DRAControlPlaneController
352+
// feature gate.
421353
//
422354
// +optional
423-
Data string
424-
}
355+
SuitableNodes *v1.NodeSelector
425356

426-
// ResourceHandleDataMaxSize represents the maximum size of resourceHandle.data.
427-
const ResourceHandleDataMaxSize = 16 * 1024
357+
...
358+
}
428359
```
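
As a rough illustration of how `SuitableNodes` narrows the search, the sketch below filters candidate nodes. It makes strong simplifying assumptions: the hypothetical `suitableNodes` type reduces `*v1.NodeSelector` to a set of required labels, and `node`, `matches`, and `candidates` are illustrative names, not scheduler code. The selector only applies when the claim is allocated through a control plane controller, and an unset selector leaves all nodes as candidates.

```go
package main

import "fmt"

// node is a hypothetical, minimal view of a cluster node.
type node struct {
	Name   string
	Labels map[string]string
}

// suitableNodes is a hypothetical simplification of *v1.NodeSelector:
// a node matches when it carries all required labels.
type suitableNodes struct {
	RequiredLabels map[string]string
}

// matches reports whether the node has every required label value.
func matches(n node, sel *suitableNodes) bool {
	for key, want := range sel.RequiredLabels {
		got, ok := n.Labels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// candidates returns the names of nodes worth considering. The selector is
// ignored when it is unset or when the claim does not use a control plane
// controller for allocation.
func candidates(nodes []node, sel *suitableNodes, usesController bool) []string {
	var names []string
	for _, n := range nodes {
		if sel == nil || !usesController || matches(n, sel) {
			names = append(names, n.Name)
		}
	}
	return names
}

func main() {
	nodes := []node{
		{Name: "a", Labels: map[string]string{"gpu": "true"}},
		{Name: "b"},
	}
	sel := &suitableNodes{RequiredLabels: map[string]string{"gpu": "true"}}
	fmt.Println(candidates(nodes, sel, true))
	fmt.Println(candidates(nodes, nil, true))
}
```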
429360

361+
### Custom parameters and results
362+
363+
DRA drivers have to use the API as defined in KEP #4381. They can use the
364+
config fields to receive additional parameters and to convey information to
365+
their kubelet plugin.
430366

431367
### PodSchedulingContext
432368

433369
PodSchedulingContexts get created by a scheduler when it processes a pod which
434-
uses one or more unallocated ResourceClaims with delayed allocation and
370+
uses one or more unallocated ResourceClaims where
435371
allocation of those ResourceClaims is handled by control plane controllers.
436372

437373
```
438374
// PodSchedulingContext holds information that is needed to schedule
439-
// a Pod with ResourceClaims that use "WaitForFirstConsumer" allocation
440-
// mode.
375+
// a Pod with ResourceClaims that use a control plane controller
376+
// for allocation.
441377
//
442378
// This is an alpha type and requires enabling the DynamicResourceAllocation
443379
// and DRAControlPlaneController feature gates.
@@ -587,18 +523,12 @@ other ResourceClaim until a node gets selected by the scheduler.
587523

588524
### Coordinating resource allocation through the scheduler
589525

590-
For immediate allocation, scheduling Pods is simple because the
591-
resource is already allocated and determines the nodes on which the
592-
Pod may run. The downside is that pod scheduling is less flexible.
593-
594-
For delayed allocation, a node is selected tentatively by the scheduler
526+
A node is selected tentatively by the scheduler
595527
in an iterative process where the scheduler suggests some potential nodes
596528
that fit the other resource requirements of a Pod and resource drivers
597529
respond with information about whether they can allocate claims for those
598530
nodes. This exchange of information happens through the `PodSchedulingContext`
599-
for a Pod. The scheduler has to involve the drivers because it
600-
doesn't know what claim parameters mean and where suitable resources are
601-
currently available.
531+
for a Pod.
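
The exchange described above can be sketched as follows. This is a simplified model rather than the actual `PodSchedulingContext` type: the struct, its field names, and `selectNode` are assumptions made for illustration. The scheduler proposes nodes, each driver reports the nodes where it cannot allocate its claim, and the scheduler picks the first proposed node that no driver rejected.

```go
package main

import "fmt"

// schedulingContext is a hypothetical, simplified model of the data
// exchanged between the scheduler and the DRA drivers for one Pod.
type schedulingContext struct {
	PotentialNodes  []string            // proposed by the scheduler
	UnsuitableNodes map[string][]string // per claim, reported by the drivers
	SelectedNode    string              // set once a node was chosen
}

// selectNode picks the first potential node that no driver marked as
// unsuitable. It returns false when another round is needed.
func selectNode(ctx *schedulingContext) bool {
	rejected := map[string]bool{}
	for _, nodes := range ctx.UnsuitableNodes {
		for _, n := range nodes {
			rejected[n] = true
		}
	}
	for _, n := range ctx.PotentialNodes {
		if !rejected[n] {
			ctx.SelectedNode = n
			return true
		}
	}
	return false
}

func main() {
	ctx := &schedulingContext{
		PotentialNodes: []string{"node-1", "node-2", "node-3"},
		UnsuitableNodes: map[string][]string{
			"claim-a": {"node-1"},
			"claim-b": {"node-1", "node-3"},
		},
	}
	if selectNode(ctx) {
		fmt.Println("selected:", ctx.SelectedNode)
	}
}
```

When `selectNode` returns false, the scheduler in this model would propose a different set of potential nodes and wait for the drivers to respond again.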
602532

603533
Once the scheduler is confident that it has enough information to select
604534
a node that will probably work for all claims, it asks the driver(s) to
@@ -768,7 +698,7 @@ to understand the parameters for the claim and available capacity in order
768698
to simulate the effect of allocating claims as part of scheduling and of
769699
creating or removing nodes.
770700

771-
This is not possible with opaque parameters as described in this KEP. If a DRA
701+
This is not possible when a control plane controller interprets parameters. If a DRA
772702
driver developer wants to support Cluster Autoscaler, they have to use
773703
structured parameters as defined in [KEP
774704
#4381](https://github.com/kubernetes/enhancements/issues/4381).
@@ -993,6 +923,11 @@ resources were not requested.
993923
- kube-controller-manager
994924
- kube-scheduler
995925
- kubelet
926+
- Feature gate name: DRAControlPlaneController
927+
- Components depending on the feature gate:
928+
- kube-apiserver
929+
- kube-controller-manager
930+
- kube-scheduler
996931

997932
###### Does enabling the feature change any default behavior?
998933

@@ -1283,6 +1218,8 @@ instructions.
12831218
a template are generated instead of deterministic), scheduler performance
12841219
enhancements (no more backoff delays).
12851220
- Kubernetes 1.29, 1.30: most blocking API calls moved into Pod binding goroutine
1221+
- Kubernetes 1.31: v1alpha3 with a new API (removal of support for immediate
1222+
allocation and for CRDs as claim parameters)
12861223

12871224
## Drawbacks
12881225
