
Commit 8832b94

Merge pull request #49822 from pohly/dra-admin-attributes-and-taints
DRA: device taints and tolerations
2 parents 649bda2 + 779aeeb commit 8832b94


3 files changed: +161 -22 lines changed


content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

Lines changed: 135 additions & 22 deletions
Original file line number | Diff line number | Diff line change
@@ -6,6 +6,8 @@ title: Dynamic Resource Allocation
66
content_type: concept
77
weight: 65
88
api_metadata:
9+
- apiVersion: "resource.k8s.io/v1alpha3"
10+
kind: "DeviceTaintRule"
911
- apiVersion: "resource.k8s.io/v1beta1"
1012
kind: "ResourceClaim"
1113
- apiVersion: "resource.k8s.io/v1beta1"
@@ -14,6 +16,14 @@ api_metadata:
1416
kind: "DeviceClass"
1517
- apiVersion: "resource.k8s.io/v1beta1"
1618
kind: "ResourceSlice"
19+
- apiVersion: "resource.k8s.io/v1beta2"
20+
kind: "ResourceClaim"
21+
- apiVersion: "resource.k8s.io/v1beta2"
22+
kind: "ResourceClaimTemplate"
23+
- apiVersion: "resource.k8s.io/v1beta2"
24+
kind: "DeviceClass"
25+
- apiVersion: "resource.k8s.io/v1beta2"
26+
kind: "ResourceSlice"
1727
---
1828

1929
<!-- overview -->
@@ -48,8 +58,8 @@ v{{< skew currentVersion>}}, check the documentation for that version of Kuberne
4858

4959
## API
5060

51-
The `resource.k8s.io/v1beta1`
52-
{{< glossary_tooltip text="API group" term_id="api-group" >}} provides these types:
61+
The `resource.k8s.io/v1beta1` and `resource.k8s.io/v1beta2`
62+
{{< glossary_tooltip text="API groups" term_id="api-group" >}} provide these types:
5363

5464
ResourceClaim
5565
: Describes a request for access to resources in the cluster,
@@ -71,9 +81,13 @@ DeviceClass
7181
in a ResourceClaim must reference exactly one DeviceClass.
7282

7383
ResourceSlice
74-
: Used by DRA drivers to publish information about resources
84+
: Used by DRA drivers to publish information about resources (typically devices)
7585
that are available in the cluster.
7686

87+
DeviceTaintRule
88+
: Used by admins or control plane components to add device taints
89+
to the devices described in ResourceSlices.
90+
7791
All parameters that select devices are defined in the ResourceClaim and
7892
DeviceClass with in-tree types. Configuration parameters can be embedded there.
7993
Which configuration parameters are valid depends on the DRA driver -- Kubernetes
@@ -94,15 +108,16 @@ Here is an example for a fictional resource driver. Two ResourceClaim objects
94108
will get created for this Pod and each container gets access to one of them.
95109

96110
```yaml
97-
apiVersion: resource.k8s.io/v1beta1
111+
apiVersion: resource.k8s.io/v1beta2
98112
kind: DeviceClass
99-
name: resource.example.com
113+
metadata:
114+
name: resource.example.com
100115
spec:
101116
selectors:
102117
- cel:
103118
expression: device.driver == "resource-driver.example.com"
104119
---
105-
apiVersion: resource.k8s.io/v1beta1
120+
apiVersion: resource.k8s.io/v1beta2
106121
kind: ResourceClaimTemplate
107122
metadata:
108123
name: large-black-cat-claim-template
@@ -111,13 +126,14 @@ spec:
111126
devices:
112127
requests:
113128
- name: req-0
114-
deviceClassName: resource.example.com
115-
selectors:
116-
- cel:
117-
expression: |-
118-
device.attributes["resource-driver.example.com"].color == "black" &&
119-
device.attributes["resource-driver.example.com"].size == "large"
120-
–--
129+
exactly:
130+
deviceClassName: resource.example.com
131+
selectors:
132+
- cel:
133+
expression: |-
134+
device.attributes["resource-driver.example.com"].color == "black" &&
135+
device.attributes["resource-driver.example.com"].size == "large"
136+
---
121137
apiVersion: v1
122138
kind: Pod
123139
metadata:
@@ -219,7 +235,7 @@ admin access grants access to in-use devices and may enable additional
219235
permissions when making the device available in a container:
220236

221237
```yaml
222-
apiVersion: resource.k8s.io/v1beta1
238+
apiVersion: resource.k8s.io/v1beta2
223239
kind: ResourceClaimTemplate
224240
metadata:
225241
name: large-black-cat-claim-template
@@ -228,9 +244,10 @@ spec:
228244
devices:
229245
requests:
230246
- name: req-0
231-
deviceClassName: resource.example.com
232-
allocationMode: All
233-
adminAccess: true
247+
exactly:
248+
deviceClassName: resource.example.com
249+
allocationMode: All
250+
adminAccess: true
234251
```
235252

236253
If this feature is disabled, the `adminAccess` field will be removed
@@ -277,7 +294,7 @@ allocated if it is available. But if it is not and two small white devices are a
277294
the pod will still be able to run.
278295

279296
```yaml
280-
apiVersion: resource.k8s.io/v1beta1
297+
apiVersion: resource.k8s.io/v1beta2
281298
kind: ResourceClaimTemplate
282299
metadata:
283300
name: prioritized-list-claim-template
@@ -327,7 +344,7 @@ handles this and it is transparent to the consumer as the ResourceClaim API is n
327344

328345
```yaml
329346
kind: ResourceSlice
330-
apiVersion: resource.k8s.io/v1beta1
347+
apiVersion: resource.k8s.io/v1beta2
331348
metadata:
332349
name: resourceslice
333350
spec:
@@ -347,21 +364,110 @@ spec:
347364
consumesCounters:
348365
- counterSet: gpu-1-counters
349366
counters:
350-
memory:
367+
memory:
351368
value: 6Gi
352369
- name: device-2
353370
consumesCounters:
354371
- counterSet: gpu-1-counters
355372
counters:
356-
memory:
373+
memory:
357374
value: 6Gi
358375
```
359376

377+
## Device taints and tolerations
378+
379+
{{< feature-state feature_gate_name="DRADeviceTaints" >}}
380+
381+
Device taints are similar to node taints: a taint has a string key, a string
382+
value, and an effect. The effect is applied to the ResourceClaim which is
383+
using a tainted device and to all Pods referencing that ResourceClaim.
384+
The "NoSchedule" effect prevents scheduling those Pods.
385+
Tainted devices are ignored when trying to allocate a ResourceClaim
386+
because using them would prevent scheduling of Pods.
387+
388+
The "NoExecute" effect implies "NoSchedule" and in addition causes eviction
389+
of all Pods which have been scheduled already. This eviction is implemented
390+
in the device taint eviction controller in kube-controller-manager by
391+
deleting affected Pods.
392+
393+
ResourceClaims can tolerate taints. If a taint is tolerated, its effect does
394+
not apply. An empty toleration matches all taints. A toleration can be limited to
395+
certain effects and/or match certain key/value pairs. A toleration can check
396+
that a certain key exists, regardless of which value it has, or it can check
397+
for specific values of a key.
398+
For more information on this matching see the
399+
[node taint concepts](/docs/concepts/scheduling-eviction/taint-and-toleration#concepts).
400+
401+
Eviction can be delayed by tolerating a taint for a certain duration.
402+
That delay starts at the time when a taint gets added to a device, which is recorded in a field
403+
of the taint.
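
As a rough illustration of the toleration rules described above, the following sketch shows a ResourceClaim that tolerates one specific `NoExecute` taint for a limited time. It assumes that tolerations are listed under `exactly.tolerations` in the `v1beta2` request and that the fields mirror node tolerations (`key`, `operator`, `value`, `effect`, `tolerationSeconds`). The claim name and the 300-second delay are made up for this example, so check the API reference for your Kubernetes version before relying on it.

```yaml
apiVersion: resource.k8s.io/v1beta2
kind: ResourceClaim
metadata:
  name: tolerant-claim  # hypothetical name
spec:
  devices:
    requests:
    - name: req-0
      exactly:
        deviceClassName: resource.example.com
        tolerations:
        # Assumed field layout, modeled on node tolerations.
        - key: dra.example.com/unhealthy
          operator: Equal
          value: Broken
          effect: NoExecute
          # Delay eviction by five minutes, counted from the
          # time recorded when the taint was added to the device.
          tolerationSeconds: 300
```

Using `operator: Exists` without a `value` would instead match any value of that key.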
404+
405+
Taints apply as described above also to ResourceClaims allocating "all" devices on a node.
406+
All devices must be untainted or all of their taints must be tolerated.
407+
Allocating a device with admin access (described [above](#admin-access))
408+
is not exempt either. An admin using that mode must explicitly tolerate all taints
409+
to access tainted devices.
410+
411+
Taints can be added to devices in two different ways:
412+
413+
### Taints set by the driver
414+
415+
A DRA driver can add taints to the device information that it publishes in ResourceSlices.
416+
Consult the documentation of a DRA driver to learn whether the driver uses taints and what
417+
their keys and values are.
418+
419+
### Taints set by an admin
420+
421+
An admin or a control plane component can taint devices without having to tell
422+
the DRA driver to include taints in its device information in ResourceSlices. They do that by
423+
creating DeviceTaintRules. Each DeviceTaintRule adds one taint to devices which
424+
match the device selector. Without such a selector, no devices are tainted. This
425+
makes it harder to accidentally evict all pods using ResourceClaims when leaving out
426+
the selector by mistake.
427+
428+
Devices can be selected by giving the name of a DeviceClass, driver, pool,
429+
and/or device. The DeviceClass selects all devices that are selected by the
430+
selectors in that DeviceClass. With just the driver name, an admin can taint
431+
all devices managed by that driver, for example while doing some kind of
432+
maintenance of that driver across the entire cluster. Adding a pool name can
433+
limit the taint to a single node, if the driver manages node-local devices.
434+
435+
Finally, adding the device name can select one specific device. The device name
436+
and pool name can also be used alone, if desired. For example, drivers for node-local
437+
devices are encouraged to use the node name as their pool name. Then tainting with
438+
that pool name automatically taints all devices on a node.
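
A sketch of that node-wide case might look like the following. It assumes the device selector accepts a `pool` field next to `driver`, and that the fictional driver follows the recommendation of using the node name as its pool name. The rule name, the node name `node-1`, and the taint key are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: node-1-maintenance  # hypothetical name
spec:
  deviceSelector:
    driver: dra.example.com
    # Assumes the driver publishes node-local devices in a
    # pool named after the node.
    pool: node-1
  taint:
    key: dra.example.com/maintenance  # hypothetical key
    value: Planned
    # Keep new Pods away from these devices without
    # evicting Pods that are already running.
    effect: NoSchedule
```

Choosing `NoSchedule` here rather than `NoExecute` leaves running workloads untouched while preventing new allocations.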
439+
440+
Drivers might use stable names like "gpu-0" that hide which specific device is
441+
currently assigned to that name. To support tainting a specific hardware
442+
instance, CEL selectors can be used in a DeviceTaintRule to match a vendor-specific
443+
unique ID attribute, if the driver supports one for its hardware.
444+
445+
The taint applies as long as the DeviceTaintRule exists. It can be modified and
446+
removed at any time. Here is one example of a DeviceTaintRule for a fictional
447+
DRA driver:
448+
449+
```yaml
450+
apiVersion: resource.k8s.io/v1alpha3
451+
kind: DeviceTaintRule
452+
metadata:
453+
name: example
454+
spec:
455+
# The entire hardware installation for this
456+
# particular driver is broken.
457+
# Evict all pods and don't schedule new ones.
458+
deviceSelector:
459+
driver: dra.example.com
460+
taint:
461+
key: dra.example.com/unhealthy
462+
value: Broken
463+
effect: NoExecute
464+
```
465+
360466
## Enabling dynamic resource allocation
361467

362468
Dynamic resource allocation is a *beta feature* which is off by default and only enabled when the
363469
`DynamicResourceAllocation` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
364-
and the `resource.k8s.io/v1beta1` {{< glossary_tooltip text="API group" term_id="api-group" >}}
470+
and the `resource.k8s.io/v1beta1` and `resource.k8s.io/v1beta2` {{< glossary_tooltip text="API groups" term_id="api-group" >}}
365471
are enabled. For details on that, see the `--feature-gates` and `--runtime-config`
366472
[kube-apiserver parameters](/docs/reference/command-line-tools-reference/kube-apiserver/).
367473
kube-scheduler, kube-controller-manager and kubelet also need the feature gate.
@@ -426,6 +532,13 @@ and only enabled when the `DRAPartitionableDevices`
426532
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
427533
is enabled in the kube-apiserver and kube-scheduler.
428534

535+
### Enabling device taints and tolerations
536+
537+
[Device taints and tolerations](#device-taints-and-tolerations) is an *alpha feature* and only enabled when the
538+
`DRADeviceTaints` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
539+
is enabled in the kube-apiserver, kube-controller-manager and kube-scheduler. To use DeviceTaintRules, the
540+
`resource.k8s.io/v1alpha3` API version must be enabled.
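
For clusters bootstrapped with kubeadm, these settings could be expressed roughly as shown below. This is only a sketch: it assumes the kubeadm `v1beta4` configuration format, where `extraArgs` is a list of name/value pairs, and other installers will need a different mechanism. The kubelet also needs the `DynamicResourceAllocation` feature gate, which is configured separately and not shown here.

```yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true,DRADeviceTaints=true
  # Enable the alpha API version that serves DeviceTaintRule
  # in addition to the beta DRA API versions.
  - name: runtime-config
    value: resource.k8s.io/v1alpha3=true,resource.k8s.io/v1beta1=true,resource.k8s.io/v1beta2=true
controllerManager:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true,DRADeviceTaints=true
scheduler:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true,DRADeviceTaints=true
```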
541+
429542
## {{% heading "whatsnext" %}}
430543

431544
- For more information on the design, see the

content/en/docs/concepts/scheduling-eviction/taint-and-toleration.md

Lines changed: 9 additions & 0 deletions
@@ -322,9 +322,18 @@ tolerations to all daemons, to prevent DaemonSets from breaking.
322322
Adding these tolerations ensures backward compatibility. You can also add
323323
arbitrary tolerations to DaemonSets.
324324

325+
## Device taints and tolerations
326+
327+
Instead of tainting entire nodes, administrators can also [taint individual devices](/docs/concepts/scheduling-eviction/dynamic-resource-allocation#device-taints-and-tolerations)
328+
when the cluster uses [dynamic resource allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
329+
to manage special hardware. The advantage is that tainting can be targeted towards exactly the hardware that
330+
is faulty or needs maintenance. Tolerations are also supported and can be specified when requesting
331+
devices. Like taints they apply to all pods which share the same allocated device.
332+
325333
## {{% heading "whatsnext" %}}
326334

327335
* Read about [Node-pressure Eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
328336
and how you can configure it
329337
* Read about [Pod Priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
338+
* Read about [device taints and tolerations](/docs/concepts/scheduling-eviction/dynamic-resource-allocation#device-taints-and-tolerations)
330339

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
1+
---
2+
title: DRADeviceTaints
3+
content_type: feature_gate
4+
_build:
5+
list: never
6+
render: false
7+
8+
stages:
9+
- stage: alpha
10+
defaultValue: false
11+
fromVersion: "1.33"
12+
---
13+
Enables support for
14+
[tainting devices and selectively tolerating those taints](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
15+
when using dynamic resource allocation to manage devices.
16+
17+
This feature gate has no effect unless you also enable the `DynamicResourceAllocation` feature gate.
