Commit c239f4a

Merge pull request kubernetes#4667 from pohly/dra-kubelet-no-api-dependency
KEP-4318: DRA: avoid kubelet API version dependency
2 parents bbcd302 + b01f7f3 commit c239f4a

1 file changed: 99 additions and 78 deletions

keps/sig-node/4381-dra-structured-parameters/README.md
@@ -78,7 +78,7 @@ SIG Architecture for cross-cutting KEPs).
 - [Risks and Mitigations](#risks-and-mitigations)
 - [Feature not used](#feature-not-used)
 - [Compromised node](#compromised-node)
-- [Compromised resource driver plugin](#compromised-resource-driver-plugin)
+- [Compromised kubelet plugin](#compromised-kubelet-plugin)
 - [User permissions and quotas](#user-permissions-and-quotas)
 - [Usability](#usability)
 - [Design Details](#design-details)
@@ -110,9 +110,10 @@ SIG Architecture for cross-cutting KEPs).
 - [PreBind](#prebind)
 - [Unreserve](#unreserve)
 - [kubelet](#kubelet)
-- [Managing resources](#managing-resources)
 - [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
-- [NodeListAndWatchResources](#nodelistandwatchresources)
+- [Version skew](#version-skew)
+- [Security](#security)
+- [Managing resources](#managing-resources)
 - [NodePrepareResource](#nodeprepareresource)
 - [NodeUnprepareResources](#nodeunprepareresources)
 - [Simulation with CA](#simulation-with-ca)
@@ -526,25 +527,35 @@ In production, a similar PodTemplateSpec in a Deployment will be used.
 ### Publishing node resources

 The resources available on a node need to be published to the API server. In
-the typical case, this is expected to be published by the on-node driver via
-the kubelet, as described below. However, the source of this data may vary; for
+the typical case, this is expected to be published by the on-node driver
+as described in the next paragraph. However, the source of this data may vary; for
 example, a cloud provider controller could populate this based upon information
 from the cloud provider API.

-In the kubelet case, each kubelet publishes kubelet publishes a set of
-`ResourceSlice` objects to the API server with content provided by the
-corresponding DRA drivers running on its node. Access control through the node
-authorizer ensures that the kubelet running on one node is not allowed to
-create or modify `ResourceSlices` belonging to another node. A `nodeName`
-field in each `ResourceSlice` object is used to determine which objects are
-managed by which kubelet.
-
-**NOTE:** `ResourceSlices` are published separately for each driver, using
-whatever version of the `resource.k8s.io` API is supported by the kubelet. That
-same version is then also used in the gRPC interface between the kubelet and
-the DRA drivers providing content for those objects. It might be possible to
-support version skew (= keeping kubelet at an older version than the control
-plane and the DRA drivers) in the future, but currently this is out of scope.
+In the kubelet case, each driver running on a node publishes a set of
+`ResourceSlice` objects to the API server for its own resources, using its
+connection to the apiserver. Access control through a validating admission
+policy can ensure that the drivers running on one node are not allowed to
+create or modify `ResourceSlices` belonging to another node. The `nodeName`
+and `driverName` fields in each `ResourceSlice` object are used to determine which objects are
+managed by which driver instance. The owner reference ensures that objects
+belonging to a node get cleaned up when the node gets removed.
+
+In addition, whenever kubelet starts, it first deletes all `ResourceSlices`
+belonging to the node with a `DeleteCollection` call that uses the node name in
+a field filter. This ensures that no pods depending on DRA get scheduled to the
+node until the required DRA drivers have started up again (node reboot) and
+reconnected to kubelet (kubelet restart). It also ensures that drivers which
+don't get started up again at all don't leave stale `ResourceSlices`
+behind. Garbage collection does not help in this case because the node object
+still exists. For the same reasons, the ResourceSlices belonging to a driver
+get removed when the driver unregisters, this time with a field filter for node
+name and driver name.
+
+Deleting `ResourceSlices` is possible because all information in them can be
+reconstructed by the driver. This has no effect on already allocated claims
+because the allocation result is tracked in those claims, not the
+`ResourceSlice` objects (see [below](#state-and-communication)).

 Embedded inside each `ResourceSlice` is the representation of the resources
 managed by a driver according to a specific "structured model". In the example
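Reviewer sketch for the hunk above: the two cleanup paths (kubelet startup and driver unregistration) differ only in the field filter passed to `DeleteCollection`. The helper below builds such a filter using the standard comma-separated `key=value` field selector syntax. The function name `sliceSelector` is hypothetical, and the selector keys `nodeName` and `driverName` are an assumption taken from the field names in the text; the keys actually accepted depend on the `resource.k8s.io` version in use.

```go
package main

import (
	"fmt"
	"strings"
)

// sliceSelector returns a field selector string that matches the
// ResourceSlices of one node and, if driverName is non-empty, of one
// driver on that node. The keys "nodeName" and "driverName" are assumed.
func sliceSelector(nodeName, driverName string) string {
	terms := []string{"nodeName=" + nodeName}
	if driverName != "" {
		terms = append(terms, "driverName="+driverName)
	}
	return strings.Join(terms, ",")
}

func main() {
	// kubelet startup: remove every slice that belongs to the node.
	fmt.Println(sliceSelector("worker-1", ""))
	// driver unregistration: remove only that driver's slices.
	fmt.Println(sliceSelector("worker-1", "gpu.example.com"))
}
```

With a real client, the resulting string would presumably be passed as the field selector in the list options of the `DeleteCollection` call.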
@@ -842,7 +853,7 @@ return quickly without doing any work for pods.

 #### Compromised node

-Kubelet is intentionally limited to read-only access for ResourceClass and ResourceClaim
+Kubelet is intentionally limited to read-only access for ResourceClaim
 to prevent that a
 compromised kubelet interferes with scheduling of pending pods, for example
 by updating status information normally set by the scheduler.
@@ -863,22 +874,24 @@ allocation, such an attack is still possible, but the attack code would have to
 be different for each resource driver because all of them will use structured
 parameters differently for reporting resource availability.

-#### Compromised resource driver plugin
+#### Compromised kubelet plugin

 This is the result of an attack against the resource driver, either from a
 container which uses a resource exposed by the driver, a compromised kubelet
 which interacts with the plugin, or due to resource driver running on a node
 with a compromised root account.

-The resource driver plugin only needs read access to objects described in this
-KEP, so compromising it does not interfere with dynamic resource allocation for
-other drivers.
+The resource driver needs write access for ResourceSlices. It can be deployed so
+that it can only write objects associated with the node, so the impact of a
+compromise would be limited to the node. Other drivers on the node could also
+be impacted because there is no separation by driver.

-A resource driver may need root access on the node to manage
+However, a resource driver may need root access on the node to manage
 hardware. Attacking the driver therefore may lead to root privilege
 escalation. Ideally, driver authors should try to avoid depending on root
 permissions and instead use capabilities or special permissions for the kernel
-APIs that they depend on.
+APIs that they depend on. Long term, limiting apiserver access by driver
+name would be useful.

 A resource driver may also need privileged access to remote services to manage
 network-attached devices. Resource driver vendors and cluster administrators
@@ -931,7 +944,7 @@ Several components must be implemented or modified in Kubernetes:
 ResourceClaim (directly or through a template) and ensure that the
 resource is allocated before the Pod gets scheduled, similar to
 https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go
-- Kubelet must be extended to retrieve information from ResourceClaims
+- Kubelet must be extended to manage ResourceClaims
 and to call a resource kubelet plugin. That plugin returns CDI device ID(s)
 which then must be passed to the container runtime.

@@ -1188,13 +1201,13 @@ drivers are expected to be written for Kubernetes.

 ##### ResourceSlice

-For each node, one or more ResourceSlice objects get created. The kubelet
-publishes them with the node as the owner, so they get deleted when a node goes
+For each node, one or more ResourceSlice objects get created. The drivers
+on a node publish them with the node as the owner, so they get deleted when a node goes
 down and then gets removed.

 All list types are atomic because that makes tracking the owner for
 server-side-apply (SSA) simpler. Patching individual list elements is not
-needed and there is a single owner (kubelet).
+needed and there is a single owner.

 ```go
 // ResourceSlice provides information about available
@@ -2049,6 +2062,53 @@ Unreserve is called in two scenarios:

 ### kubelet

+#### Communication between kubelet and resource kubelet plugin
+
+Resource kubelet plugins are discovered through the [kubelet plugin registration
+mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
+new "ResourcePlugin" type will be used in the Type field of the
+[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
+response to distinguish the plugin from device and CSI plugins.
+
+Under the advertised Unix Domain socket the kubelet plugin provides the
+k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
+[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
+with “volume” replaced by “resource” and volume specific parts removed.
+
+#### Version skew
+
+Previously, kubelet retrieved ResourceClaims and published ResourceSlices on
+behalf of DRA drivers on the node. The information included in those got passed
+between API server, kubelet, and kubelet plugin using the version of the
+resource.k8s.io used by the kubelet. Combining a kubelet using some older API
+version with a plugin using a new version was not possible because conversion
+of the resource.k8s.io types is only supported in the API server and an old
+kubelet wouldn't know about a new version anyway.
+
+Keeping kubelet at some old release while upgrading the control plane and DRA drivers
+is desirable and officially supported by Kubernetes. To support the same when
+using DRA, the kubelet now leaves [ResourceSlice
+handling](#publishing-node-resources) almost entirely to the plugins. The
+remaining calls are done with whatever resource.k8s.io API version is the
+latest known to the kubelet. To support version skew, support for older API
+versions must be preserved as far back as support for older kubelet releases is
+desired.
+
+#### Security
+
+The daemonset of a DRA driver must be configured to have a service account
+which grants the following permissions:
+- get/list/watch/create/update/patch/delete ResourceSlice
+- get ResourceClaim
+- get Node
+
+Ideally, write access to ResourceSlice should be limited to objects belonging
+to the node. This is possible with a [validating admission
+policy](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/). As
+this is not a core feature of the DRA KEP, instructions for how to do that will
+not be included here. Instead, the DRA example driver will provide an example
+and documentation.
+
 #### Managing resources

 kubelet must ensure that resources are ready for use on the node before running
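Reviewer sketch for the Security section added in this hunk: the permission list can be read as a small table of resources and allowed verbs. The code below models that table and checks requests against it. The `rule` type, `driverRules`, and `allowed` are hypothetical and only loosely modelled on RBAC policy rules; this is not the actual RBAC API and does not cover the per-node restriction, which the KEP delegates to a validating admission policy.

```go
package main

import "fmt"

// rule is a simplified, hypothetical stand-in for an RBAC policy rule.
type rule struct {
	resource string
	verbs    []string
}

// driverRules mirrors the permission list from the Security section.
var driverRules = []rule{
	{resource: "resourceslices", verbs: []string{"get", "list", "watch", "create", "update", "patch", "delete"}},
	{resource: "resourceclaims", verbs: []string{"get"}},
	{resource: "nodes", verbs: []string{"get"}},
}

// allowed reports whether any rule grants verb on resource.
func allowed(rules []rule, verb, resource string) bool {
	for _, r := range rules {
		if r.resource != resource {
			continue
		}
		for _, v := range r.verbs {
			if v == verb {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(allowed(driverRules, "delete", "resourceslices")) // true
	fmt.Println(allowed(driverRules, "update", "resourceclaims")) // false
}
```

The second check illustrates that a driver only reads ResourceClaims; it never needs to modify them.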
@@ -2068,53 +2128,18 @@ successfully before allowing the pod to be deleted. This ensures that network-at
 for other Pods, including those that might get scheduled to other nodes. It
 also signals that it is safe to deallocate and delete the ResourceClaim.

-
 ![kubelet](./kubelet.png)

-#### Communication between kubelet and resource kubelet plugin
-
-Resource kubelet plugins are discovered through the [kubelet plugin registration
-mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
-new "ResourcePlugin" type will be used in the Type field of the
-[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
-response to distinguish the plugin from device and CSI plugins.
-
-Under the advertised Unix Domain socket the kubelet plugin provides the
-k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
-[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
-with “volume” replaced by “resource” and volume specific parts removed.
-
-##### NodeListAndWatchResources
-
-NodeListAndWatchResources returns a stream of NodeResourcesResponse objects.
-At the start and whenever resource availability changes, the
-plugin must send one such object with all information to the kubelet. The
-kubelet then syncs that information with ResourceSlice objects.
-
-```
-message NodeListAndWatchResourcesRequest {
-}
-
-message NodeListAndWatchResourcesResponse {
-repeated k8s.io.api.resource.v1alpha2.ResourceModel resources = 1;
-}
-```
-
 ##### NodePrepareResource

 This RPC is called by the kubelet when a Pod that wants to use the specified
 resource is scheduled on a node. The Plugin SHALL assume that this RPC will be
 executed on the node where the resource will be used.

-ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name and
-one of the ResourceHandles from the ResourceClaimStatus.AllocationResult with
-a matching DriverName should be passed to the Plugin as parameters to identify
+ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name are
+passed to the Plugin as parameters to identify
 the claim and perform resource preparation.

-ResourceClaim parameters (namespace, UUID, name) are useful for debugging.
-They enable the Plugin to retrieve the full ResourceClaim object, should it
-ever be needed (normally it shouldn't).
-
 The Plugin SHALL return fully qualified device name[s].

 The Plugin SHALL ensure that there are json file[s] in CDI format
@@ -2155,20 +2180,16 @@ message Claim {
 // The name of the Resource claim (ResourceClaim.meta.Name)
 // This field is REQUIRED.
 string name = 3;
-// Resource handle (AllocationResult.ResourceHandles[*].Data)
-// This field is OPTIONAL.
-string resource_handle = 4;
-// Structured parameter resource handle (AllocationResult.ResourceHandles[*].StructuredData).
-// This field is OPTIONAL. If present, it needs to be used
-// instead of resource_handle. It will only have a single entry.
-//
-// Using "repeated" instead of "optional" is a workaround for https://github.com/gogo/protobuf/issues/713.
-repeated k8s.io.api.resource.v1alpha2.StructuredResourceHandle structured_resource_handle = 5;
 }
 ```

-`resource_handle` and `structured_resource_handle` will be set depending on how
-the claim was allocated. See also KEP #3063.
+The allocation result is intentionally not included here. The content of that
+field is version-dependent. The kubelet would need to discover in which version
+each plugin wants the data, then potentially get the claim multiple times
+because only the apiserver can convert between versions. Instead, each plugin
+is required to get the claim itself using its own credentials. In the most common
+case of one plugin per claim, that doubles the number of GETs for each claim
+(once by the kubelet, once by the plugin).
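Reviewer sketch of the plugin-side consequence of the paragraph above: since the `Claim` message now carries only namespace, UID, and name, the plugin looks up the claim itself and reads the allocation from it. Everything here is hypothetical (`claimRef`, `claim`, `claimGetter`, `lookupAllocation`, and the fake client); a real driver would use a generated `resource.k8s.io` client in whatever API version it supports. Comparing the UID is an added precaution assumed here, not something this diff requires.

```go
package main

import (
	"errors"
	"fmt"
)

// claimRef mirrors the three identifying fields of the Claim message.
type claimRef struct{ Namespace, UID, Name string }

// claim is a hypothetical, minimal view of a ResourceClaim.
type claim struct {
	UID        string
	Allocation string // opaque stand-in for the allocation result
}

// claimGetter abstracts the plugin's own API client (hypothetical).
type claimGetter interface {
	Get(namespace, name string) (*claim, error)
}

// lookupAllocation fetches the claim with the plugin's client and returns
// its allocation, rejecting a claim whose UID differs from the request.
func lookupAllocation(g claimGetter, ref claimRef) (string, error) {
	c, err := g.Get(ref.Namespace, ref.Name)
	if err != nil {
		return "", fmt.Errorf("get claim %s/%s: %w", ref.Namespace, ref.Name, err)
	}
	if c.UID != ref.UID {
		return "", errors.New("claim UID mismatch")
	}
	if c.Allocation == "" {
		return "", errors.New("claim is not allocated")
	}
	return c.Allocation, nil
}

// fakeGetter is an in-memory stand-in for the API server.
type fakeGetter map[string]*claim

func (f fakeGetter) Get(namespace, name string) (*claim, error) {
	c, ok := f[namespace+"/"+name]
	if !ok {
		return nil, errors.New("not found")
	}
	return c, nil
}

func main() {
	api := fakeGetter{"default/gpu-claim": {UID: "1234", Allocation: "gpu-0"}}
	alloc, err := lookupAllocation(api, claimRef{Namespace: "default", UID: "1234", Name: "gpu-claim"})
	fmt.Println(alloc, err)
}
```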

 ```
 message NodePrepareResourcesResponse {
