Commit 75e16b2

DRA: avoid kubelet API version dependency
By requiring that drivers on a node connect to the apiserver directly it becomes possible to update drivers and the control plane to a newer release than the kubelet because the kubelet doesn't need to encode/decode the resource.k8s.io API types that are used by the drivers. This simplifies cluster upgrades.
1 parent e61fc51 commit 75e16b2

File tree

1 file changed

+73
-68
lines changed
  • keps/sig-node/4381-dra-structured-parameters

keps/sig-node/4381-dra-structured-parameters/README.md

Lines changed: 73 additions & 68 deletions
@@ -110,9 +110,10 @@ SIG Architecture for cross-cutting KEPs).
110110
- [PreBind](#prebind)
111111
- [Unreserve](#unreserve)
112112
- [kubelet](#kubelet)
113-
- [Managing resources](#managing-resources)
114113
- [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
115-
- [NodeListAndWatchResources](#nodelistandwatchresources)
114+
- [Version skew](#version-skew)
115+
- [Security](#security)
116+
- [Managing resources](#managing-resources)
116117
- [NodePrepareResource](#nodeprepareresource)
117118
- [NodeUnprepareResources](#nodeunprepareresources)
118119
- [Simulation with CA](#simulation-with-ca)
@@ -531,20 +532,13 @@ the kubelet, as described below. However, the source of this data may vary; for
531532
example, a cloud provider controller could populate this based upon information
532533
from the cloud provider API.
533534

534-
In the kubelet case, each kubelet publishes kubelet publishes a set of
535-
`ResourceSlice` objects to the API server with content provided by the
536-
corresponding DRA drivers running on its node. Access control through the node
537-
authorizer ensures that the kubelet running on one node is not allowed to
538-
create or modify `ResourceSlices` belonging to another node. A `nodeName`
539-
field in each `ResourceSlice` object is used to determine which objects are
540-
managed by which kubelet.
541-
542-
**NOTE:** `ResourceSlices` are published separately for each driver, using
543-
whatever version of the `resource.k8s.io` API is supported by the kubelet. That
544-
same version is then also used in the gRPC interface between the kubelet and
545-
the DRA drivers providing content for those objects. It might be possible to
546-
support version skew (= keeping kubelet at an older version than the control
547-
plane and the DRA drivers) in the future, but currently this is out of scope.
535+
In the kubelet case, each driver running on a node publishes a set of
536+
`ResourceSlice` objects to the API server for its own resources, using its
537+
connection to the apiserver. Access control through a validating admission
538+
policy can ensure that the drivers running on one node are not allowed to
539+
create or modify `ResourceSlices` belonging to another node. The `nodeName`
540+
and `driverName` fields in each `ResourceSlice` object are used to determine which objects are
541+
managed by which driver instance.
548542

549543
Embedded inside each `ResourceSlice` is the representation of the resources
550544
managed by a driver according to a specific "structured model". In the example
@@ -931,7 +925,7 @@ Several components must be implemented or modified in Kubernetes:
931925
ResourceClaim (directly or through a template) and ensure that the
932926
resource is allocated before the Pod gets scheduled, similar to
933927
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go
934-
- Kubelet must be extended to retrieve information from ResourceClaims
928+
- Kubelet must be extended to manage ResourceClaims
935929
and to call a resource kubelet plugin. That plugin returns CDI device ID(s)
936930
which then must be passed to the container runtime.
937931

@@ -1188,13 +1182,13 @@ drivers are expected to be written for Kubernetes.
11881182
11891183
##### ResourceSlice
11901184
1191-
For each node, one or more ResourceSlice objects get created. The kubelet
1192-
publishes them with the node as the owner, so they get deleted when a node goes
1185+
For each node, one or more ResourceSlice objects get created. The drivers
1186+
on a node publish them with the node as the owner, so they get deleted when a node goes
11931187
down and then gets removed.
11941188
11951189
All list types are atomic because that makes tracking the owner for
11961190
server-side-apply (SSA) simpler. Patching individual list elements is not
1197-
needed and there is a single owner (kubelet).
1191+
needed and there is a single owner.
11981192
11991193
```go
12001194
// ResourceSlice provides information about available
@@ -2049,6 +2043,56 @@ Unreserve is called in two scenarios:
20492043

20502044
### kubelet
20512045

2046+
#### Communication between kubelet and resource kubelet plugin
2047+
2048+
Resource kubelet plugins are discovered through the [kubelet plugin registration
2049+
mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
2050+
new "ResourcePlugin" type will be used in the Type field of the
2051+
[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
2052+
response to distinguish the plugin from device and CSI plugins.
2053+
2054+
Under the advertised Unix Domain socket the kubelet plugin provides the
2055+
k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
2056+
[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
2057+
with “volume” replaced by “resource” and volume specific parts removed.
2058+
2059+
#### Version skew
2060+
2061+
Previously, kubelet retrieved ResourceClaims and published ResourceSlices on
2062+
behalf of DRA drivers on the node. The information included in those got passed
2063+
between API server, kubelet, and kubelet plugin using the version of the
2064+
resource.k8s.io API used by the kubelet. Combining a kubelet using some older API
2065+
version with a plugin using a new version was not possible because conversion
2066+
of the resource.k8s.io types is only supported in the API server and an old
2067+
kubelet wouldn't know about a new version anyway.
2068+
2069+
Keeping kubelet at some old release while upgrading the control plane and DRA drivers
2070+
is desirable and officially supported by Kubernetes. To support the same when
2071+
using DRA, the kubelet now leaves ResourceSlice handling (almost) entirely to
2072+
the plugins. The one exception is that it deletes all ResourceSlices on
2073+
startup. This ensures that no pods depending on DRA get scheduled to the node
2074+
until the required DRA drivers have started up again. It also ensures that
2075+
drivers which don't get started up again at all don't leave stale
2076+
ResourceSlices behind. For the same reasons, the ResourceSlices belonging to a
2077+
driver get removed when the driver unregisters. This access is done with
2078+
whatever resource.k8s.io API version is the latest known to the kubelet. To
2079+
support version skew, support for older API versions must be preserved as far
2080+
back as support for older kubelet releases is desired.
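
The cleanup rule described above (remove all ResourceSlices of the node on kubelet startup, and remove one driver's ResourceSlices when that driver unregisters) can be sketched as follows. This is a minimal in-memory illustration, not the kubelet implementation: the `slice` type and the `cleanup` function are hypothetical, and only the `nodeName` and `driverName` fields named in this KEP are modeled.

```go
package main

import "fmt"

// slice is a hypothetical, simplified stand-in for a ResourceSlice.
// Only the two fields that identify the owner are modeled.
type slice struct {
	Name       string
	NodeName   string
	DriverName string
}

// cleanup returns the slices that remain after the kubelet on nodeName
// has removed stale ones. An empty driverName models kubelet startup
// (every slice of the node is removed). A non-empty driverName models
// a single driver unregistering.
func cleanup(all []slice, nodeName, driverName string) []slice {
	var kept []slice
	for _, s := range all {
		stale := s.NodeName == nodeName &&
			(driverName == "" || s.DriverName == driverName)
		if !stale {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	all := []slice{
		{Name: "a", NodeName: "node-1", DriverName: "gpu.example.com"},
		{Name: "b", NodeName: "node-1", DriverName: "nic.example.com"},
		{Name: "c", NodeName: "node-2", DriverName: "gpu.example.com"},
	}
	// One driver on node-1 unregisters.
	fmt.Println(cleanup(all, "node-1", "gpu.example.com"))
	// The kubelet on node-1 restarts.
	fmt.Println(cleanup(all, "node-1", ""))
}
```

In a real cluster the deletion would go through the apiserver using the latest resource.k8s.io version known to the kubelet; the sketch only shows which objects are selected.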
2081+
2082+
#### Security
2083+
2084+
The daemonset of a DRA driver must be configured to have a service account
2085+
which grants the following permissions:
2086+
- read/write/patch ResourceSlice
2087+
- read ResourceClaim
2088+
2089+
Ideally, write access to ResourceSlice should be limited to objects belonging
2090+
to the node. This is possible with a [validating admission
2091+
policy](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/). As
2092+
this is not a core feature of the DRA KEP, instructions for how to do that will
2093+
not be included here. Instead, the DRA example driver will provide an example
2094+
and documentation.
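
The rule that such a policy would enforce can be illustrated in Go. This is only a sketch of the check itself: a real validating admission policy is written as CEL expressions, and how the policy learns which node the requesting driver runs on is an assumption here (modeled as the hypothetical `RequesterNode` field). All type and function names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// sliceWrite is a hypothetical, simplified view of a create/update
// request for a ResourceSlice.
type sliceWrite struct {
	// RequesterNode is the node the requesting driver pod runs on.
	// Assumption: the policy can derive this from the request.
	RequesterNode string
	// SliceNodeName is the nodeName field of the ResourceSlice.
	SliceNodeName string
}

var (
	errUnknownNode = errors.New("node of the requester is unknown")
	errForeignNode = errors.New("ResourceSlice belongs to a different node")
)

// validate rejects writes to ResourceSlices of other nodes.
func validate(w sliceWrite) error {
	if w.RequesterNode == "" {
		return errUnknownNode
	}
	if w.RequesterNode != w.SliceNodeName {
		return errForeignNode
	}
	return nil
}

func main() {
	fmt.Println(validate(sliceWrite{RequesterNode: "node-1", SliceNodeName: "node-1"}))
	fmt.Println(validate(sliceWrite{RequesterNode: "node-1", SliceNodeName: "node-2"}))
}
```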
2095+
20522096
#### Managing resources
20532097

20542098
kubelet must ensure that resources are ready for use on the node before running
@@ -2068,53 +2112,18 @@ successfully before allowing the pod to be deleted. This ensures that network-at
20682112
for other Pods, including those that might get scheduled to other nodes. It
20692113
also signals that it is safe to deallocate and delete the ResourceClaim.
20702114

2071-
20722115
![kubelet](./kubelet.png)
20732116

2074-
#### Communication between kubelet and resource kubelet plugin
2075-
2076-
Resource kubelet plugins are discovered through the [kubelet plugin registration
2077-
mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
2078-
new "ResourcePlugin" type will be used in the Type field of the
2079-
[PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
2080-
response to distinguish the plugin from device and CSI plugins.
2081-
2082-
Under the advertised Unix Domain socket the kubelet plugin provides the
2083-
k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
2084-
[CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
2085-
with “volume” replaced by “resource” and volume specific parts removed.
2086-
2087-
##### NodeListAndWatchResources
2088-
2089-
NodeListAndWatchResources returns a stream of NodeResourcesResponse objects.
2090-
At the start and whenever resource availability changes, the
2091-
plugin must send one such object with all information to the kubelet. The
2092-
kubelet then syncs that information with ResourceSlice objects.
2093-
2094-
```
2095-
message NodeListAndWatchResourcesRequest {
2096-
}
2097-
2098-
message NodeListAndWatchResourcesResponse {
2099-
repeated k8s.io.api.resource.v1alpha2.ResourceModel resources = 1;
2100-
}
2101-
```
2102-
21032117
##### NodePrepareResource
21042118

21052119
This RPC is called by the kubelet when a Pod that wants to use the specified
21062120
resource is scheduled on a node. The Plugin SHALL assume that this RPC will be
21072121
executed on the node where the resource will be used.
21082122

2109-
ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name and
2110-
one of the ResourceHandles from the ResourceClaimStatus.AllocationResult with
2111-
a matching DriverName should be passed to the Plugin as parameters to identify
2123+
ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, and ResourceClaim.Name are
2124+
passed to the Plugin as parameters to identify
21122125
the claim and perform resource preparation.
21132126

2114-
ResourceClaim parameters (namespace, UUID, name) are useful for debugging.
2115-
They enable the Plugin to retrieve the full ResourceClaim object, should it
2116-
ever be needed (normally it shouldn't).
2117-
21182127
The Plugin SHALL return fully qualified device name[s].
21192128

21202129
The Plugin SHALL ensure that there are json file[s] in CDI format
@@ -2155,20 +2164,16 @@ message Claim {
21552164
// The name of the Resource claim (ResourceClaim.meta.Name)
21562165
// This field is REQUIRED.
21572166
string name = 3;
2158-
// Resource handle (AllocationResult.ResourceHandles[*].Data)
2159-
// This field is OPTIONAL.
2160-
string resource_handle = 4;
2161-
// Structured parameter resource handle (AllocationResult.ResourceHandles[*].StructuredData).
2162-
// This field is OPTIONAL. If present, it needs to be used
2163-
// instead of resource_handle. It will only have a single entry.
2164-
//
2165-
// Using "repeated" instead of "optional" is a workaround for https://github.com/gogo/protobuf/issues/713.
2166-
repeated k8s.io.api.resource.v1alpha2.StructuredResourceHandle structured_resource_handle = 5;
21672167
}
21682168
```
21692169

2170-
`resource_handle` and `structured_resource_handle` will be set depending on how
2171-
the claim was allocated. See also KEP #3063.
2170+
The allocation result is intentionally not included here. The content of that
2171+
field is version-dependent. The kubelet would need to discover in which version
2172+
each plugin wants the data, then potentially get the claim multiple times
2173+
because only the apiserver can convert between versions. Instead, each plugin
2174+
is required to get the claim itself using its own credentials. In the most common
2175+
case of one plugin per claim, that doubles the number of GETs for each claim
2176+
(once by the kubelet, once by the plugin).
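
A plugin-side lookup along these lines might look like the following sketch. The `claimGetter` interface stands in for the plugin's own apiserver client and is hypothetical, as are the simplified `claim` and `claimRef` types. Comparing the UID from the request with the UID of the retrieved object is an assumption about how a plugin could detect a claim that was deleted and recreated under the same name; the KEP text above does not mandate it.

```go
package main

import (
	"errors"
	"fmt"
)

// claimRef mirrors the three identifying fields of the Claim message.
type claimRef struct{ Namespace, UID, Name string }

// claim is a hypothetical, simplified stand-in for a ResourceClaim.
type claim struct {
	Namespace, UID, Name string
	Allocated            bool
}

// claimGetter abstracts the plugin's own apiserver client (hypothetical).
type claimGetter interface {
	Get(namespace, name string) (*claim, error)
}

// fetchClaim retrieves the claim named in a NodePrepareResources
// request with the plugin's credentials and checks that it is the
// same, allocated object the kubelet referred to.
func fetchClaim(g claimGetter, ref claimRef) (*claim, error) {
	c, err := g.Get(ref.Namespace, ref.Name)
	if err != nil {
		return nil, fmt.Errorf("get claim %s/%s: %w", ref.Namespace, ref.Name, err)
	}
	if c.UID != ref.UID {
		return nil, errors.New("UID mismatch: claim was replaced")
	}
	if !c.Allocated {
		return nil, errors.New("claim is not allocated")
	}
	return c, nil
}

// fakeGetter is an in-memory claimGetter used for the demonstration.
type fakeGetter map[string]*claim

func (f fakeGetter) Get(namespace, name string) (*claim, error) {
	c, ok := f[namespace+"/"+name]
	if !ok {
		return nil, errors.New("not found")
	}
	return c, nil
}

func main() {
	g := fakeGetter{"default/claim-a": {Namespace: "default", UID: "uid-1", Name: "claim-a", Allocated: true}}
	c, err := fetchClaim(g, claimRef{Namespace: "default", UID: "uid-1", Name: "claim-a"})
	fmt.Println(c != nil, err)
	_, err = fetchClaim(g, claimRef{Namespace: "default", UID: "uid-2", Name: "claim-a"})
	fmt.Println(err)
}
```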
21722177

21732178
```
21742179
message NodePrepareResourcesResponse {

0 commit comments

Comments
 (0)