@@ -110,9 +110,10 @@ SIG Architecture for cross-cutting KEPs).
- [PreBind](#prebind)
- [Unreserve](#unreserve)
- [kubelet](#kubelet)
- - [Managing resources](#managing-resources)
- [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
- - [NodeListAndWatchResources](#nodelistandwatchresources)
+ - [Version skew](#version-skew)
+ - [Security](#security)
+ - [Managing resources](#managing-resources)
- [NodePrepareResource](#nodeprepareresource)
- [NodeUnprepareResources](#nodeunprepareresources)
- [Simulation with CA](#simulation-with-ca)
@@ -531,20 +532,13 @@ the kubelet, as described below. However, the source of this data may vary; for
example, a cloud provider controller could populate this based upon information
from the cloud provider API.

- In the kubelet case, each kubelet publishes a set of
- `ResourceSlice` objects to the API server with content provided by the
- corresponding DRA drivers running on its node. Access control through the node
- authorizer ensures that the kubelet running on one node is not allowed to
- create or modify `ResourceSlices` belonging to another node. A `nodeName`
- field in each `ResourceSlice` object is used to determine which objects are
- managed by which kubelet.
-
- **NOTE:** `ResourceSlices` are published separately for each driver, using
- whatever version of the `resource.k8s.io` API is supported by the kubelet. That
- same version is then also used in the gRPC interface between the kubelet and
- the DRA drivers providing content for those objects. It might be possible to
- support version skew (= keeping kubelet at an older version than the control
- plane and the DRA drivers) in the future, but currently this is out of scope.
+ In the kubelet case, each driver running on a node publishes a set of
+ `ResourceSlice` objects to the API server for its own resources, using its
+ connection to the apiserver. Access control through a validating admission
+ policy can ensure that the drivers running on one node are not allowed to
+ create or modify `ResourceSlices` belonging to another node. The `nodeName`
+ and `driverName` fields in each `ResourceSlice` object are used to determine
+ which objects are managed by which driver instance.
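
For illustration, a driver could publish such an object roughly like this; the sketch assumes the `resource.k8s.io/v1alpha2` Go types and a standard client-go clientset, and the node/driver names are placeholders:

```go
package example

import (
	"context"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// publishSlice creates one ResourceSlice for the local node, using the driver's
// own connection to the apiserver (clientset). Illustrative only; a real driver
// also reconciles existing slices and sets an owner reference to the Node object
// so that its slices are cleaned up together with the node.
func publishSlice(ctx context.Context, clientset kubernetes.Interface, nodeName, driverName string) error {
	slice := &resourcev1alpha2.ResourceSlice{
		ObjectMeta: metav1.ObjectMeta{
			// A node can have more than one slice per driver, so a generated name is used.
			GenerateName: nodeName + "-" + driverName + "-",
		},
		// nodeName and driverName determine which driver instance manages this object
		// and are what an admission policy can check to restrict writes to the own node.
		NodeName:   nodeName,
		DriverName: driverName,
		// The resources themselves go into the embedded "structured model" fields.
	}
	_, err := clientset.ResourceV1alpha2().ResourceSlices().Create(ctx, slice, metav1.CreateOptions{})
	return err
}
```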

Embedded inside each `ResourceSlice` is the representation of the resources
managed by a driver according to a specific "structured model". In the example
@@ -931,7 +925,7 @@ Several components must be implemented or modified in Kubernetes:
ResourceClaim (directly or through a template) and ensure that the
resource is allocated before the Pod gets scheduled, similar to
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go
- - Kubelet must be extended to retrieve information from ResourceClaims
+ - Kubelet must be extended to manage ResourceClaims
and to call a resource kubelet plugin. That plugin returns CDI device ID(s)
which then must be passed to the container runtime.
@@ -1188,13 +1182,13 @@ drivers are expected to be written for Kubernetes.

##### ResourceSlice

- For each node, one or more ResourceSlice objects get created. The kubelet
- publishes them with the node as the owner, so they get deleted when a node goes
+ For each node, one or more ResourceSlice objects get created. The drivers
+ on a node publish them with the node as the owner, so they get deleted when a node goes
down and then gets removed.

All list types are atomic because that makes tracking the owner for
server-side-apply (SSA) simpler. Patching individual list elements is not
- needed and there is a single owner (kubelet).
+ needed and there is a single owner.

```go
// ResourceSlice provides information about available
@@ -2049,6 +2043,56 @@ Unreserve is called in two scenarios:
### kubelet

+ #### Communication between kubelet and resource kubelet plugin
+
+ Resource kubelet plugins are discovered through the [kubelet plugin registration
+ mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
+ new "ResourcePlugin" type will be used in the Type field of the
+ [PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
+ response to distinguish the plugin from device and CSI plugins.
+
+ Under the advertised Unix Domain socket the kubelet plugin provides the
+ k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
+ [CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
+ with “volume” replaced by “resource” and volume specific parts removed.
+
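
As a sketch of the registration half, a DRA driver could implement the registration service along these lines, assuming the `k8s.io/kubelet/pkg/apis/pluginregistration/v1` API; driver name, endpoint, and the supported version string are placeholders:

```go
package example

import (
	"context"

	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

// registrationServer implements the kubelet plugin registration gRPC service
// for a DRA driver. Only the parts relevant to the discussion above are shown.
type registrationServer struct {
	driverName string // e.g. "dra.example.com" (placeholder)
	endpoint   string // socket where the dra gRPC interface is served (placeholder)
}

// GetInfo tells the kubelet what kind of plugin this is. The "ResourcePlugin"
// type distinguishes it from device plugins and CSI plugins.
func (s *registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              "ResourcePlugin",
		Name:              s.driverName,
		Endpoint:          s.endpoint,
		SupportedVersions: []string{"1.0.0"}, // version of the dra gRPC API the plugin speaks (illustrative)
	}, nil
}

// NotifyRegistrationStatus is called by the kubelet with the registration result.
func (s *registrationServer) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	return &registerapi.RegistrationStatusResponse{}, nil
}
```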
+ #### Version skew
+
+ Previously, kubelet retrieved ResourceClaims and published ResourceSlices on
+ behalf of DRA drivers on the node. The information included in those got passed
+ between API server, kubelet, and kubelet plugin using the version of the
+ resource.k8s.io API used by the kubelet. Combining a kubelet using some older API
+ version with a plugin using a new version was not possible because conversion
+ of the resource.k8s.io types is only supported in the API server and an old
+ kubelet wouldn't know about a new version anyway.
+
+ Keeping kubelet at some old release while upgrading the control plane and DRA drivers
+ is desirable and officially supported by Kubernetes. To support the same when
+ using DRA, the kubelet now leaves ResourceSlice handling (almost) entirely to
+ the plugins. The one exception is that it deletes all ResourceSlices on
+ startup. This ensures that no pods depending on DRA get scheduled to the node
+ until the required DRA drivers have started up again. It also ensures that
+ drivers which don't get started up again at all don't leave stale
+ ResourceSlices behind. For the same reasons, the ResourceSlices belonging to a
+ driver get removed when the driver unregisters. This cleanup is done with
+ whatever resource.k8s.io API version is the latest known to the kubelet. To
+ support version skew, support for older API versions must be preserved as far
+ back as support for older kubelet releases is desired.
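
A minimal sketch of that cleanup, assuming a client-go clientset with the v1alpha2 API and a `nodeName` field selector for ResourceSlices (if such a selector is not available, the kubelet has to list and delete the objects individually):

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// wipeResourceSlices removes all ResourceSlices that belong to this node, as the
// kubelet does on startup. Filtering additionally by driver name would give the
// behavior described for driver unregistration.
func wipeResourceSlices(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	return clientset.ResourceV1alpha2().ResourceSlices().DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{FieldSelector: fmt.Sprintf("nodeName=%s", nodeName)},
	)
}
```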
+
+ #### Security
+
+ The DaemonSet of a DRA driver must be configured with a service account
+ that is granted the following permissions:
+ - read/write/patch ResourceSlice
+ - read ResourceClaim
+
+ Ideally, write access to ResourceSlice should be limited to objects belonging
+ to the node. This is possible with a [validating admission
+ policy](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/). As
+ this is not a core feature of the DRA KEP, instructions for how to do that will
+ not be included here. Instead, the DRA example driver will provide an example
+ and documentation.
+
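
For illustration, the permissions listed above could be expressed roughly as the following ClusterRole, shown with the `rbac/v1` Go types to match the other snippets in this KEP; the role name is a placeholder and the exact verb set is a sketch, not a vetted policy:

```go
package example

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// draDriverRole is bound to the driver's service account: read access to
// ResourceClaims, read/write/patch access to ResourceSlices ("write" here
// includes create, update, and delete so that stale slices can be removed).
var draDriverRole = &rbacv1.ClusterRole{
	ObjectMeta: metav1.ObjectMeta{Name: "dra-example-driver"}, // placeholder name
	Rules: []rbacv1.PolicyRule{
		{
			APIGroups: []string{"resource.k8s.io"},
			Resources: []string{"resourceclaims"},
			Verbs:     []string{"get", "list", "watch"},
		},
		{
			APIGroups: []string{"resource.k8s.io"},
			Resources: []string{"resourceslices"},
			Verbs:     []string{"get", "list", "watch", "create", "update", "patch", "delete"},
		},
	},
}
```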

#### Managing resources

kubelet must ensure that resources are ready for use on the node before running
@@ -2068,53 +2112,18 @@ successfully before allowing the pod to be deleted. This ensures that network-at
for other Pods, including those that might get scheduled to other nodes. It
also signals that it is safe to deallocate and delete the ResourceClaim.

-
![kubelet](./kubelet.png)
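
The ordering in the figure can be summarized as follows; all helper functions are hypothetical stand-ins for kubelet internals, used only to show the constraints:

```go
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// Hypothetical stand-ins for kubelet internals; they only illustrate the ordering,
// they are not actual kubelet functions.
var (
	prepareResources   = func(ctx context.Context, pod *v1.Pod) error { return nil } // would call NodePrepareResources
	unprepareResources = func(ctx context.Context, pod *v1.Pod) error { return nil } // would call NodeUnprepareResources
	startContainers    = func(ctx context.Context, pod *v1.Pod) error { return nil }
	removePod          = func(ctx context.Context, pod *v1.Pod) error { return nil }
)

// beforePodStart: resources must be prepared before any container of the pod runs.
func beforePodStart(ctx context.Context, pod *v1.Pod) error {
	if err := prepareResources(ctx, pod); err != nil {
		return err // do not start the pod until preparation succeeded
	}
	return startContainers(ctx, pod)
}

// beforePodDeletion: NodeUnprepareResources must succeed before the pod is deleted,
// so that network-attached resources are free again and the claim can be deallocated.
func beforePodDeletion(ctx context.Context, pod *v1.Pod) error {
	if err := unprepareResources(ctx, pod); err != nil {
		return err // keep the pod and retry
	}
	return removePod(ctx, pod)
}
```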

- #### Communication between kubelet and resource kubelet plugin
-
- Resource kubelet plugins are discovered through the [kubelet plugin registration
- mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
- new "ResourcePlugin" type will be used in the Type field of the
- [PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
- response to distinguish the plugin from device and CSI plugins.
-
- Under the advertised Unix Domain socket the kubelet plugin provides the
- k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
- [CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
- with “volume” replaced by “resource” and volume specific parts removed.
-
- ##### NodeListAndWatchResources
-
- NodeListAndWatchResources returns a stream of NodeResourcesResponse objects.
- At the start and whenever resource availability changes, the
- plugin must send one such object with all information to the kubelet. The
- kubelet then syncs that information with ResourceSlice objects.
-
- ```
- message NodeListAndWatchResourcesRequest {
- }
-
- message NodeListAndWatchResourcesResponse {
-   repeated k8s.io.api.resource.v1alpha2.ResourceModel resources = 1;
- }
- ```
-

##### NodePrepareResource

This RPC is called by the kubelet when a Pod that wants to use the specified
resource is scheduled on a node. The Plugin SHALL assume that this RPC will be
executed on the node where the resource will be used.

- ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name and
- one of the ResourceHandles from the ResourceClaimStatus.AllocationResult with
- a matching DriverName should be passed to the Plugin as parameters to identify
+ ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, and ResourceClaim.Name are
+ passed to the Plugin as parameters to identify
the claim and perform resource preparation.

- ResourceClaim parameters (namespace, UUID, name) are useful for debugging.
- They enable the Plugin to retrieve the full ResourceClaim object, should it
- ever be needed (normally it shouldn't).
-

The Plugin SHALL return fully qualified device name[s].

The Plugin SHALL ensure that there are JSON file[s] in CDI format
@@ -2155,20 +2164,16 @@ message Claim {
// The name of the Resource claim (ResourceClaim.meta.Name)
// This field is REQUIRED.
string name = 3;
- // Resource handle (AllocationResult.ResourceHandles[*].Data)
- // This field is OPTIONAL.
- string resource_handle = 4;
- // Structured parameter resource handle (AllocationResult.ResourceHandles[*].StructuredData).
- // This field is OPTIONAL. If present, it needs to be used
- // instead of resource_handle. It will only have a single entry.
- //
- // Using "repeated" instead of "optional" is a workaround for https://github.com/gogo/protobuf/issues/713.
- repeated k8s.io.api.resource.v1alpha2.StructuredResourceHandle structured_resource_handle = 5;
}
```

- `resource_handle` and `structured_resource_handle` will be set depending on how
- the claim was allocated. See also KEP #3063.
+ The allocation result is intentionally not included here. The content of that
+ field is version-dependent. The kubelet would need to discover in which version
+ each plugin wants the data, then potentially get the claim multiple times
+ because only the apiserver can convert between versions. Instead, each plugin
+ is required to get the claim itself using its own credentials. In the most common
+ case of one plugin per claim, that doubles the number of GETs for each claim
+ (once by the kubelet, once by the plugin).
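
To make "get the claim itself" concrete, a plugin handler could fetch and validate the claim roughly like this, assuming a client-go clientset built from the driver's own credentials and the v1alpha2 types; the function name and signature are illustrative, and the parameters correspond to the namespace/UID/name fields of the Claim message above:

```go
package example

import (
	"context"
	"fmt"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// getAllocatedClaim fetches the ResourceClaim identified by the gRPC Claim message
// and returns its allocation result. The clientset uses the plugin's own credentials,
// which is why the driver needs read access to ResourceClaims (see Security above).
func getAllocatedClaim(ctx context.Context, clientset kubernetes.Interface, namespace, name string, uid types.UID) (*resourcev1alpha2.AllocationResult, error) {
	claim, err := clientset.ResourceV1alpha2().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	// The UID guards against acting on a recreated claim with the same name.
	if claim.UID != uid {
		return nil, fmt.Errorf("claim %s/%s was recreated (UID %s != %s)", namespace, name, claim.UID, uid)
	}
	if claim.Status.Allocation == nil {
		return nil, fmt.Errorf("claim %s/%s has not been allocated", namespace, name)
	}
	return claim.Status.Allocation, nil
}
```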

```
message NodePrepareResourcesResponse {