@@ -78,7 +78,7 @@ SIG Architecture for cross-cutting KEPs).
- [Risks and Mitigations](#risks-and-mitigations)
- [Feature not used](#feature-not-used)
- [Compromised node](#compromised-node)
- - [Compromised resource driver plugin](#compromised-resource-driver-plugin)
+ - [Compromised kubelet plugin](#compromised-kubelet-plugin)
- [User permissions and quotas](#user-permissions-and-quotas)
- [Usability](#usability)
- [Design Details](#design-details)
@@ -110,9 +110,10 @@ SIG Architecture for cross-cutting KEPs).
- [PreBind](#prebind)
- [Unreserve](#unreserve)
- [kubelet](#kubelet)
- - [Managing resources](#managing-resources)
- [Communication between kubelet and resource kubelet plugin](#communication-between-kubelet-and-resource-kubelet-plugin)
- - [NodeListAndWatchResources](#nodelistandwatchresources)
+ - [Version skew](#version-skew)
+ - [Security](#security)
+ - [Managing resources](#managing-resources)
- [NodePrepareResource](#nodeprepareresource)
- [NodeUnprepareResources](#nodeunprepareresources)
- [Simulation with CA](#simulation-with-ca)
@@ -526,25 +527,35 @@ In production, a similar PodTemplateSpec in a Deployment will be used.
### Publishing node resources

The resources available on a node need to be published to the API server. In
- the typical case, this is expected to be published by the on-node driver via
- the kubelet, as described below. However, the source of this data may vary; for
+ the typical case, this is expected to be published by the on-node driver
+ as described in the next paragraph. However, the source of this data may vary; for
example, a cloud provider controller could populate this based upon information
from the cloud provider API.

- In the kubelet case, each kubelet publishes a set of
- `ResourceSlice` objects to the API server with content provided by the
- corresponding DRA drivers running on its node. Access control through the node
- authorizer ensures that the kubelet running on one node is not allowed to
- create or modify `ResourceSlices` belonging to another node. A `nodeName`
- field in each `ResourceSlice` object is used to determine which objects are
- managed by which kubelet.
-
- **NOTE:** `ResourceSlices` are published separately for each driver, using
- whatever version of the `resource.k8s.io` API is supported by the kubelet. That
- same version is then also used in the gRPC interface between the kubelet and
- the DRA drivers providing content for those objects. It might be possible to
- support version skew (= keeping kubelet at an older version than the control
- plane and the DRA drivers) in the future, but currently this is out of scope.
+ In the kubelet case, each driver running on a node publishes a set of
+ `ResourceSlice` objects to the API server for its own resources, using its
+ connection to the apiserver. Access control through a validating admission
+ policy can ensure that the drivers running on one node are not allowed to
+ create or modify `ResourceSlices` belonging to another node. The `nodeName`
+ and `driverName` fields in each `ResourceSlice` object are used to determine which objects are
+ managed by which driver instance. The owner reference ensures that objects
+ belonging to a node get cleaned up when the node gets removed.
+
+ In addition, whenever kubelet starts, it first deletes all `ResourceSlices`
+ belonging to the node with a `DeleteCollection` call that uses the node name in
+ a field filter. This ensures that no pods depending on DRA get scheduled to the
+ node until the required DRA drivers have started up again (node reboot) and
+ reconnected to kubelet (kubelet restart). It also ensures that drivers which
+ don't get started up again at all don't leave stale `ResourceSlices`
+ behind. Garbage collection does not help in this case because the node object
+ still exists. For the same reasons, the ResourceSlices belonging to a driver
+ get removed when the driver unregisters, this time with a field filter for node
+ name and driver name.
+
+ Deleting `ResourceSlices` is possible because all information in them can be
+ reconstructed by the driver. This has no effect on already allocated claims
+ because the allocation result is tracked in those claims, not the
+ `ResourceSlice` objects (see [below](#state-and-communication)).
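The two cleanup paths above (kubelet start: everything on the node; driver unregistration: node plus driver name) can be sketched with plain Go. This is an illustrative model, not kubelet code; `ResourceSlice` here is a trimmed stand-in containing only the fields the field filters select on.

```go
package main

import "fmt"

// ResourceSlice is a trimmed-down stand-in for the real API object with
// just the fields that the DeleteCollection field filters select on.
type ResourceSlice struct {
	Name       string
	NodeName   string
	DriverName string
}

// matchesFilter mimics the field filter: node name alone (kubelet start)
// or node name plus driver name (driver unregistration). An empty
// driverName matches the slices of every driver on the node.
func matchesFilter(s ResourceSlice, nodeName, driverName string) bool {
	return s.NodeName == nodeName && (driverName == "" || s.DriverName == driverName)
}

// deleteCollection returns the slices that survive the filtered deletion.
func deleteCollection(slices []ResourceSlice, nodeName, driverName string) []ResourceSlice {
	var kept []ResourceSlice
	for _, s := range slices {
		if !matchesFilter(s, nodeName, driverName) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	slices := []ResourceSlice{
		{Name: "gpu-node1", NodeName: "node1", DriverName: "gpu.example.com"},
		{Name: "nic-node1", NodeName: "node1", DriverName: "nic.example.com"},
		{Name: "gpu-node2", NodeName: "node2", DriverName: "gpu.example.com"},
	}
	// Driver unregistration on node1: only that driver's slice goes away.
	fmt.Println(len(deleteCollection(slices, "node1", "gpu.example.com"))) // 2
	// Kubelet start on node1: all of node1's slices go away.
	fmt.Println(len(deleteCollection(slices, "node1", ""))) // 1
}
```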

Embedded inside each `ResourceSlice` is the representation of the resources
managed by a driver according to a specific "structured model". In the example
@@ -842,7 +853,7 @@ return quickly without doing any work for pods.

#### Compromised node

- Kubelet is intentionally limited to read-only access for ResourceClass and ResourceClaim
+ Kubelet is intentionally limited to read-only access for ResourceClaim
to prevent that a
compromised kubelet interferes with scheduling of pending pods, for example
by updating status information normally set by the scheduler.
@@ -863,22 +874,24 @@ allocation, such an attack is still possible, but the attack code would have to
be different for each resource driver because all of them will use structured
parameters differently for reporting resource availability.

- #### Compromised resource driver plugin
+ #### Compromised kubelet plugin

This is the result of an attack against the resource driver, either from a
container which uses a resource exposed by the driver, a compromised kubelet
which interacts with the plugin, or due to resource driver running on a node
with a compromised root account.

- The resource driver plugin only needs read access to objects described in this
- KEP, so compromising it does not interfere with dynamic resource allocation for
- other drivers.
+ The resource driver needs write access for ResourceSlices. It can be deployed so
+ that it can only write objects associated with the node, so the impact of a
+ compromise would be limited to the node. Other drivers on the node could also
+ be impacted because there is no separation by driver.

- A resource driver may need root access on the node to manage
+ However, a resource driver may need root access on the node to manage
hardware. Attacking the driver therefore may lead to root privilege
escalation. Ideally, driver authors should try to avoid depending on root
permissions and instead use capabilities or special permissions for the kernel
- APIs that they depend on.
+ APIs that they depend on. Long term, limiting apiserver access by driver
+ name would be useful.

A resource driver may also need privileged access to remote services to manage
network-attached devices. Resource driver vendors and cluster administrators
@@ -931,7 +944,7 @@ Several components must be implemented or modified in Kubernetes:
ResourceClaim (directly or through a template) and ensure that the
resource is allocated before the Pod gets scheduled, similar to
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/scheduling/scheduler_binder.go
- - Kubelet must be extended to retrieve information from ResourceClaims
+ - Kubelet must be extended to manage ResourceClaims
and to call a resource kubelet plugin. That plugin returns CDI device ID(s)
which then must be passed to the container runtime.
@@ -1188,13 +1201,13 @@ drivers are expected to be written for Kubernetes.

##### ResourceSlice

- For each node, one or more ResourceSlice objects get created. The kubelet
- publishes them with the node as the owner, so they get deleted when a node goes
+ For each node, one or more ResourceSlice objects get created. The drivers
+ on a node publish them with the node as the owner, so they get deleted when a node goes
down and then gets removed.

All list types are atomic because that makes tracking the owner for
server-side-apply (SSA) simpler. Patching individual list elements is not
- needed and there is a single owner (kubelet).
+ needed and there is a single owner.

```go
// ResourceSlice provides information about available
@@ -2049,6 +2062,53 @@ Unreserve is called in two scenarios:

### kubelet

+ #### Communication between kubelet and resource kubelet plugin
+
+ Resource kubelet plugins are discovered through the [kubelet plugin registration
+ mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
+ new "ResourcePlugin" type will be used in the Type field of the
+ [PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
+ response to distinguish the plugin from device and CSI plugins.
+
+ Under the advertised Unix Domain socket the kubelet plugin provides the
+ k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
+ [CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
+ with “volume” replaced by “resource” and volume specific parts removed.
+
+ #### Version skew
+
+ Previously, kubelet retrieved ResourceClaims and published ResourceSlices on
+ behalf of DRA drivers on the node. The information included in those got passed
+ between API server, kubelet, and kubelet plugin using the version of the
+ resource.k8s.io API used by the kubelet. Combining a kubelet using some older API
+ version with a plugin using a new version was not possible because conversion
+ of the resource.k8s.io types is only supported in the API server and an old
+ kubelet wouldn't know about a new version anyway.
+
+ Keeping kubelet at some old release while upgrading the control plane and DRA drivers
+ is desirable and officially supported by Kubernetes. To support the same when
+ using DRA, the kubelet now leaves [ResourceSlice
+ handling](#publishing-node-resources) almost entirely to the plugins. The
+ remaining calls are done with whatever resource.k8s.io API version is the
+ latest known to the kubelet. To support version skew, support for older API
+ versions must be preserved as far back as support for older kubelet releases is
+ desired.
+
+ #### Security
+
+ The daemonset of a DRA driver must be configured to have a service account
+ which grants the following permissions:
+ - get/list/watch/create/update/patch/delete ResourceSlice
+ - get ResourceClaim
+ - get Node
+
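The permission list above could be expressed as RBAC along these lines (an illustrative sketch; the ClusterRole name is a placeholder, and restricting writes to the driver's own node additionally needs the validating admission policy discussed in this section):

```yaml
# Illustrative ClusterRole for a DRA driver's daemonset service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dra-driver-example
rules:
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceslices"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["resource.k8s.io"]
  resources: ["resourceclaims"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
```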
+ Ideally, write access to ResourceSlice should be limited to objects belonging
+ to the node. This is possible with a [validating admission
+ policy](https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/). As
+ this is not a core feature of the DRA KEP, instructions for how to do that will
+ not be included here. Instead, the DRA example driver will provide an example
+ and documentation.
+
#### Managing resources

kubelet must ensure that resources are ready for use on the node before running
@@ -2068,53 +2128,18 @@ successfully before allowing the pod to be deleted. This ensures that network-at
for other Pods, including those that might get scheduled to other nodes. It
also signals that it is safe to deallocate and delete the ResourceClaim.

-
![kubelet](./kubelet.png)

- #### Communication between kubelet and resource kubelet plugin
-
- Resource kubelet plugins are discovered through the [kubelet plugin registration
- mechanism](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-registration). A
- new "ResourcePlugin" type will be used in the Type field of the
- [PluginInfo](https://pkg.go.dev/k8s.io/kubelet/pkg/apis/pluginregistration/v1#PluginInfo)
- response to distinguish the plugin from device and CSI plugins.
-
- Under the advertised Unix Domain socket the kubelet plugin provides the
- k8s.io/kubelet/pkg/apis/dra gRPC interface. It was inspired by
- [CSI](https://github.com/container-storage-interface/spec/blob/master/spec.md),
- with “volume” replaced by “resource” and volume specific parts removed.
-
- ##### NodeListAndWatchResources
-
- NodeListAndWatchResources returns a stream of NodeResourcesResponse objects.
- At the start and whenever resource availability changes, the
- plugin must send one such object with all information to the kubelet. The
- kubelet then syncs that information with ResourceSlice objects.
-
- ```
- message NodeListAndWatchResourcesRequest {
- }
-
- message NodeListAndWatchResourcesResponse {
-   repeated k8s.io.api.resource.v1alpha2.ResourceModel resources = 1;
- }
- ```
-
##### NodePrepareResource

This RPC is called by the kubelet when a Pod that wants to use the specified
resource is scheduled on a node. The Plugin SHALL assume that this RPC will be
executed on the node where the resource will be used.

- ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, ResourceClaim.Name and
- one of the ResourceHandles from the ResourceClaimStatus.AllocationResult with
- a matching DriverName should be passed to the Plugin as parameters to identify
+ ResourceClaim.meta.Namespace, ResourceClaim.meta.UID, and ResourceClaim.Name are
+ passed to the Plugin as parameters to identify
the claim and perform resource preparation.

- ResourceClaim parameters (namespace, UUID, name) are useful for debugging.
- They enable the Plugin to retrieve the full ResourceClaim object, should it
- ever be needed (normally it shouldn't).
-
The Plugin SHALL return fully qualified device name[s].

The Plugin SHALL ensure that there are json file[s] in CDI format
@@ -2155,20 +2180,16 @@ message Claim {
  // The name of the Resource claim (ResourceClaim.meta.Name)
  // This field is REQUIRED.
  string name = 3;
-   // Resource handle (AllocationResult.ResourceHandles[*].Data)
-   // This field is OPTIONAL.
-   string resource_handle = 4;
-   // Structured parameter resource handle (AllocationResult.ResourceHandles[*].StructuredData).
-   // This field is OPTIONAL. If present, it needs to be used
-   // instead of resource_handle. It will only have a single entry.
-   //
-   // Using "repeated" instead of "optional" is a workaround for https://github.com/gogo/protobuf/issues/713.
-   repeated k8s.io.api.resource.v1alpha2.StructuredResourceHandle structured_resource_handle = 5;
}
```

- `resource_handle` and `structured_resource_handle` will be set depending on how
- the claim was allocated. See also KEP #3063.
+ The allocation result is intentionally not included here. The content of that
+ field is version-dependent. The kubelet would need to discover in which version
+ each plugin wants the data, then potentially get the claim multiple times
+ because only the apiserver can convert between versions. Instead, each plugin
+ is required to get the claim itself using its own credentials. In the most common
+ case of one plugin per claim, that doubles the number of GETs for each claim
+ (once by the kubelet, once by the plugin).
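The lookup a plugin has to perform can be sketched with plain Go. This is an illustrative model, not real plugin code: the map stands in for a GET against the apiserver with the plugin's own credentials, and checking the UID against the fetched object is one plausible use of the UID field (it guards against a claim that was deleted and recreated under the same name).

```go
package main

import "fmt"

// Claim mirrors the gRPC message above: just enough to identify the object.
type Claim struct {
	Namespace string
	UID       string
	Name      string
}

// ResourceClaim stands in for the full API object, including the
// version-dependent allocation result that the gRPC message omits.
type ResourceClaim struct {
	UID        string
	Allocation string
}

// getClaim models the plugin fetching the ResourceClaim itself instead of
// receiving the allocation result over gRPC from the kubelet.
func getClaim(store map[string]ResourceClaim, req Claim) (ResourceClaim, error) {
	rc, ok := store[req.Namespace+"/"+req.Name]
	if !ok {
		return ResourceClaim{}, fmt.Errorf("claim %s/%s not found", req.Namespace, req.Name)
	}
	if rc.UID != req.UID {
		return ResourceClaim{}, fmt.Errorf("claim %s/%s: UID %s does not match %s",
			req.Namespace, req.Name, rc.UID, req.UID)
	}
	return rc, nil
}

func main() {
	store := map[string]ResourceClaim{
		"ns1/gpu": {UID: "uid-1", Allocation: "device-0"},
	}
	rc, err := getClaim(store, Claim{Namespace: "ns1", UID: "uid-1", Name: "gpu"})
	fmt.Println(rc.Allocation, err) // device-0 <nil>
	_, err = getClaim(store, Claim{Namespace: "ns1", UID: "stale", Name: "gpu"})
	fmt.Println(err != nil) // true
}
```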

```
message NodePrepareResourcesResponse {