@@ -93,6 +93,8 @@ SIG Architecture for cross-cutting KEPs).
9393- [ Alternatives] ( #alternatives )
9494 - [ New GRPC] ( #new-grpc )
9595 - [ User Specified Topology Requirement] ( #user-specified-topology-requirement )
96+ - [ Support for SPs that don't Know Attached Nodes] ( #support-for-sps-that-dont-know-attached-nodes )
97+ - [ Try to Eliminate Race Condition] ( #try-to-eliminate-race-condition )
9698- [ Infrastructure Needed (Optional)] ( #infrastructure-needed-optional )
9799<!-- /toc -->
98100
@@ -224,25 +226,40 @@ required:
224226 values :
225227 - cn-beijing-g
226228` ` `
227- The updated affinity would be:
228- ` ` ` yaml
229- required :
230- nodeSelectorTerms :
231- - matchExpressions :
232- - key : topology.kubernetes.io/region
233- operator : In
234- values :
235- - cn-beijing
236- ` ` `
237-
238- Or in the view of CSI accessibility requirement, from:
229+ Or in the view of CSI accessibility requirement:
239230` ` ` json
240231[{"topology.kubernetes.io/zone" : " cn-beijing-g" }]
241232` ` `
242- to:
243- ` ` ` json
244- [{"topology.kubernetes.io/region" : " cn-beijing" }]
245- ` ` `
233+
234+ The workflow:
235+ 1. User create a ` VolumeAttributeClass`:
236+ ` ` ` yaml
237+ apiVersion: storage.k8s.io/v1beta1
238+ kind: VolumeAttributesClass
239+ metadata:
240+ name: regional
241+ driverName: csi.provider.com
242+ parameters:
243+ type: regional
244+ ` ` `
245+ 2. User modify the `volumeAttributesClassName` in the PVC to `regional`
246+ 3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type" : " regional" }`
247+ 4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to
248+ ` ` ` json
249+ [{"topology.kubernetes.io/region": "cn-beijing"}]
250+ ` ` `
251+ 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
252+ ` ` ` yaml
253+ required:
254+ nodeSelectorTerms:
255+ - matchExpressions:
256+ - key: topology.kubernetes.io/region
257+ operator: In
258+ values:
259+ - cn-beijing
260+ ` ` `
261+ then update the PV status to indicate the modification is successful.
262+
246263
247264# ### Story 2
248265
@@ -261,30 +278,51 @@ required:
261278 values:
262279 - available
263280` ` `
264- The updated affinity would be :
265- ` ` ` yaml
266- required:
267- nodeSelectorTerms:
268- - matchExpressions:
269- - key: provider.com/disktype.cloud_essd
270- operator: In
271- values:
272- - available
273- ` ` `
274-
275- Or in the view of CSI accessibility requirement, from :
281+ Or in the view of CSI accessibility requirement :
276282` ` ` json
277283[{"provider.com/disktype.cloud_ssd": "available"}]
278284` ` `
279- to :
280- ` ` ` json
281- [{"provider.com/disktype.cloud_essd": "available"}]
282- ` ` `
283285
284286Type A node only supports cloud_ssd, while Type B node supports both cloud_ssd and cloud_essd.
285287We will only allow the modification if the volume is attached to type B nodes.
286288And I want to make sure the Pods using new cloud_essd volume not to be scheduled to type A nodes.
287289
290+ In this case, it takes long to modify the volume, the new topology is not strictly less restrictive,
291+ and SP wants to minimize the time window of the race condition :
292+
293+ The workflow :
294+ 1. User create a `VolumeAttributeClass` :
295+ ` ` ` yaml
296+ apiVersion: storage.k8s.io/v1beta1
297+ kind: VolumeAttributesClass
298+ metadata:
299+ name: essd
300+ driverName: csi.provider.com
301+ parameters:
302+ type: cloud_essd
303+ ` ` `
304+ 2. User modify the `volumeAttributesClassName` in the PVC to `essd`
305+ 3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type" : " cloud_essd" }`
306+ 4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to
307+ ` ` ` json
308+ [{"provider.com/disktype.cloud_essd": "available"}]
309+ ` ` `
310+ 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
311+ ` ` ` yaml
312+ required:
313+ nodeSelectorTerms:
314+ - matchExpressions:
315+ - key: provider.com/disktype.cloud_essd
316+ operator: In
317+ values:
318+ - available
319+ ` ` `
320+ but the PV status is not updated yet.
321+ From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd : available`,
322+ maybe they will stuck in `ContainerCreating` state until the modification finishes.
323+ 6. external-resizer go back to step 3, retries until `in_progress` is set to false.
324+ 7. external-resizer update the PV status to indicate the modification is successful.
325+
288326
289327# ## Notes/Constraints/Caveats (Optional)
290328
@@ -390,11 +428,14 @@ message ControllerModifyVolumeResponse {
390428 // If it is not specified, the CO MAY assume
391429 // the volume is equally accessible from all nodes in the cluster and
392430 // MAY schedule workloads referencing the volume on any available
393- // node.
431+ // node.
394432 //
395433 // SP MUST only set this field if allow_topology_updates is set
396434 // in the request. SP SHOULD fail the request if it needs to update
397435 // topology but is not allowed by CO.
436+ //
437+ // SP SHOULD only return topology that is a super-set of the
438+ // original topology to avoid race conditions when scheduling.
398439 repeated Topology accessible_topology = 1;
399440
400441 // Indicates whether the modification is still in progress.
@@ -422,42 +463,6 @@ But this KEP does not cover the automatic correction. Kubernetes should only ret
422463
423464Scheduler Enhancements : make sure the Pod is re-queued when the PV is updated.
424465
425- A typical workflow is (taking the user story 1 as an example) :
426- 1. User create a `VolumeAttributeClass` :
427- ` ` ` yaml
428- apiVersion: storage.k8s.io/v1beta1
429- kind: VolumeAttributesClass
430- metadata:
431- name: regional
432- driverName: csi.provider.com
433- parameters:
434- type: regional
435- ` ` `
436- 2. User modify the `volumeAttributesClassName` in the PVC to `regional`
437- 3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type" : " regional" }`
438- 4. CSI driver blocks until the modification finished, then return with `accessible_topology` set to `[{"topology.kubernetes.io/region" : " cn-beijing" }]`
439- 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, then update the PV status to indicate the modification is successful.
440-
441- If it takes long to modify the volume, the new topology is not strictly less restrictive,
442- and SP wants to minimize the time window of the race condition (taking the user story 2 as an example) :
443- 1. User create a `VolumeAttributeClass` :
444- ` ` ` yaml
445- apiVersion: storage.k8s.io/v1beta1
446- kind: VolumeAttributesClass
447- metadata:
448- name: essd
449- driverName: csi.provider.com
450- parameters:
451- type: cloud_essd
452- ` ` `
453- 2. User modify the `volumeAttributesClassName` in the PVC to `essd`
454- 3. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type" : " cloud_essd" }`
455- 4. CSI driver returns with `in_progress` set to true, and `accessible_topology` set to `[{"provider.com/disktype.cloud_essd" : " available" }]`
456- 5. external-resizer sets `PersistentVolume.spec.nodeAffinity` accordingly, but the PV status is not updated yet.
457- From now on, the new Pod will be scheduled to nodes with `provider.com/disktype.cloud_essd : available`,
458- maybe they will stuck in `ContainerCreating` state until the modification finishes.
459- 6. external-resizer go back to step 3, retries until `in_progress` is set to false.
460- 7. external-resizer update the PV status to indicate the modification is successful.
461466
462467
463468# ## Test Plan
@@ -1035,7 +1040,7 @@ rpc ControllerModifyVolumeTopology (ControllerModifyVolumeTopologyRequest)
10351040message ControllerModifyVolumeTopologyRequest {
10361041 option (alpha_message) = true;
10371042 string volume_id = 1;
1038- map<string, string> secrest = 2 [(csi_secret) = true];
1043+ map<string, string> secret = 2 [(csi_secret) = true];
10391044 map<string, string> mutable_parameters = 3;
10401045}
10411046
@@ -1067,16 +1072,100 @@ We've considered a design:
10671072* Add `accessibility_requirements` in `ModifyVolumeRequest`, like that in `CreateVolumeRequest`
10681073* Add `allowedTopologies` in `VolumeAttributeClass`, like that in `StorageClass`
10691074
1070- But facing a lot of unresolved questions :
1075+ We determine this lacks vaild use cases.
1076+
1077+ In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
1078+ An exception could be : modifying a volume from regional to zonal, and it is not attached to any node.
1079+ In this case, SP may need more information from the CO to determine the desired zone.
1080+ But we don't want user to create a separate VAC for each zone.
1081+ Instead, it maybe easier for user to just attach it to a node, so that SP can determine the desired zone.
1082+
1083+ For the other case (User Story 2), the topology (provider.com/disktype.cloud_essd) is actually not intended for user as an API.
1084+ User just want to modify the disk type, and we implement the underlying limitations as topology.
1085+ So we don't want to let user to specify this topology key directly.
1086+
1087+ Besides, this facing a lot of unresolved questions :
10711088* How to merge `allowedTopologies` from `VolumeAttributeClass`, `StorageClass`?
10721089* Should we use `allowedTopologies` from `StorageClass` if it is not specified in `VolumeAttributeClass`?
10731090* Should we consider the topology of the currently attached nodes?
10741091* Should we consider the topology of all the nodes in the cluster?
10751092
1076- In most cases, SP can determine the desired topology of a volume from the `mutable_parameters`, or from the currently attached nodes.
1077- An exception could be : modifying a volume from regional to zonal, and it is not attached to any node.
1078- In this case, SP will need more information from the CO to determine the desired zone.
1079- But we don't have such use case now, we decided leave it as a future work.
1093+ We may consider this again with vaild use cases.
1094+
1095+ # ## Support for SPs that don't Know Attached Nodes
1096+
1097+ Maybe there are some SPs that don't know the currently attached nodes,
1098+ so they cannot determine whether the topology change will break any workload.
1099+
1100+ Some kind of storage does not have persistent connection between client and server, such as object storage like S3.
1101+ They may fall into this category of SP.
1102+ But as network attached storage, they can be accessed wherever the network can reach.
1103+ So these SPs typically do not use the topology feature at all.
1104+
1105+ So, we decide that for an SP to support this feature,
1106+ they are required to properly detect potential breaking for existing workloads.
1107+
1108+ That said, the candidate design looks :
1109+ Add a new `dry_run` parameter to the `ControllerModifyVolumeRequest`.
1110+ CO first call `ControllerModifyVolume` with `dry_run=true` to fetch the new topology,
1111+ determine if the new topology is compatible with the existing workloads,
1112+ then decide whether to proceed the modification with `dry_run=false`.
1113+
1114+ Another way to get the new topology is further extending the "User Specified Topology Requirement" section,
1115+ Making it required for user to explicitly specify the new topology in the VAC and
1116+ remove `accessible_topology` from `ControllerModifyVolumeResponse`.
1117+ That is to say, SP must accept the new topology specified by user or it should reject the request.
1118+ The workflow will become :
1119+ 1. User create a `VolumeAttributeClass` :
1120+ ` ` ` yaml
1121+ apiVersion: storage.k8s.io/v1beta1
1122+ kind: VolumeAttributesClass
1123+ metadata:
1124+ name: regional
1125+ driverName: csi.provider.com
1126+ parameters:
1127+ type: regional
1128+ allowedTopologies:
1129+ - matchLabelExpressions:
1130+ - key: topology.kubernetes.io/region
1131+ values:
1132+ - cn-beijing
1133+ ` ` `
1134+ 2. User modify the `volumeAttributesClassName` in the PVC to `regional`
1135+ 3. external-resizer verifies all the nodes that all the nodes with this volume attached satisfy the `allowedTopologies`
1136+ 4. external-resizer sets `PersistentVolume.spec.nodeAffinity` to
1137+ ` ` ` yaml
1138+ required:
1139+ nodeSelectorTerms:
1140+ - matchExpressions:
1141+ - key: topology.kubernetes.io/region
1142+ operator: In
1143+ values:
1144+ - cn-beijing
1145+ ` ` `
1146+ 5. external-resizer initiate ControllerModifyVolume with `allow_topology_updates` set to true, `mutable_parameters` set to `{"type" : " regional" }`
1147+ 6. CSI driver blocks until the modification finished
1148+ 7. external-resizer then update the PV status to indicate the modification is successful.
1149+
1150+ Besides the reasons mentioned above, we also facing a critical drawback for this design :
1151+ Topology can have many orthogonal aspects, such as above mentioned zone/region and disk type.
1152+ If SP cannot return the topology, user will need to be aware of all aspects of topology used by SP.
1153+ And SP will not able to extend the topology in the future, since VAC is immutable.
1154+
1155+ Note that the above designs are also racy.
1156+ We may still break some workloads that just started after the check but before the modification.
1157+
1158+ # ## Try to Eliminate Race Condition
1159+
1160+ Haven't figured out a reasonable way to do this.
1161+ Basically, we need to
1162+ 1. pause scheduling of the pods that are using the volume;
1163+ 2. get ack from scheduler that pause is effective;
1164+ 3. conduct ModifyVolume;
1165+ 4. retrive the new topology then resume scheduling.
1166+
1167+ I think it is too complex and not worth it.
1168+ Similar failure is already possible with KEP-4876 when CSINode allocatable is out-of-date.
10801169
10811170<!--
10821171What other approaches did you consider, and why did you rule them out? These do
0 commit comments