@@ -286,11 +286,13 @@ Whoever modifies the `PersistentVolume.spec.nodeAffinity` field should ensure th
286286no running Pods on nodes with incompatible labels are using the PV.
287287Kubernetes will not verify this, because such a check would be expensive and racy.
288288
289- If the incompatibility does happen, we don't guarantee that those Pods will continue to run without any issue.
289+ If the incompatibility does happen (i.e. someone updated nodeAffinity, making running Pods violate the new nodeAffinity),
290+ we don't guarantee that those Pods will continue to run without any issue.
290291However, we try our best not to interrupt them:
291292- For volumes that are not yet present in the `Node.status.volumesAttached` field,
292293 we fail the Pods that use them, since we are sure the Pods have never been running.
293294 (see [Handling race condition](#handling-race-condition) below)
295+ - We will not detach the volume. So if the volume is actually accessible (depends on the storage provider), the Pod can continue to run.
294296- For CSI drivers with `requiresRepublish` set to true, we will stop calling NodePublishVolume periodically, and an event is emitted.
295297- For CSI drivers with `requiresRepublish` set to false, an event is emitted on kubelet restart. Otherwise the pod should continue to run.
296298The nodeAffinity is not re-evaluated while the pod is already running.
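
The handling rules above can be sketched as a small decision function. This is only an illustrative sketch, not the kubelet's implementation: the `Node`, `PersistentVolume`, and `Action` types are hypothetical, the affinity check supports only a single term with `In`-style matching, and attached volumes are keyed by PV name for simplicity.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    FAIL_POD = "fail the pod"
    KEEP_RUNNING = "do not detach; emit an event"


@dataclass
class Node:
    labels: dict
    # Hypothetical stand-in for Node.status.volumesAttached, keyed by PV name.
    volumes_attached: set = field(default_factory=set)


@dataclass
class PersistentVolume:
    name: str
    # Simplified nodeAffinity: label key -> list of allowed values ("In" only).
    required_labels: dict


def node_matches(pv, node):
    """Return True if every required key has an allowed value on the node."""
    return all(
        node.labels.get(key) in values
        for key, values in pv.required_labels.items()
    )


def handle_pod_volume(pv, node):
    """Decide what to do with a Pod on `node` that uses `pv`."""
    if node_matches(pv, node):
        return None  # No incompatibility, nothing to do.
    if pv.name not in node.volumes_attached:
        # The volume was never attached, so the Pod has never been running.
        return Action.FAIL_POD
    # The volume is attached: leave it alone so the Pod may keep running.
    return Action.KEEP_RUNNING
```

For example, a Pod whose volume is already attached to a node in the wrong zone yields `Action.KEEP_RUNNING`, while the same mismatch for a volume that was never attached yields `Action.FAIL_POD`.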
@@ -343,7 +345,7 @@ There is a race condition between volume modification and pod scheduling:
3433455. KCM/external-attacher attaches the volume to the node, and finds the affinity mismatch.
344346
345347If this happens, the pod will be stuck in a `ContainerCreating` state.
346- Kubelet should detect this contidion and reject the pod.
348+ Kubelet should detect this condition and reject the pod.
347349Hopefully some other controller (e.g. the StatefulSet controller) will re-create the pod, and it will be scheduled to the correct node.
348350
349351Specifically, kubelet should reject the pod (setting pod phase to 'Failed')
@@ -585,8 +587,10 @@ enhancement:
585587
586588This feature involves changes to the kubelet and APIServer, but they are not strongly coupled.
587589
588- an n-3 kubelet will not able to fail the mis-scheduled pods. User can still manually delete the pods. Otherwise it should be fine.
589- an new kubelt can also work with old APIServer. Although this should not happen.
590+ An n-3 kubelet will not be able to fail the mis-scheduled pods. The mis-scheduled pods will be stuck in `ContainerCreating` status.
591+ If the kubelet is upgraded afterwards, it will properly fail those pods.
592+ Users can also manually delete the pods if they do not want to upgrade the kubelet soon.
593+ If the user does not actually update the PV nodeAffinity, there will be no such mis-scheduled pods and everything should be fine.
590594
591595kube-scheduler is not directly affected.
592596It just reads the latest PV nodeAffinity for its scheduling decisions, regardless of whether it is being updated or not.
@@ -651,9 +655,10 @@ PV `spec.nodeAffinity` becomes mutable.
651655If a pod is scheduled to a node that is incompatible with the PV's nodeAffinity, the pod will fail.
652656Previously, it would be stuck in `ContainerCreating` status.
653657
654- This should be rare, since we don't allow PV nodeAffinity to be updated,
658+ This should be rare before enabling this feature, since we don't allow PV nodeAffinity to be updated,
655659nor can a CSI driver change the topology reported from NodeGetInfo.
656660So this is only possible if the user edited the node labels manually, or is running an incompatible scheduler.
661+ Existing workflows are unlikely to be affected by this behavior change.
657662
658663###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
659664
@@ -690,7 +695,7 @@ You can take a look at one potential example of such test in:
690695https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
691696-->
692697
693- Yes. unit test will verify the validation and kubelet behavior when the feature gate is enabled or disabled.
698+ Unit tests will be added to verify the validation and kubelet behavior when the feature gate is enabled or disabled.
694699
695700### Rollout, Upgrade and Rollback Planning
696701
@@ -766,7 +771,15 @@ and operation of this feature.
766771Recall that end users cannot usually observe component logs or access metrics.
767772-->
768773
769- See a previously Pending or ContainerCreating Pod now properly Running.
774+ 1. nodeAffinity can now be updated for existing volumes.
775+ 2. Pods that cannot run because their volume cannot be attached are now failed by kubelet.
776+
777+ As a consequence, if a Pod was previously stuck due to an out-of-date PV nodeAffinity,
778+ the user can now update the PV to correct the nodeAffinity and see the Pod eventually enter the Running state.
779+ For Pods stuck in ContainerCreating because the storage provider is unable to attach the volume to the scheduled node,
780+ the Pod will be rejected by kubelet and re-created on the correct node.
781+ For Pods stuck in Pending because no suitable node is available,
782+ the scheduler will retry scheduling the Pod according to the updated nodeAffinity.
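
To illustrate what "update the PV to correct the nodeAffinity" might look like, here is a minimal sketch that builds a patch body for `spec.nodeAffinity`. The helper name `node_affinity_patch`, the zone value `zone-b`, and the choice of sending it as a merge patch are assumptions made for illustration; only the field layout (`required.nodeSelectorTerms[].matchExpressions[]`) follows the PersistentVolume API.

```python
import json


def node_affinity_patch(key, values):
    """Build a patch body that sets a single-term PV nodeAffinity.

    Hypothetical helper: a single term with one "In" expression is assumed.
    """
    return {
        "spec": {
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": key,
                                    "operator": "In",
                                    "values": list(values),
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }


# Example: move the PV to a (made-up) zone named "zone-b".
patch = node_affinity_patch("topology.kubernetes.io/zone", ["zone-b"])
print(json.dumps(patch, indent=2))
```

Because the whole `nodeSelectorTerms` list is given explicitly, the body states the complete desired affinity rather than a delta, which keeps the intended end state easy to review.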
770783
771784###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
772785