@@ -73,6 +73,7 @@ SIG Architecture for cross-cutting KEPs).
7373 - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
7474 - [Risks and Mitigations](#risks-and-mitigations)
7575- [Design Details](#design-details)
76+ - [Handling race condition](#handling-race-condition)
7677 - [Test Plan](#test-plan)
7778 - [Prerequisite testing updates](#prerequisite-testing-updates)
7879 - [Unit tests](#unit-tests)
@@ -280,17 +281,6 @@ This might be a good place to talk about core concepts and how they relate.
280281It is never re-evaluated when the pod is already running.
281282It is storage provider's responsibility to ensure that the running workload is not interrupted.
282283
283- **Possible race condition**
284-
285- There is a race condition between volume modification and pod scheduling :
286- 1. User modifies the volume from storage provider.
287- 3. A new Pod is created and scheduler schedules it with the old affinity.
288- 4. User sets the new affinity to the PV.
289- 5. KCM/external-attacher attaches the volume to the node, and find the affinity mismatch.
290-
291- If this happens, the pod will be stuck in a `ContainerCreating` state.
292- User will have to manually delete the pod, or using Kubernetes [descheduler](https://github.com/kubernetes-sigs/descheduler) or similar.
293-
294284
295285### Risks and Mitigations
296286
@@ -315,6 +305,21 @@ required) or even code snippets. If there's any ambiguity about HOW your
315305proposal will be implemented, this is the place to discuss them.
316306-->
317307
308+ ### Handling race condition
309+
310+ There is a race condition between volume modification and pod scheduling:
311+ 1. The user modifies the volume on the storage provider side.
312+ 2. A new Pod is created and the scheduler schedules it using the old affinity.
313+ 3. The user sets the new affinity on the PV.
314+ 4. KCM/external-attacher attaches the volume to the node and finds the affinity mismatch.
315+
316+ If this happens, the pod will be stuck in the `ContainerCreating` state.
317+ Kubelet should detect this condition and reject the pod.
318+ A higher-level controller (for example, the StatefulSet controller) is then expected to re-create the pod, and the replacement will be scheduled to a node that satisfies the new affinity.
319+
320+ Specifically, kubelet investigates the cause of the failure by checking the status of the underlying VolumeAttachment object.
321+ If a `FailedPrecondition` error is found and the PV's nodeAffinity no longer matches the current node,
322+ kubelet sets the pod phase to `Failed`.
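+
+ The kubelet-side check could look roughly like the sketch below. This is a
+ minimal illustration rather than the final implementation: the helper name
+ `shouldFailPodForMismatchedAffinity`, its placement, and the plain substring
+ match on the attach error message are assumptions made for this sketch.
+
+ ```go
+ // Package and function names are illustrative only.
+ package volumeaffinity
+
+ import (
+ 	"strings"
+
+ 	v1 "k8s.io/api/core/v1"
+ 	storagev1 "k8s.io/api/storage/v1"
+ 	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
+ )
+
+ // shouldFailPodForMismatchedAffinity decides whether kubelet should move a
+ // stuck pod to the Failed phase because its volume can no longer be attached
+ // to this node after the PV's nodeAffinity was updated.
+ func shouldFailPodForMismatchedAffinity(att *storagev1.VolumeAttachment, pv *v1.PersistentVolume, node *v1.Node) bool {
+ 	// The attach operation must have failed with a FailedPrecondition error
+ 	// reported via KCM/external-attacher in the VolumeAttachment status.
+ 	// (Matching on the message text is a simplification.)
+ 	if att == nil || att.Status.AttachError == nil ||
+ 		!strings.Contains(att.Status.AttachError.Message, "FailedPrecondition") {
+ 		return false
+ 	}
+
+ 	// The PV's (possibly updated) nodeAffinity must no longer match this node.
+ 	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
+ 		return false // no affinity, so a mismatch cannot be the cause
+ 	}
+ 	selector, err := nodeaffinity.NewNodeSelector(pv.Spec.NodeAffinity.Required)
+ 	if err != nil {
+ 		return false // malformed affinity; leave the pod alone
+ 	}
+ 	if selector.Match(node) {
+ 		return false // affinity still matches, the failure has another cause
+ 	}
+
+ 	// Both conditions hold: kubelet should set the pod phase to Failed so
+ 	// that a higher-level controller can re-create the pod elsewhere.
+ 	return true
+ }
+ ```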
318323
319324### Test Plan
320325