Commit 19d29ec

committed
reject mis-scheduled pod
1 parent 95584e0 commit 19d29ec

File tree

1 file changed

+16
-11
lines changed
  • keps/sig-storage/5381-mutable-pv-affinity

keps/sig-storage/5381-mutable-pv-affinity/README.md

Lines changed: 16 additions & 11 deletions
@@ -73,6 +73,7 @@ SIG Architecture for cross-cutting KEPs).
7373
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
7474
- [Risks and Mitigations](#risks-and-mitigations)
7575
- [Design Details](#design-details)
76+
- [Handling race condition](#handling-race-condition)
7677
- [Test Plan](#test-plan)
7778
- [Prerequisite testing updates](#prerequisite-testing-updates)
7879
- [Unit tests](#unit-tests)
@@ -280,17 +281,6 @@ This might be a good place to talk about core concepts and how they relate.
280281
It is never re-evaluated when the pod is already running.
281282
It is storage provider's responsibility to ensure that the running workload is not interrupted.
282283

283-
**Possible race condition**
284-
285-
There is a race condition between volume modification and pod scheduling:
286-
1. User modifies the volume from storage provider.
287-
3. A new Pod is created and scheduler schedules it with the old affinity.
288-
4. User sets the new affinity to the PV.
289-
5. KCM/external-attacher attaches the volume to the node and finds the affinity mismatch.
290-
291-
If this happens, the pod will be stuck in a `ContainerCreating` state.
292-
The user will have to delete the pod manually, or use the Kubernetes [descheduler](https://github.com/kubernetes-sigs/descheduler) or a similar tool.
293-
294284

295285
### Risks and Mitigations
296286

@@ -315,6 +305,21 @@ required) or even code snippets. If there's any ambiguity about HOW your
315305
proposal will be implemented, this is the place to discuss them.
316306
-->
317307

308+
### Handling race condition
309+
310+
There is a race condition between volume modification and pod scheduling:
311+
1. User modifies the volume from storage provider.
312+
3. A new Pod is created and scheduler schedules it with the old affinity.
313+
4. User sets the new affinity to the PV.
314+
5. KCM/external-attacher attaches the volume to the node and finds the affinity mismatch.
315+
316+
If this happens, the pod will be stuck in a `ContainerCreating` state.
317+
Kubelet should detect this condition and reject the pod.
318+
If the pod is managed by a controller (for example, the StatefulSet controller), that controller is expected to re-create the pod so that it can be scheduled to the correct node.
319+
320+
Specifically, kubelet investigates the cause of the failure by checking the status of the underlying VolumeAttachment object.
321+
If a `FailedPrecondition` error is found and the PV's nodeAffinity does not match the current node,
322+
kubelet will set the pod phase to `Failed`.
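
The rejection rule described above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the kubelet implementation: `attachStatus`, `pvAffinity`, `affinityMatches`, and `shouldRejectPod` are hypothetical names, the affinity model is reduced to exact label matches, and detecting `FailedPrecondition` by searching the attach error message is an assumption, since the KEP text does not say how the error is recognized.

```go
package main

import "strings"

// attachStatus is a hypothetical, simplified stand-in for the parts of a
// VolumeAttachment status that matter here: whether the volume is attached
// and the message of the last attach error, if any.
type attachStatus struct {
	Attached     bool
	ErrorMessage string
}

// pvAffinity is a simplified model of a PV's required node affinity:
// terms are ORed, and the key/value pairs inside one term are ANDed.
// Real node selector terms support more operators than exact equality.
type pvAffinity []map[string]string

// affinityMatches reports whether the node labels satisfy at least one term.
// A PV without affinity places no restriction on the node.
func affinityMatches(aff pvAffinity, nodeLabels map[string]string) bool {
	if len(aff) == 0 {
		return true
	}
	for _, term := range aff {
		if len(term) == 0 {
			continue // an empty term matches nothing
		}
		matched := true
		for key, want := range term {
			if got, ok := nodeLabels[key]; !ok || got != want {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

// shouldRejectPod applies the rule from the text: reject only when the attach
// failed with a FailedPrecondition error AND the PV's affinity no longer
// matches the node the pod was scheduled to.
// Assumption: the error code is visible as a substring of the error message.
func shouldRejectPod(st attachStatus, aff pvAffinity, nodeLabels map[string]string) bool {
	if st.Attached {
		return false
	}
	if !strings.Contains(st.ErrorMessage, "FailedPrecondition") {
		return false
	}
	return !affinityMatches(aff, nodeLabels)
}
```

Requiring both conditions keeps the check conservative: an attach failure on a node that still satisfies the affinity, or a failure with any other error, leaves the pod untouched so that the normal attach retry path can proceed.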
318323

319324
### Test Plan
320325
