[Bug]: Pod deletion race condition #16
Description
When a pod is deleted, the placeholder Slurm job may be terminated as soon as the pod enters the Terminating state. This creates a window where the pod continues running until its terminationGracePeriodSeconds expires, while the associated Slurm resources have already been released.
There are two possible outcomes:
- New pods are scheduled onto a node that will not accept them, causing them to enter a Failed state.
- Slurm-native jobs are scheduled onto resources that the terminating pod is still using, which can crash both the new Slurm job and the terminating pod.
Steps to Reproduce
This was tested in a node pool with a single replica, but has been seen in larger node pools with pack_serial_at_end turned on.
Pod requirements:
- Does not terminate immediately on SIGTERM
- terminationGracePeriodSeconds is long enough for the pod to keep running well after deletion is requested
- Create pod1
- Delete pod1
- Create pod2
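A minimal pod spec satisfying these requirements might look like the following sketch. The image name and GPU count are taken from the event log below; the grace-period value and the SIGTERM-trapping command are illustrative assumptions, not the exact manifest used.

```yaml
# Hypothetical reproduction manifest (values assumed where not in the logs).
apiVersion: v1
kind: Pod
metadata:
  name: test-full-node-1
spec:
  terminationGracePeriodSeconds: 300   # long enough for the race window
  containers:
  - name: test-container
    image: frolvlad/alpine-glibc:latest
    # Ignore SIGTERM so the container keeps running for the full grace period.
    command: ["sh", "-c", "trap '' TERM; while true; do sleep 1; done"]
    resources:
      limits:
        nvidia.com/gpu: 8   # matches the "Requested: 8" in the event log
```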
Result:
```
$ kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
1s Normal AddedInterface pod/test-full-node-1 Add eth0 [10.245.88.166/32] from k8s-pod-network
0s Normal Pulling pod/test-full-node-1 Pulling image "frolvlad/alpine-glibc:latest"
0s Normal Pulled pod/test-full-node-1 Successfully pulled image "frolvlad/alpine-glibc:latest" in 1.558s (1.558s including waiting). Image size: 7922050 bytes.
0s Normal Created pod/test-full-node-1 Created container test-container
0s Normal Started pod/test-full-node-1 Started container test-container
0s Normal Killing pod/test-full-node-1 Stopping container test-container
0s Warning FailedScheduling pod/test-full-node-2 0/5 nodes are available: 2 node does not match annotation, 3 node(s) had untolerated taint(s).
0s Warning FailedScheduling pod/test-full-node-2 running PreFilter plugin "SlurmBridge": no nodes assigned to job
0s Normal Scheduled pod/test-full-node-2 Successfully assigned default/test-full-node-3 to gpu-dp-k66v9-58pk5
0s Warning UnexpectedAdmissionError pod/test-full-node-2 Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 8, Available: 0, which is unexpected
```
The important error is the final UnexpectedAdmissionError warning.
Expected Behavior
The placeholder job should fully overlap with the pod's lifecycle to prevent this issue: it should start before the pod and be released only after the pod has fully terminated.
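The bug and the fix come down to ordering. Below is a small, self-contained Python sketch — a toy model, not the actual slurm-bridge controller code; the names `node_busy`, `slurm_holds_node`, and `schedule` are invented for illustration — showing why releasing the Slurm allocation at delete-request time lets a new workload collide with the terminating pod, while releasing it after the pod exits does not:

```python
import threading
import time

# Two views of the same node: Slurm's bookkeeping vs. what is actually running.
node_busy = set()                      # reality: workloads still on the node
slurm_holds_node = threading.Event()   # Slurm's view: placeholder job alive

def run_pod(name, grace_period):
    """A pod that ignores SIGTERM and runs until its grace period expires."""
    node_busy.add(name)
    time.sleep(grace_period)
    node_busy.discard(name)

def schedule(name):
    """Mimic the admission check for a new workload on the node."""
    if slurm_holds_node.is_set():
        return "Pending"   # placeholder job still holds the node: safe
    if node_busy:
        return "Crash"     # Slurm freed the node, but a pod is still using it
    node_busy.add(name)
    return "Running"

# Buggy ordering: placeholder job cancelled as soon as the pod enters
# Terminating, before the container has exited.
slurm_holds_node.set()
pod1 = threading.Thread(target=run_pod, args=("pod1", 0.2))
pod1.start()
slurm_holds_node.clear()            # Slurm job terminated on delete request
buggy_result = schedule("pod2")     # pod1 is still running -> "Crash"
pod1.join()
node_busy.clear()

# Fixed ordering: placeholder job released only after the pod has exited.
slurm_holds_node.set()
pod1 = threading.Thread(target=run_pod, args=("pod1", 0.2))
pod1.start()
pod1.join()                         # wait for the pod to actually terminate
slurm_holds_node.clear()
fixed_result = schedule("pod2")     # no overlap -> "Running"
```

With the buggy ordering the second workload is admitted onto a node the first pod still occupies; with the fixed ordering the placeholder job's lifetime is a strict superset of the pod's, so no collision is possible.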