[Bug]: Pod deletion race condition #16

@cgetzen

Description

When a pod is deleted, the placeholder Slurm job may be terminated as soon as the pod enters the Terminating state. This creates a window where the pod continues running until its terminationGracePeriodSeconds expires, while the associated Slurm resources have already been released.

There are two possible outcomes:

  • New pods are scheduled onto a node that won't accept them, causing them to enter a Failed state.
  • Slurm-native jobs are scheduled onto resources that the terminating pod is still using, which can crash both the new Slurm job and the terminating pod.

Steps to Reproduce

This was tested in a node pool with a single replica, but has been seen in larger node pools with pack_serial_at_end turned on.

Pod requirements:

  • Doesn't terminate immediately on SIGTERM
  • terminationGracePeriodSeconds is long enough to keep the pod running well after deletion

Steps:

  1. Create pod1
  2. Delete pod1
  3. Create pod2
Result:

$ kubectl get events
LAST SEEN   TYPE      REASON                     OBJECT                           MESSAGE
1s          Normal    AddedInterface         pod/test-full-node-1             Add eth0 [10.245.88.166/32] from k8s-pod-network
0s          Normal    Pulling                pod/test-full-node-1             Pulling image "frolvlad/alpine-glibc:latest"
0s          Normal    Pulled                 pod/test-full-node-1             Successfully pulled image "frolvlad/alpine-glibc:latest" in 1.558s (1.558s including waiting). Image size: 7922050 bytes.
0s          Normal    Created                pod/test-full-node-1             Created container test-container
0s          Normal    Started                pod/test-full-node-1             Started container test-container
0s          Normal    Killing                pod/test-full-node-1             Stopping container test-container
0s          Warning   FailedScheduling       pod/test-full-node-2             0/5 nodes are available: 2 node does not match annotation, 3 node(s) had untolerated taint(s).
0s          Warning   FailedScheduling       pod/test-full-node-2             running PreFilter plugin "SlurmBridge": no nodes assigned to job
0s          Normal    Scheduled              pod/test-full-node-2             Successfully assigned default/test-full-node-3 to gpu-dp-k66v9-58pk5
0s          Warning   UnexpectedAdmissionError   pod/test-full-node-2             Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 8, Available: 0, which is unexpected

The important error is the final UnexpectedAdmissionError warning.

Expected Behavior

The placeholder job should fully overlap the pod's lifecycle: it starts before the pod starts and ends only after the pod has fully terminated (including its grace period), so Slurm resources are never released while the pod is still running.
