[Bug]: Pod deletion race condition #16

@cgetzen

Description

When a pod is deleted, the placeholder Slurm job may be terminated as soon as the pod enters the Terminating state. This creates a window where the pod continues running until its terminationGracePeriodSeconds expires, while the associated Slurm resources have already been released.

There are two possible outcomes:

  • New pods are scheduled onto a node that won't accept them, causing them to enter a Failed state.
  • Slurm-native jobs are scheduled onto resources that the terminating pod is still using, which can crash both the new Slurm job and the terminating pod.

Steps to Reproduce

This was tested in a node pool with a single replica, but has been seen in larger node pools with pack_serial_at_end turned on.

Pod requirements:

  • Doesn't terminate immediately on SIGTERM
  • terminationGracePeriodSeconds is long enough to keep the pod running well after deletion

Steps:

  1. Create pod1
  2. Delete pod1
  3. Create pod2
Result:

$ kubectl get events
LAST SEEN   TYPE      REASON                     OBJECT                           MESSAGE
1s          Normal    AddedInterface         pod/test-full-node-1             Add eth0 [10.245.88.166/32] from k8s-pod-network
0s          Normal    Pulling                pod/test-full-node-1             Pulling image "frolvlad/alpine-glibc:latest"
0s          Normal    Pulled                 pod/test-full-node-1             Successfully pulled image "frolvlad/alpine-glibc:latest" in 1.558s (1.558s including waiting). Image size: 7922050 bytes.
0s          Normal    Created                pod/test-full-node-1             Created container test-container
0s          Normal    Started                pod/test-full-node-1             Started container test-container
0s          Normal    Killing                pod/test-full-node-1             Stopping container test-container
0s          Warning   FailedScheduling       pod/test-full-node-2             0/5 nodes are available: 2 node does not match annotation, 3 node(s) had untolerated taint(s).
0s          Warning   FailedScheduling       pod/test-full-node-2             running PreFilter plugin "SlurmBridge": no nodes assigned to job
0s          Normal    Scheduled              pod/test-full-node-2             Successfully assigned default/test-full-node-3 to gpu-dp-k66v9-58pk5
0s          Warning   UnexpectedAdmissionError   pod/test-full-node-2             Allocate failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 8, Available: 0, which is unexpected

The important error is the final UnexpectedAdmissionError warning.

Expected Behavior

The placeholder job should fully overlap the pod's lifecycle: it starts before the pod starts and ends only after the pod has fully terminated (including its grace period), so Slurm resources are never released while the pod is still running.
