Commit d17079b

rbtr and tyler-lloyd authored
docs: proposal for async pod delete handling (#2138)
* docs: proposal for async pod delete handling

Signed-off-by: Evan Baker <[email protected]>

* Update docs/feature/async-delete/readme.md

Co-authored-by: Tyler Lloyd <[email protected]>
Signed-off-by: Evan Baker <[email protected]>

* reword based on PR feedback

Signed-off-by: Evan Baker <[email protected]>

---------

Signed-off-by: Evan Baker <[email protected]>
Co-authored-by: Tyler Lloyd <[email protected]>
1 parent 626e16c commit d17079b

3 files changed: +96 -0 lines changed

docs/feature/async-delete/cni.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# CNI Async Delete

```mermaid
sequenceDiagram
    participant CRI
    participant CNI
    participant CNS
    CRI->>+CNI: Delete Pod
    CNI->>+CNS: Release IP
    alt CNS Responds
        alt IP Released
            CNS->>CNI: Released IP
            CNI->>CRI: Clean up Pod
        else Error response
            CNS->>CNI: Error
            CNI->>CRI: Delete failed, retry
        end
    else CNS unresponsive
        CNS->>-CNI: [No response]
        CNI->>Filesystem queue: Write delete Pod intent
        Filesystem queue->>CNI: 
        CNI->>-CRI: Clean up Pod
    end
```

docs/feature/async-delete/cns.md

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
# CNS Async Delete

#### Components

```mermaid
sequenceDiagram
    participant CNI
    participant Filesystem queue
    participant CNS
    loop
        CNS->>Filesystem queue: List-watch for Pod deletes
        Filesystem queue->>CNS: 
    end
    CNI->>+CNS: Release IP
    alt CNS Responds
        alt IP Released
            CNS->>CNI: Released IP
        else Error response
            CNS->>CNI: Error
        end
    else CNS unresponsive
        CNS->>-CNI: [No response]
        CNI->>Filesystem queue: Write delete Pod intent
    end
```

#### CNS Internals

```mermaid
sequenceDiagram
    participant CNI
    participant FS Watcher
    participant Release IP API
    participant IPAM
    loop
        FS Watcher->>FS Watcher: List-watch for Pod deletes
    end
    alt Async delete events
        FS Watcher->>+Release IP API: Release IP
    else Sync delete events
        CNI->>Release IP API: Release IP
    end
    Release IP API->>+IPAM: Release IP
    alt IP Released
        IPAM->>Release IP API: Released IP
    else Error response
        IPAM->>-Release IP API: Error
    end
```

docs/feature/async-delete/readme.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
## Async Delete

### Introduction

In AKS with Azure CNI, the Azure CNS service manages IPAM for the CNI. The `azure-vnet` CNI plugin (and any CNI using delegated IPAM through `azure-ipam`) calls the CNS API to request an IP during Pod creation and to release it during Pod deletion. The CNS API is synchronous: a request does not complete until the IP has been allocated or released in the CNS internal IPAM state.

A deadlock is possible when the CNS API is unavailable (during a daemonset rollout or for any other reason):
If the Node is fully saturated with Pods (scheduled Pods == maxPods) and no CNS Pod is present (for example, a CNS daemonset rollout _deletes_ the existing Pod, then schedules the upgraded Pod), the scheduler will attempt to preempt a low-priority Pod to make room for the CNS Pod. However, with no CNS Pod running, the CNI delete call will fail, and the preempted Pod will be stuck in the `Terminating` state because the CRI cannot clean up the netns. The scheduler will then be unable to schedule the CNS Pod, and the Node will stay deadlocked until someone manually reduces the Pod pressure.

### Proposal

To address this deadlock, the CNI calls to CNS that release a Pod's IP address need to be made asynchronously, with a failsafe so that if CNS is unavailable, it can recover these release events when it eventually starts.

### Design

The CNI plugins (`azure-vnet`, `azure-ipam`) will be modified to treat a non-response from CNS during IP release as a non-fatal error, and execution will proceed. An explicit error response from CNS will still be treated as a real error and returned to the CRI for retry.
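
The three outcomes described above can be sketched in Go as follows. This is an illustration only, not the plugin code: the `Outcome` type, `classify`, `releaseIP`, the HTTP transport, and the placeholder URL are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Outcome is a hypothetical classification of one IP release attempt.
type Outcome int

const (
	Released      Outcome = iota // CNS acknowledged the release
	ErrorResponse                // CNS answered with an error: return it to the CRI for retry
	NoResponse                   // CNS never answered: non-fatal, continue Pod cleanup
)

func (o Outcome) String() string {
	return [...]string{"Released", "ErrorResponse", "NoResponse"}[o]
}

// errNoResponse stands in for any transport-level failure (timeout, connection refused).
var errNoResponse = errors.New("no response from CNS")

// classify maps the result of a release call onto the three branches of the design.
func classify(statusCode int, err error) Outcome {
	switch {
	case err != nil:
		return NoResponse // the request never produced a response
	case statusCode == http.StatusOK:
		return Released
	default:
		return ErrorResponse // CNS was reachable and reported a failure
	}
}

// releaseIP posts to an assumed HTTP release endpoint and classifies the result.
func releaseIP(client *http.Client, url string) Outcome {
	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		return classify(0, err)
	}
	defer resp.Body.Close()
	return classify(resp.StatusCode, nil)
}

func main() {
	fmt.Println(classify(http.StatusOK, nil))
	fmt.Println(classify(http.StatusInternalServerError, nil))
	fmt.Println(classify(0, errNoResponse))

	// Placeholder address: nothing is expected to listen here, so this call
	// normally exercises the "CNS unresponsive" branch.
	client := &http.Client{Timeout: 200 * time.Millisecond}
	fmt.Println(releaseIP(client, "http://127.0.0.1:1/release"))
}
```

The point of the split is that only `ErrorResponse` is surfaced to the CRI; `NoResponse` lets the delete continue, which is what breaks the deadlock.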

If the Pod IP release was not acknowledged by CNS, the CNI plugins will fall back to a filesystem-backed queue to save these events. When the CNI does not get a response, it will write the Container ID to a "release queue" directory/file and proceed with cleaning up the Pod netns.
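
A minimal sketch of the "release queue" write, assuming one empty file per Container ID inside a queue directory. The proposal leaves the directory/file layout open, so the layout, the function names `enqueueRelease` and `pending`, and the ID validation are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// enqueueRelease records a delete intent by creating an empty file named
// after the container ID (assumed layout: one file per pending release).
// Creating the same ID twice is harmless, so a retried delete stays safe.
func enqueueRelease(queueDir, containerID string) error {
	// Reject IDs that are empty or would escape the queue directory.
	if containerID == "" || containerID == "." || containerID == ".." ||
		containerID != filepath.Base(containerID) {
		return fmt.Errorf("invalid container ID %q", containerID)
	}
	if err := os.MkdirAll(queueDir, 0o755); err != nil {
		return err
	}
	f, err := os.OpenFile(filepath.Join(queueDir, containerID), os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	return f.Close()
}

// pending lists the container IDs currently waiting in the queue.
func pending(queueDir string) ([]string, error) {
	entries, err := os.ReadDir(queueDir)
	if err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(entries))
	for _, e := range entries {
		ids = append(ids, e.Name())
	}
	return ids, nil
}

func main() {
	dir, err := os.MkdirTemp("", "release-queue")
	if err != nil {
		fmt.Println("temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	if err := enqueueRelease(dir, "abc123"); err != nil {
		fmt.Println("enqueue:", err)
		return
	}
	ids, err := pending(dir)
	fmt.Println(ids, err)
}
```

Using the file name as the record keeps the write to a single create call, so a crash mid-write cannot leave a half-written entry behind.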

When CNS starts, it will create a watch on the "release queue" directory/file and process the Pod IDs in the queue. IPs for those Pods will then be released in CNS IPAM state.
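
The startup pass over the queue could look roughly like the sketch below. It covers only the initial "list" step; the ongoing watch (polling or filesystem notifications) is left out. The names `drainQueue` and `seed`, the one-file-per-ID layout, and the `release` callback standing in for the CNS IPAM release are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// drainQueue lists the queued IDs, calls release for each one, and removes an
// entry only after its release succeeded. Failed entries stay queued so a
// later pass can retry them. It returns the number of entries released.
func drainQueue(queueDir string, release func(id string) error) (int, error) {
	entries, err := os.ReadDir(queueDir)
	if err != nil {
		if os.IsNotExist(err) {
			return 0, nil // no queue directory yet means nothing is pending
		}
		return 0, err
	}
	released := 0
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		id := e.Name()
		if err := release(id); err != nil {
			continue // keep the entry for the next pass
		}
		if err := os.Remove(filepath.Join(queueDir, id)); err != nil {
			return released, err
		}
		released++
	}
	return released, nil
}

// seed creates empty queue entries; it exists only to make the demo runnable.
func seed(queueDir string, ids ...string) error {
	if err := os.MkdirAll(queueDir, 0o755); err != nil {
		return err
	}
	for _, id := range ids {
		if err := os.WriteFile(filepath.Join(queueDir, id), nil, 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "release-queue")
	if err != nil {
		fmt.Println("temp dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	if err := seed(dir, "pod-a", "pod-b"); err != nil {
		fmt.Println("seed:", err)
		return
	}
	n, err := drainQueue(dir, func(id string) error {
		fmt.Println("releasing IP for", id) // stand-in for the IPAM release
		return nil
	})
	fmt.Println(n, err)
}
```

Removing the entry only after a successful release means a CNS crash mid-pass leaves the remaining work in the queue for the next start.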

This allows the CNI to tolerate CNS unavailability, which unwedges Pod deletion and lets the scheduler start the CNS Pod so the Node returns to steady state.
