Commit 2b97c69
[v0.9.1][Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2085)
### What this PR does / why we need it?
This PR addresses a critical issue where a failure on Node D (decode)
causes Node P (prefill) to hang because it cannot release its KV cache.
**Trigger Scenarios:**
1. Node D fails mid-inference (e.g., network disconnection)
2. Node D rejects requests at a certain stage (e.g., via API server)
3. Load-test script termination causes Node P or D to abort queued
requests
**Root Cause Analysis:**
1. Currently, Node D sends a "KV cache pull complete, release approved"
message to Node P
2. This message is transmitted via the worker connector. If the PD
connection breaks or requests are rejected upstream, Node D cannot send
the message
3. Node P never releases the KV cache unless it receives this message
**Solution:**
Following the vLLM community's approach (the NIXL connector timeout
mechanism), we're implementing:
- A timeout mechanism with comprehensive warnings
- Updated README documentation
- Reference: VLLM's optimization PR
[#20139](vllm-project/vllm#20139)
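
A minimal sketch of the timeout idea, for illustration only. It is not the
code in this commit: `ProducerKVTracker`, `timeout_s`, and every method name
below are hypothetical. It assumes Node P records a deadline when a request
finishes prefill, drops the entry when the pull notification arrives, and
otherwise releases the KV cache with a warning once the deadline passes.

```python
import logging
import time
from typing import Callable, Dict, List

logger = logging.getLogger("kv_timeout_sketch")


class ProducerKVTracker:
    """Hypothetical producer-side bookkeeping for KV cache awaiting a pull."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        # request_id -> time after which the KV cache is released anyway
        self._deadlines: Dict[str, float] = {}

    def on_prefill_finished(self, request_id: str) -> None:
        """Start the timer once the KV cache is ready to be pulled."""
        self._deadlines[request_id] = self._clock() + self.timeout_s

    def on_pull_notification(self, request_id: str) -> bool:
        """Normal path: Node D reports the pull is complete.

        Returns False if the request is unknown or already expired.
        """
        return self._deadlines.pop(request_id, None) is not None

    def collect_expired(self) -> List[str]:
        """Fallback path: return requests whose notification never arrived."""
        now = self._clock()
        expired = [rid for rid, deadline in self._deadlines.items()
                   if deadline <= now]
        for rid in expired:
            del self._deadlines[rid]
            logger.warning(
                "No pull notification for request %s within %.1fs; "
                "releasing its KV cache", rid, self.timeout_s)
        return expired
```

In this sketch the caller would invoke `collect_expired()` periodically (for
example once per scheduling step) and free the blocks of every returned
request. A late notification for an already expired request returns `False`,
so the caller can ignore it instead of freeing the same blocks twice.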
**Note:** The full disaster recovery solution is still in design. This
PR is merged into the v091-dev branch as a simple fix; the approach will
continue to evolve in main ([PR #2174](#2174)).
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: underfituu <[email protected]>

1 parent 741a8cf commit 2b97c69
2 files changed, +36 -2 lines changed

File tree
- vllm_ascend
  - distributed