[Frontend] Enable drain shutdown mode for non-DP deployments #32420
wseaton wants to merge 56 commits into vllm-project:main from
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces a graceful shutdown mechanism, which is a crucial feature for production deployments. The implementation is thorough, considering not only in-flight requests but also pending asynchronous KV cache transfers. The changes are well-structured, with clear separation of concerns between the launcher, engine, and scheduler. The addition of readiness and liveness probes (/health and /live) is a great touch and follows best practices for services running on platforms like Kubernetes. The related tests for the new scheduler logic are comprehensive and correctly validate the new functionality. Overall, this is a high-quality contribution that significantly improves the robustness of the vLLM server.
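To make the probe behaviour concrete, here is a minimal sketch assuming a FastAPI-style app (the wiring and the `set_server_draining` helper here are illustrative, not the PR's actual code):

```python
from fastapi import FastAPI, Response

app = FastAPI()
_draining = False


def set_server_draining() -> None:
    """Illustrative hook; the PR flips an equivalent flag from its signal handler."""
    global _draining
    _draining = True


@app.get("/live")
async def live() -> Response:
    # Liveness: the process is up, even while draining.
    return Response(status_code=200)


@app.get("/health")
async def health() -> Response:
    # Readiness: report 503 while draining so load balancers stop
    # routing new requests to this pod.
    return Response(status_code=503 if _draining else 200)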
async def graceful_signal_handler() -> None:
    """Async wrapper for graceful shutdown."""
    await graceful_drain()
    signal_handler()
Graceful shutdown may fail to terminate server on exception
High Severity
The graceful_signal_handler() function calls await graceful_drain() before signal_handler(), but lacks a try-finally block. If graceful_drain() raises an exception before its inner try block (lines 98-114 access engine_client.model_config, call get_num_unfinished_requests(), and call set_server_draining()), signal_handler() is never invoked. This causes the server to hang indefinitely on SIGTERM/SIGINT instead of shutting down.
Additional Locations (1)
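A minimal sketch of the kind of fix being suggested, reusing the helpers from the quoted snippet (`graceful_drain`, `signal_handler`); this is illustrative, not the PR's final code:

```python
async def graceful_signal_handler() -> None:
    """Async wrapper for graceful shutdown."""
    try:
        # Drain in-flight requests before tearing the server down.
        await graceful_drain()
    finally:
        # Always reach the normal shutdown path, even if graceful_drain()
        # raises before entering its own try block.
        signal_handler()
```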
Needs a rebase, an integration test, and a docs update; should be able to get to those today.
Documentation preview: https://vllm--32420.org.readthedocs.build/en/32420/
Hi @wseaton, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
markmc left a comment
Great stuff, Will 👍
High-level comments ...
Thinking about terminology, we’re introducing two non-obvious (and not obviously related!) terms here - “graceful” and “drain”. Also, “drain timeout” isn’t obviously related to shutdown.
Also related, adding this as a new “shutdown mode” would give more flexibility for other future options. So, I suggest:
--shutdown-mode <immediate|drain> --shutdown-drain-timeout <default:120s>
Unless I’m missing something, the HTTP 503 return on /health is a good start for e.g. load-balancers to detect draining pods etc. - I’d like to remove the Prometheus metric from this PR, and consider it separately. I’m not totally against it, it’s just that it has the potential to be a rabbit hole in its own right.
Also, expand the PR description with more background please 👍
Caveat - I want to think some more about the engine core signal handling, and the death pipe stuff. In general, I don’t think the child processes of the frontend API server should respond to SIGINT/SIGTERM at all; instead, let only the parent control the child lifecycle. And I think that’s true in the library case too.
I am torn on the sole use of the death pipe; I think it is cleaner to add IPC signaling and use the existing control channel we have, since this tells us that the parent process is triggering the shutdown of its own volition vs. a crash. We still need the death pipe as a fallback for if the parent process crashes, but overall this is cleaner.
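For context, a rough sketch of the death-pipe fallback pattern under discussion (names such as `parent_death_watcher` are illustrative, not identifiers from this PR): the child holds one end of a pipe whose other end lives only in the parent, so parent death closes the pipe and wakes the child even when no explicit shutdown message was sent.

```python
import os
import threading
from multiprocessing import Pipe, Process


def parent_death_watcher(conn) -> None:
    """Block until the parent's end of the pipe closes, then hard-exit."""
    try:
        conn.recv()  # The parent never sends; EOFError means the parent died.
    except EOFError:
        pass
    os._exit(1)  # Fallback only; orderly shutdown comes via the control channel.


def child_main(conn) -> None:
    threading.Thread(target=parent_death_watcher, args=(conn,), daemon=True).start()
    ...  # engine busy loop would run here


if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    proc = Process(target=child_main, args=(child_conn,))
    proc.start()
    child_conn.close()  # Parent drops its copy; only parent_conn keeps the pipe alive.
    proc.join()
```

In that framing, an explicit DRAIN/shutdown message over the existing control channel would be the primary path, with the EOF-based wakeup only covering a parent crash.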
…mq asserts, better comments Co-authored-by: Mark McLoughlin <[email protected]> Signed-off-by: Will Eaton <[email protected]>
There's no reason to use getattr() or repeat the default values for
args, since they will always be present with the default specified
in their declaration.
Fixes:
```
File "/home/markmc/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 338, in validate_parsed_serve_args
if getattr(args, "api_server_count", 1) > 1:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'NoneType' and 'int'
```
Signed-off-by: Mark McLoughlin <[email protected]>
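In other words (sketch only; `api_server_count` is assumed to be declared with a default on the parser, as the commit message says):

```python
# Before: getattr() duplicates the default and still breaks when the
# attribute exists but is set to None.
if getattr(args, "api_server_count", 1) > 1:
    ...

# After: the attribute is always present with its declared default,
# so reference it directly.
if args.api_server_count > 1:
    ...
```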
…en TP>1 by isolating process with setpgrp()
Signed-off-by: Will Eaton <[email protected]>
engine_core: EngineCoreProc | None = None
if shutdown_pipe is not None:
    # New process group so terminal Ctrl-C doesn't kill workers.
Not just the workers, the engine is also in this new process group
engine_core: EngineCoreProc | None = None
if shutdown_pipe is not None:
    # New process group so terminal Ctrl-C doesn't kill workers.
    os.setpgrp()
This would also seem appropriate to do in the API server where num_api_servers > 1 but I guess it's not as big a deal with the immediate shutdown mode (and we don't yet support multiple API servers with the drain shutdown mode)
    # New process group so terminal Ctrl-C doesn't kill workers.
    os.setpgrp()
else:
    signal.signal(signal.SIGINT, signal_handler)
There's actually no case where shutdown_pipe is None right? (But we allow it in the type hint)
This is correct currently yes
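A small, self-contained sketch of the process-group behaviour being discussed (illustrative, not this PR's code): the terminal delivers Ctrl-C's SIGINT to the whole foreground process group, so a child that calls os.setpgrp() never sees it directly, and the parent alone decides how the child shuts down.

```python
import os
import time
from multiprocessing import Process


def engine_proc() -> None:
    # Leave the terminal's foreground process group, so Ctrl-C delivered
    # to that group never reaches this process directly.
    os.setpgrp()
    while True:
        time.sleep(1)


if __name__ == "__main__":
    child = Process(target=engine_proc)
    child.start()
    try:
        child.join()
    except KeyboardInterrupt:
        # Only the parent saw SIGINT; it chooses when the child dies
        # (immediately here, after a drain in this PR's model).
        child.terminate()
        child.join()
```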
Future follow-ups:
From an earlier comment:
I've investigated the
With some effort, I was able to reproduce the following test passing on main and failing with this PR:
The first test passes, and the second fails at engine startup with:
and a
To reproduce locally, I had to:
Thanks @markmc for the test reproduction, I'll take a look tomorrow EST and see if I can't narrow it down.
Using VllmRunner in a @create_new_process_for_each_test("fork") test,
the engine core was not being properly cleaned up because the GC
doesn't have time to collect the deleted LLM object, and the
kill(TERM) to the process group in fork_new_process_for_each_test()
no longer works because the engine is now in its own process group.
Instead, let's add an explicit shutdown() method to LLM and
invoke it to trigger engine cleanup.
Signed-off-by: Mark McLoughlin <[email protected]>
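A sketch of the test-side usage this commit describes (model name and test body are illustrative; `shutdown()` is the explicit cleanup method the commit adds to `LLM`):

```python
from vllm import LLM


def test_generate_in_forked_process():
    # Runs under @create_new_process_for_each_test("fork") in the real suite.
    llm = LLM(model="facebook/opt-125m")
    try:
        llm.generate(["Hello, my name is"])
    finally:
        # Explicitly tear down the engine core rather than relying on the GC
        # to collect the LLM before the forked test process exits.
        llm.shutdown()
```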
It looks like we've been relying on (in forked tests) a
Thanks for all of the work on this @wseaton! I'm still digesting this fully but have a few initial concerns/thoughts:
I was wondering whether we could use the same mechanism as in #34125 to handle draining within the engine. So the flow could be like:
We should make this impossible, maybe not directly part of this PR but child procs should immediately exit if parent dies first.
I am a bit concerned about adding explicit shutdown to the tests because I've found that in the past this masks GC issues, usually some circular ref which is preventing implicit GC/shutdown. For example, we ideally want vLLM to clean itself up if the
shutdown_mode: Literal["immediate", "drain"] = "immediate"
"""Shutdown mode: 'immediate' exits immediately on SIGTERM (default),
'drain' waits for in-flight requests to complete."""
shutdown_drain_timeout: int = 120
"""Seconds to wait for in-flight requests to complete during drain
shutdown mode."""
Could we just have a single drain timeout arg? (0 means immediate)
Just thinking aloud but another possibility is to call it shutdown_grace_period
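A sketch of what the single-argument alternative could look like (field name per the suggestion above; everything else here is illustrative, not the PR's config class):

```python
from dataclasses import dataclass


@dataclass
class FrontendShutdownConfig:
    shutdown_grace_period: int = 0
    """Seconds to wait for in-flight requests to finish on SIGTERM.
    0 keeps today's behavior: shut down immediately."""

    @property
    def drains_on_shutdown(self) -> bool:
        return self.shutdown_grace_period > 0
```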
Supersedes #32334
This is an attempt at an implementation of the RFC by @markmc for graceful shutdown (#24885) that does not change default behavior.
Sequence diagram showing the drain shutdown flow:
sequenceDiagram
    participant Client as Client Requests
    participant API as API Server
    participant Launcher as Signal Handler
    participant Core as EngineCore
    Note over Client,Core: Normal Operation
    Client->>API: Incoming requests
    API->>Core: Forward to engine
    Note over Client,Core: First SIGTERM - Graceful Shutdown
    Launcher->>Launcher: SIGTERM received
    Launcher->>API: set_rejecting_requests(True)
    API--xClient: 503 Service Unavailable
    Launcher->>Core: DRAIN via shutdown pipe
    Core->>Core: _drain_requested = True
    loop Busy loop continues
        Core->>Core: Process remaining requests
    end
    Note over Core: scheduler.has_requests() = False
    Core->>Core: "Drain complete, exiting"
    Core->>Core: Process exits
    Launcher->>Launcher: engine_dead detected
    Launcher->>API: Cancel server task
    Note over Client,Core: Clean shutdown
    Note over Client,Core: Second SIGTERM - Immediate Shutdown
    rect rgb(255, 230, 230)
        Launcher->>Launcher: SIGTERM while draining
        Launcher->>API: Cancel server task immediately
    end
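For illustration, a structural sketch of the EngineCore side of that flow (names like `_drain_requested` and `scheduler.has_requests()` come from the diagram; the surrounding class shape is assumed, not the PR's actual implementation):

```python
import logging

logger = logging.getLogger(__name__)


class EngineCoreSketch:
    """Structural sketch only; scheduler and step() stand in for the real engine."""

    def __init__(self, scheduler) -> None:
        self.scheduler = scheduler
        self._drain_requested = False

    def handle_shutdown_message(self, msg: str) -> None:
        if msg == "DRAIN":
            # Stop taking new work; keep stepping until in-flight requests
            # (and any pending async KV transfers) are finished.
            self._drain_requested = True

    def run_busy_loop(self) -> None:
        while True:
            if self._drain_requested and not self.scheduler.has_requests():
                logger.info("Drain complete, exiting")
                return
            self.step()  # execute one scheduling/model step

    def step(self) -> None:
        ...
```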