
[Frontend] Enable drain shutdown mode for non-DP deployments#32420

Open
wseaton wants to merge 56 commits into vllm-project:main from wseaton:kv-transfer-drain

Conversation

@wseaton
Contributor

@wseaton wseaton commented Jan 15, 2026

Supersedes #32334

This is an attempt at an implementation of the graceful shutdown RFC by @markmc (#24885) that does not change default behavior.

Sequence diagram showing the drain shutdown flow:

sequenceDiagram
    participant Client as Client Requests
    participant API as API Server
    participant Launcher as Signal Handler
    participant Core as EngineCore

    Note over Client,Core: Normal Operation
    Client->>API: Incoming requests
    API->>Core: Forward to engine

    Note over Client,Core: First SIGTERM - Graceful Shutdown
    Launcher->>Launcher: SIGTERM received
    Launcher->>API: set_rejecting_requests(True)
    API--xClient: 503 Service Unavailable
    
    Launcher->>Core: DRAIN via shutdown pipe
    Core->>Core: _drain_requested = True
    loop Busy loop continues
        Core->>Core: Process remaining requests
    end
    Note over Core: scheduler.has_requests() = False
    Core->>Core: "Drain complete, exiting"
    Core->>Core: Process exits
    Launcher->>Launcher: engine_dead detected

    Launcher->>API: Cancel server task
    Note over Client,Core: Clean shutdown

    Note over Client,Core: Second SIGTERM - Immediate Shutdown
    rect rgb(255, 230, 230)
        Launcher->>Launcher: SIGTERM while draining
        Launcher->>API: Cancel server task immediately
    end

@wseaton wseaton changed the title [2/x][Frontend] Also consider async kv transfers in flight for graceful drain decision [2/x][Frontend] Draft: Also consider async kv transfers in flight for graceful drain decision Jan 15, 2026
@mergify

mergify bot commented Jan 15, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wseaton.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 15, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a graceful shutdown mechanism, which is a crucial feature for production deployments. The implementation is thorough, considering not only in-flight requests but also pending asynchronous KV cache transfers. The changes are well-structured, with clear separation of concerns between the launcher, engine, and scheduler. The addition of readiness and liveness probes (/health and /live) is a great touch and follows best practices for services running on platforms like Kubernetes. The related tests for the new scheduler logic are comprehensive and correctly validate the new functionality. Overall, this is a high-quality contribution that significantly improves the robustness of the vLLM server.


@cursor cursor bot left a comment



async def graceful_signal_handler() -> None:
"""Async wrapper for graceful shutdown."""
await graceful_drain()
signal_handler()

Graceful shutdown may fail to terminate server on exception

High Severity

The graceful_signal_handler() function calls await graceful_drain() before signal_handler(), but lacks a try-finally block. If graceful_drain() raises an exception before its inner try block (lines 98-114 access engine_client.model_config, call get_num_unfinished_requests(), and call set_server_draining()), signal_handler() is never invoked. This causes the server to hang indefinitely on SIGTERM/SIGINT instead of shutting down.
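
One possible shape of the guard being described here, reusing the names from the snippet above (a sketch only, not necessarily the fix adopted in this PR):

```
async def graceful_signal_handler() -> None:
    """Async wrapper for graceful shutdown."""
    try:
        await graceful_drain()
    finally:
        # Even if the drain path raises, still tell the server to shut down.
        signal_handler()
```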


@wseaton wseaton changed the title [2/x][Frontend] Draft: Also consider async kv transfers in flight for graceful drain decision [Frontend] Enable graceful scaledown Jan 19, 2026
@wseaton
Contributor Author

wseaton commented Jan 19, 2026

Needs a rebase, an integration test, and a docs update; should be able to get to those today

@mergify

mergify bot commented Jan 19, 2026

Documentation preview: https://vllm--32420.org.readthedocs.build/en/32420/

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 19, 2026
@mergify mergify bot removed the needs-rebase label Jan 19, 2026
@mergify

mergify bot commented Jan 19, 2026

Hi @wseaton, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Member

@markmc markmc left a comment


Great stuff, Will 👍

High-level comments ...

Thinking about terminology, we’re introducing two non-obvious (and not obviously related!) terms here - “graceful” and “drain”. Also, drain timeout isn’t obviously related to shutdown.

Also related, adding this as a new “shutdown mode” would give more flexibility for other future options. So, I suggest:

--shutdown-mode <immediate|drain> --shutdown-drain-timeout <default:120s>
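
For illustration, a hedged sketch of how such flags could be declared with plain argparse (the option names follow the suggestion above; this is not the PR's actual CLI code):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--shutdown-mode",
    choices=["immediate", "drain"],
    default="immediate",
    help="How SIGTERM is handled: exit right away, or drain in-flight requests first.",
)
parser.add_argument(
    "--shutdown-drain-timeout",
    type=int,
    default=120,
    help="Seconds to wait for in-flight requests to finish in drain mode.",
)
```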

Unless I’m missing something, the HTTP 503 return on /health is a good start for e.g. load-balancers to detect draining pods etc. - I’d like to remove the Prometheus metric from this PR, and consider it separately. I’m not totally against it, it’s just that it has the potential to be a rabbit hole in its own right.

Also, expand the PR description with more background please 👍

Caveat - I want to think some more about the engine core signal handling, and the death pipe stuff. In general, I don’t think the child processes of the frontend API server should respond to SIGINT/SIGTERM at all, instead let only the parent control the child lifecycle. And I think that’s also true in the library case too.

@wseaton
Contributor Author

wseaton commented Jan 22, 2026

Caveat - I want to think some more about the engine core signal handling, and the death pipe stuff. In general, I don’t think the child processes of the frontend API server should respond to SIGINT/SIGTERM at all, instead let only the parent control the child lifecycle. And I think that’s also true in the library case too.

I am torn on the sole use of the death pipe; I think it is cleaner to add IPC signaling and use the existing control channel that we have, since this tells us that the parent process is triggering the shutdown of its own volition vs. a crash.

We still need the death pipe as a fallback in case the parent process crashes, but overall this is cleaner.

wseaton and others added 12 commits February 9, 2026 20:45
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
…mq asserts, better comments

Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
There's no reason to use getattr() or repeat the default values for
args, since they will always be present with the default specified
in their declaration.

Fixes:

```
File "/home/markmc/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 338, in validate_parsed_serve_args
  if getattr(args, "api_server_count", 1) > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'NoneType' and 'int'
```

Signed-off-by: Mark McLoughlin <[email protected]>
…en TP>1 by isolating process with setpgrp()

Signed-off-by: Will Eaton <[email protected]>
Member

@markmc markmc left a comment


Thanks, it definitely seems to be behaving better with Ctrl-C now


engine_core: EngineCoreProc | None = None
if shutdown_pipe is not None:
# New process group so terminal Ctrl-C doesn't kill workers.
Member


Not just the workers, the engine is also in this new process group

engine_core: EngineCoreProc | None = None
if shutdown_pipe is not None:
# New process group so terminal Ctrl-C doesn't kill workers.
os.setpgrp()
Member


This would also seem appropriate to do in the API server where num_api_servers > 1 but I guess it's not as big a deal with the immediate shutdown mode (and we don't yet support multiple API servers with the drain shutdown mode)

# New process group so terminal Ctrl-C doesn't kill workers.
os.setpgrp()
else:
signal.signal(signal.SIGINT, signal_handler)
Member


There's actually no case where shutdown_pipe is None right? (But we allow it in the type hint)

Contributor Author


This is correct currently yes
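
For context, a minimal standalone sketch (plain Python, Unix-only, not vLLM code) of the process-group isolation discussed in this thread: the child detaches into its own process group, so a terminal Ctrl-C (which signals the whole foreground group) never reaches it, and only the parent controls when it exits.

```
import os
import signal
import time
from multiprocessing import get_context


def child_proc() -> None:
    # Detach from the terminal's foreground process group; a Ctrl-C typed in
    # the terminal (SIGINT to the foreground group) no longer reaches us.
    os.setpgrp()
    signal.pause()  # idle until the parent terminates us explicitly


if __name__ == "__main__":
    proc = get_context("spawn").Process(target=child_proc)
    proc.start()
    time.sleep(1)     # a Ctrl-C during this window only hits the parent
    proc.terminate()  # the parent deliberately ends the child's lifecycle
    proc.join()
```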

@markmc
Member

markmc commented Feb 10, 2026

Future follow-ups:

  • --api-server-count > 1 drain support
  • --data-parallel-size > 1 drain support
  • The FIXME in run_headless() - using a pipe to break out of join_first() when a signal is received by the parent
  • Put the API servers in their own process group in the multi-api-server case, so Ctrl-C only signals the parent
  • Revisit vllm:server_draining gauge if you think there's a strong case for it

From an earlier comment:

In the multi-api-server scenario, I think I expect (a rough sketch follows the list):

  1. Top-level parent process receives SIGTERM
  2. Parent instructs API servers to reject new work
  3. Parent instructs engine processes to drain and shutdown
  4. Parent waits for the engines to terminate
  5. Parent instructs the API server processes to shutdown
  6. Parent exits when all API servers have terminated
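
A rough sketch of that six-step coordination; the api_servers / engines handles and their methods (set_rejecting_requests, send_drain, and so on) are hypothetical placeholders, not existing vLLM APIs:

```
import time


def parent_sigterm_handler(api_servers, engines, drain_timeout: float = 120.0) -> None:
    # 2. Tell API servers to stop accepting new work (e.g. return 503).
    for api in api_servers:
        api.set_rejecting_requests(True)
    # 3. Ask each engine to drain in-flight requests and then shut down.
    for eng in engines:
        eng.send_drain()
    # 4. Wait for the engines to terminate, bounded by the drain timeout.
    deadline = time.monotonic() + drain_timeout
    for eng in engines:
        eng.join(timeout=max(0.0, deadline - time.monotonic()))
        if eng.is_alive():
            eng.terminate()  # drain timed out; fall back to immediate shutdown
    # 5. Only now shut the API server processes down.
    for api in api_servers:
        api.shutdown()
    # 6. The parent returns and exits once everything has terminated.
```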

@markmc markmc changed the title [Frontend] Enable drain scaledown mode for single process deployments [Frontend] Enable drain shutdown mode for non-DP deployments Feb 10, 2026
@markmc
Member

markmc commented Feb 10, 2026

@njhill asked:

Could we hold off ~1 more day? One thing is that I am helping with the pausing-for-RL-update PR #34125 and want to check for synergies w.r.t. waiting for running reqs to finish

@markmc
Member

markmc commented Feb 11, 2026

I've investigated the basic-models-tests-other test failure

With some effort, I was able to reproduce the following test passing on main and failing with this PR:

pytest -v -s tests/models/test_terratorch.py::test_inference[ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11] tests/models/test_terratorch.py::test_inference[mgazz/Prithvi_v2_eo_300_tl_unet_agb]

The first test passes, and the second fails at engine startup with:

(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     self.driver_worker.init_device()
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/worker/worker_base.py", line 322, in init_device
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/worker/gpu_worker.py", line 253, in init_device
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/worker/utils.py", line 102, in request_memory
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     raise ValueError(
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059] ValueError: Free memory on device cuda:0 (33.8/39.38 GiB) on startup is less than desired GPU memory utilization (0.9, 35.44 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

and a VLLM::EngineCore is left hanging around holding GPU memory

To reproduce locally, I had to:

  • pip install -U 'terratorch @ git+https://github.com/IBM/terratorch.git@07184fcf91a1324f831ff521dd238d97fe350e3e'
  • pip uninstall xformers - having this installed was causing "We must use spawn... Reasons: CUDA is initialized" and the issue didn't reproduce
  • I always need [Test] Fix pytest termination with @create_new_process_for_each_test("fork") #29130 to make forked tests work on my machine; I've never figured out exactly why

@wseaton
Contributor Author

wseaton commented Feb 12, 2026

Thanks @markmc for the test reproduction, I'll take a look tomorrow EST and see if I can't narrow it down.

When using VllmRunner in a @create_new_process_for_each_test("fork") test,
the engine core was not being properly cleaned up: the GC doesn't have
time to collect the deleted LLM object, and the kill(TERM) to the process
group in fork_new_process_for_each_test() no longer works because the
engine is now in its own process group.

Instead, let's add an explicit shutdown() method to LLM and invoke it to
trigger engine cleanup.

Signed-off-by: Mark McLoughlin <[email protected]>
@markmc
Member

markmc commented Feb 12, 2026

It looks like we've been relying on (in forked tests) a SIGTERM to the engine core in order to clean up. I've attempted to fix with:

    [LLMEngine] Add shutdown() method for explicit shutdown

    When using VllmRunner in a @create_new_process_for_each_test("fork") test,
    the engine core was not being properly cleaned up: the GC doesn't have
    time to collect the deleted LLM object, and the kill(TERM) to the process
    group in fork_new_process_for_each_test() no longer works because the
    engine is now in its own process group.

    Instead, let's add an explicit shutdown() method to LLM and invoke it to
    trigger engine cleanup.

@njhill
Member

njhill commented Feb 12, 2026

Thanks for all of the work on this @wseaton!

I'm still digesting this fully but have a few initial concerns/thoughts:

  • The changes feel a bit invasive to some of the core logic and make some of the core methods messier
  • I know this only covers single-process mode, but do we have a clear understanding of how this will generalize to multi-api-server / DP, etc.? If not, I wonder whether this makes sense to merge, since we may need to completely rework it anyhow
  • I have been helping with request pausing for RL the last couple of days which I think has some overlap and it would be good to unify. In particular that PR deals with the multi-api-server and DP cases.
  • I think multi-node probably needs separate consideration?
  • Also, how about the VLLM_ENABLE_V1_MULTIPROCESSING=0 case (no front-end/engine proc separation)?

I was wondering whether we could use the same mechanism as in 34125 to handle draining within the engine. So the flow could be like:

  • Engines accept a utility shutdown method with a grace period (maybe via a pipe like you have). The engine waits for in-progress requests to finish until the deadline and then exits the process. If the deadline is reached with requests still in flight, it aborts those requests and then exits (sketched after this list).
  • The shutdown process is coordinated by CoreEngineProcManager - it can set a flag when shutting down. It monitors all of the engine procs: when any die unexpectedly it should kill the others (actually this might depend on whether it's MoE), but if the exit is expected (shutdown flag is set), it continues to wait until all have exited. If not all have exited prior to the timeout, then kill them.
    • I think the mechanism we have for waiting for requests to finish will naturally work with DP since when the engines depend on each other they will still behave as if there are requests in flight (and so won't exit until all ranks are finished).
  • In 34125 we just queue requests while draining but it would be a small change to have a flag to reject new requests. This could be done in the engine and so then would require minimal front-end changes (and multi-api-server agnostic).
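
For concreteness, a sketch of the engine-side behavior described in the first bullet; has_requests(), step(), and abort_all_requests() are placeholders for whatever the engine/scheduler actually exposes:

```
import sys
import time


def drain_then_exit(engine, grace_period: float) -> None:
    # Keep stepping the engine so in-flight requests can finish, up to the deadline.
    deadline = time.monotonic() + grace_period
    while engine.has_requests() and time.monotonic() < deadline:
        engine.step()
    # Deadline reached with work still in flight: abort it rather than hang.
    if engine.has_requests():
        engine.abort_all_requests()
    sys.exit(0)
```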

a VLLM::EngineCore is left hanging around holding GPU memory

We should make this impossible; maybe not directly as part of this PR, but child procs should immediately exit if the parent dies first.

the engine core was not being properly cleaned up because the GC doesn't have time to collect the deleted LLM object

I am a bit concerned about adding explicit shutdown to the tests because I've found that in the past this masks GC issues, usually some circular ref which is preventing implicit GC/shutdown. For example, we ideally want vLLM to clean itself up if the LLM reference goes out of scope, or else we get various hangs / resource leaks.

Comment on lines +218 to +223
shutdown_mode: Literal["immediate", "drain"] = "immediate"
"""Shutdown mode: 'immediate' exits immediately on SIGTERM (default),
'drain' waits for in-flight requests to complete."""
shutdown_drain_timeout: int = 120
"""Seconds to wait for in-flight requests to complete during drain
shutdown mode."""
Member


Could we just have a single drain timeout arg? (0 means immediate)

Just thinking aloud but another possibility is to call it shutdown_grace_period
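
A tiny sketch of that alternative shape, mirroring the style of the quoted hunk (shutdown_grace_period is the name floated above, not the PR's current field):

```
shutdown_grace_period: float = 0.0
"""Seconds to wait for in-flight requests to finish on SIGTERM;
0 means shut down immediately (today's default behavior)."""
```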

