
[Frontend] Enable drain shutdown mode for non-DP deployments#32420

Open
wseaton wants to merge 56 commits into vllm-project:main from wseaton:kv-transfer-drain

Conversation

@wseaton
Contributor

@wseaton wseaton commented Jan 15, 2026

Supersedes #32334

This is an attempt at an implementation of the graceful shutdown RFC by @markmc (#24885) that does not change default behavior.

Sequence diagram showing the drain shutdown flow:

sequenceDiagram
    participant Client as Client Requests
    participant API as API Server
    participant Launcher as Signal Handler
    participant Core as EngineCore

    Note over Client,Core: Normal Operation
    Client->>API: Incoming requests
    API->>Core: Forward to engine

    Note over Client,Core: First SIGTERM - Graceful Shutdown
    Launcher->>Launcher: SIGTERM received
    Launcher->>API: set_rejecting_requests(True)
    API--xClient: 503 Service Unavailable
    
    Launcher->>Core: DRAIN via shutdown pipe
    Core->>Core: _drain_requested = True
    loop Busy loop continues
        Core->>Core: Process remaining requests
    end
    Note over Core: scheduler.has_requests() = False
    Core->>Core: "Drain complete, exiting"
    Core->>Core: Process exits
    Launcher->>Launcher: engine_dead detected

    Launcher->>API: Cancel server task
    Note over Client,Core: Clean shutdown

    Note over Client,Core: Second SIGTERM - Immediate Shutdown
    rect rgb(255, 230, 230)
        Launcher->>Launcher: SIGTERM while draining
        Launcher->>API: Cancel server task immediately
    end

@wseaton wseaton changed the title [2/x][Frontend] Also consider async kv transfers in flight for graceful drain decision [2/x][Frontend] Draft: Also consider async kv transfers in flight for graceful drain decision Jan 15, 2026
@mergify

mergify bot commented Jan 15, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wseaton.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 15, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a graceful shutdown mechanism, which is a crucial feature for production deployments. The implementation is thorough, considering not only in-flight requests but also pending asynchronous KV cache transfers. The changes are well-structured, with clear separation of concerns between the launcher, engine, and scheduler. The addition of readiness and liveness probes (/health and /live) is a great touch and follows best practices for services running on platforms like Kubernetes. The related tests for the new scheduler logic are comprehensive and correctly validate the new functionality. Overall, this is a high-quality contribution that significantly improves the robustness of the vLLM server.


@cursor cursor bot left a comment



async def graceful_signal_handler() -> None:
"""Async wrapper for graceful shutdown."""
await graceful_drain()
signal_handler()

Graceful shutdown may fail to terminate server on exception

High Severity

The graceful_signal_handler() function calls await graceful_drain() before signal_handler(), but lacks a try-finally block. If graceful_drain() raises an exception before its inner try block (lines 98-114 access engine_client.model_config, call get_num_unfinished_requests(), and call set_server_draining()), signal_handler() is never invoked. This causes the server to hang indefinitely on SIGTERM/SIGINT instead of shutting down.
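
One possible shape of the guard being described here, reusing the names from the snippet above (a sketch only, not necessarily the fix adopted in this PR):

```
async def graceful_signal_handler() -> None:
    """Async wrapper for graceful shutdown."""
    try:
        await graceful_drain()
    finally:
        # Even if the drain path raises, still tell the server to shut down.
        signal_handler()
```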


@wseaton wseaton changed the title [2/x][Frontend] Draft: Also consider async kv transfers in flight for graceful drain decision [Frontend] Enable graceful scaledown Jan 19, 2026
@wseaton
Contributor Author

wseaton commented Jan 19, 2026

Needs a rebase, an integration test, and a docs update; should be able to get to those today

@mergify

mergify bot commented Jan 19, 2026

Documentation preview: https://vllm--32420.org.readthedocs.build/en/32420/

@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 19, 2026
@mergify mergify bot removed the needs-rebase label Jan 19, 2026
@mergify

mergify bot commented Jan 19, 2026

Hi @wseaton, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Member

@markmc markmc left a comment


Great stuff, Will 👍

High-level comments ...

Thinking about terminology, we’re introducing two non-obvious (and not obviously related!) terms here - “graceful” and “drain”. Also, drain timeout isn’t obviously related to shutdown.

Also related, adding this as a new “shutdown mode” would give more flexibility for other future options. So, I suggest:

--shutdown-mode <immediate|drain> --shutdown-drain-timeout <default:120s>
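
For illustration, a hedged sketch of how such flags could be declared with plain argparse (the option names follow the suggestion above; this is not the PR's actual CLI code):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--shutdown-mode",
    choices=["immediate", "drain"],
    default="immediate",
    help="How SIGTERM is handled: exit right away, or drain in-flight requests first.",
)
parser.add_argument(
    "--shutdown-drain-timeout",
    type=int,
    default=120,
    help="Seconds to wait for in-flight requests to finish in drain mode.",
)
```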

Unless I’m missing something, the HTTP 503 return on /health is a good start for e.g. load-balancers to detect draining pods etc. - I’d like to remove the Prometheus metric from this PR, and consider it separately. I’m not totally against it, it’s just that it has the potential to be a rabbit hole in its own right.

Also, expand the PR description with more background please 👍

Caveat - I want to think some more about the engine core signal handling, and the death pipe stuff. In general, I don’t think the child processes of the frontend API server should respond to SIGINT/SIGTERM at all, instead let only the parent control the child lifecycle. And I think that’s also true in the library case too.

@wseaton
Contributor Author

wseaton commented Jan 22, 2026

Caveat - I want to think some more about the engine core signal handling, and the death pipe stuff. In general, I don’t think the child processes of the frontend API server should respond to SIGINT/SIGTERM at all, instead let only the parent control the child lifecycle. And I think that’s also true in the library case too.

I am torn on the sole use of the death pipe; I think it is cleaner to add IPC signaling and use the existing control channel that we have, since this tells us that the parent process is triggering the shutdown of its own volition vs. a crash.

We still need the death pipe as a fallback in case the parent process crashes, but overall this is cleaner.

wseaton and others added 12 commits February 9, 2026 20:45
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
…mq asserts, better comments

Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
Signed-off-by: Will Eaton <[email protected]>
There's no reason to use getattr() or repeat the default values for
args, since they will always be present with the default specified
in their declaration.

Fixes:

```
File "/home/markmc/vllm-project/vllm/vllm/entrypoints/openai/cli_args.py", line 338, in validate_parsed_serve_args
  if getattr(args, "api_server_count", 1) > 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>' not supported between instances of 'NoneType' and 'int'
```

Signed-off-by: Mark McLoughlin <[email protected]>
…en TP>1 by isolating process with setpgrp()

Signed-off-by: Will Eaton <[email protected]>
Member

@markmc markmc left a comment


Thanks, it definitely seems to be behaving better with Ctrl-C now


engine_core: EngineCoreProc | None = None
if shutdown_pipe is not None:
# New process group so terminal Ctrl-C doesn't kill workers.
Member


Not just the workers, the engine is also in this new process group

engine_core: EngineCoreProc | None = None
if shutdown_pipe is not None:
# New process group so terminal Ctrl-C doesn't kill workers.
os.setpgrp()
Member


This would also seem appropriate to do in the API server where num_api_servers > 1 but I guess it's not as big a deal with the immediate shutdown mode (and we don't yet support multiple API servers with the drain shutdown mode)

# New process group so terminal Ctrl-C doesn't kill workers.
os.setpgrp()
else:
signal.signal(signal.SIGINT, signal_handler)
Member


There's actually no case where shutdown_pipe is None right? (But we allow it in the type hint)

Contributor Author


This is correct currently yes
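
For context, a minimal standalone sketch (plain Python, Unix-only, not vLLM code) of the process-group isolation discussed in this thread: the child detaches into its own process group, so a terminal Ctrl-C (which signals the whole foreground group) never reaches it, and only the parent controls when it exits.

```
import os
import signal
import time
from multiprocessing import get_context


def child_proc() -> None:
    # Detach from the terminal's foreground process group; a Ctrl-C typed in
    # the terminal (SIGINT to the foreground group) no longer reaches us.
    os.setpgrp()
    signal.pause()  # idle until the parent terminates us explicitly


if __name__ == "__main__":
    proc = get_context("spawn").Process(target=child_proc)
    proc.start()
    time.sleep(1)     # a Ctrl-C during this window only hits the parent
    proc.terminate()  # the parent deliberately ends the child's lifecycle
    proc.join()
```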

@markmc
Member

markmc commented Feb 10, 2026

Future follow-ups:

  • --api-server-count > 1 drain support
  • --data-parallel-size > 1 drain support
  • The FIXME in run_headless() - using a pipe to break out of join_first() when a signal is received by the parent
  • Put the API servers in their own process group in the multi-api-server case, so Ctrl-C only signals the parent
  • Revisit vllm:server_draining gauge if you think there's a strong case for it

From an earlier comment:

In the multi-api-server scenario, I think I expect (a rough sketch follows the list):

  1. Top-level parent process receives SIGTERM
  2. Parent instructs API servers to reject new work
  3. Parent instructs engine processes to drain and shutdown
  4. Parent waits for the engines to terminate
  5. Parent instructs the API server processes to shutdown
  6. Parent exits when all API servers have terminated
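
A rough sketch of that six-step coordination; the api_servers / engines handles and their methods (set_rejecting_requests, send_drain, and so on) are hypothetical placeholders, not existing vLLM APIs:

```
import time


def parent_sigterm_handler(api_servers, engines, drain_timeout: float = 120.0) -> None:
    # 2. Tell API servers to stop accepting new work (e.g. return 503).
    for api in api_servers:
        api.set_rejecting_requests(True)
    # 3. Ask each engine to drain in-flight requests and then shut down.
    for eng in engines:
        eng.send_drain()
    # 4. Wait for the engines to terminate, bounded by the drain timeout.
    deadline = time.monotonic() + drain_timeout
    for eng in engines:
        eng.join(timeout=max(0.0, deadline - time.monotonic()))
        if eng.is_alive():
            eng.terminate()  # drain timed out; fall back to immediate shutdown
    # 5. Only now shut the API server processes down.
    for api in api_servers:
        api.shutdown()
    # 6. The parent returns and exits once everything has terminated.
```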

@markmc markmc changed the title [Frontend] Enable drain scaledown mode for single process deployments [Frontend] Enable drain shutdown mode for non-DP deployments Feb 10, 2026
@markmc
Member

markmc commented Feb 10, 2026

@njhill asked:

Could we hold off ~1 more day? One thing is that I am helping with the pausing-for-RL-update PR #34125 and want to check for synergies w.r.t. waiting for running reqs to finish

@markmc
Member

markmc commented Feb 11, 2026

I've investigated the basic-models-tests-other test failure

With some effort, I was able to reproduce the following test passing on main and failing with this PR:

pytest -v -s tests/models/test_terratorch.py::test_inference[ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11] tests/models/test_terratorch.py::test_inference[mgazz/Prithvi_v2_eo_300_tl_unet_agb]

The first test passes, and the second fails at engine startup with:

(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/executor/uniproc_executor.py", line 47, in _init_executor
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     self.driver_worker.init_device()
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/worker/worker_base.py", line 322, in init_device
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/worker/gpu_worker.py", line 253, in init_device
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]   File "/home/markmc/vllm-project/vllm/vllm/v1/worker/utils.py", line 102, in request_memory
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059]     raise ValueError(
(EngineCore_DP0 pid=1011625) ERROR 02-11 08:40:57 [core.py:1059] ValueError: Free memory on device cuda:0 (33.8/39.38 GiB) on startup is less than desired GPU memory utilization (0.9, 35.44 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.

and a VLLM::EngineCore is left hanging around holding GPU memory

To reproduce locally, I had to:

  • pip install -U 'terratorch @ git+https://github.com/IBM/terratorch.git@07184fcf91a1324f831ff521dd238d97fe350e3e'
  • pip uninstall xformers - having this installed was causing "We must use spawn... Reasons: CUDA is initialized" and the issue didn't reproduce
  • I always need [Test] Fix pytest termination with @create_new_process_for_each_test("fork") #29130 to make forked tests work on my machine; I've never figured out exactly why

@wseaton
Contributor Author

wseaton commented Feb 12, 2026

Thanks @markmc for the test reproduction, I'll take a look tomorrow EST and see if I can't narrow it down.

When using VllmRunner in a @create_new_process_for_each_test("fork") test,
the engine core was not being properly cleaned up: the GC doesn't have
time to collect the deleted LLM object, and the kill(TERM) to the process
group in fork_new_process_for_each_test() no longer works because the
engine is now in its own process group.

Instead, let's add an explicit shutdown() method to LLM and invoke it to
trigger engine cleanup.

Signed-off-by: Mark McLoughlin <[email protected]>
@markmc
Member

markmc commented Feb 12, 2026

It looks like we've been relying on (in forked tests) a SIGTERM to the engine core in order to clean up. I've attempted to fix with:

    [LLMEngine] Add shutdown() method for explicit shutdown

    When using VllmRunner in a @create_new_process_for_each_test("fork") test,
    the engine core was not being properly cleaned up: the GC doesn't have
    time to collect the deleted LLM object, and the kill(TERM) to the process
    group in fork_new_process_for_each_test() no longer works because the
    engine is now in its own process group.

    Instead, let's add an explicit shutdown() method to LLM and invoke it to
    trigger engine cleanup.

@njhill
Member

njhill commented Feb 12, 2026

Thanks for all of the work on this @wseaton!

I'm still digesting this fully but have a few initial concerns/thoughts:

  • The changes feel a bit invasive to some of the core logic and make some of the core methods messier
  • I know this only covers single-process mode, but do we have a clear understanding of how this will generalize to multi-api-server / DP, etc.? If not, I wonder whether this makes sense to merge, since we may need to completely rework it anyhow
  • I have been helping with request pausing for RL the last couple of days which I think has some overlap and it would be good to unify. In particular that PR deals with the multi-api-server and DP cases.
  • I think multi-node probably needs separate consideration?
  • Also, how about the VLLM_ENABLE_V1_MULTIPROCESSING=0 case (no front-end/engine proc separation)?

I was wondering whether we could use the same mechanism as in 34125 to handle draining within the engine. So the flow could be like:

  • Engines accept a utility shutdown method with a grace period (maybe via a pipe like you have). The engine waits for in-progress requests to finish until the deadline and then exits the process. If the deadline is reached with requests still in flight, it aborts those requests and then exits (sketched after this list).
  • The shutdown process is coordinated by CoreEngineProcManager - it can set a flag when shutting down. It monitors all of the engine procs: when any die unexpectedly it should kill the others (actually this might depend on whether it's MoE), but if the exit is expected (shutdown flag is set), it continues to wait until all have exited. If not all have exited prior to the timeout, then kill them.
    • I think the mechanism we have for waiting for requests to finish will naturally work with DP since when the engines depend on each other they will still behave as if there are requests in flight (and so won't exit until all ranks are finished).
  • In 34125 we just queue requests while draining but it would be a small change to have a flag to reject new requests. This could be done in the engine and so then would require minimal front-end changes (and multi-api-server agnostic).
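
For concreteness, a sketch of the engine-side behavior described in the first bullet; has_requests(), step(), and abort_all_requests() are placeholders for whatever the engine/scheduler actually exposes:

```
import sys
import time


def drain_then_exit(engine, grace_period: float) -> None:
    # Keep stepping the engine so in-flight requests can finish, up to the deadline.
    deadline = time.monotonic() + grace_period
    while engine.has_requests() and time.monotonic() < deadline:
        engine.step()
    # Deadline reached with work still in flight: abort it rather than hang.
    if engine.has_requests():
        engine.abort_all_requests()
    sys.exit(0)
```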

a VLLM::EngineCore is left hanging around holding GPU memory

We should make this impossible; maybe not directly as part of this PR, but child procs should immediately exit if the parent dies first.

the engine core was not being properly cleaned up because the GC doesn't have time to collect the deleted LLM object

I am a bit concerned about adding explicit shutdown to the tests because I've found that in the past this masks GC issues, usually some circular ref which is preventing implicit GC/shutdown. For example, we ideally want vLLM to clean itself up if the LLM reference goes out of scope, or else we get various hangs / resource leaks.

Comment on lines +218 to +223
shutdown_mode: Literal["immediate", "drain"] = "immediate"
"""Shutdown mode: 'immediate' exits immediately on SIGTERM (default),
'drain' waits for in-flight requests to complete."""
shutdown_drain_timeout: int = 120
"""Seconds to wait for in-flight requests to complete during drain
shutdown mode."""
Member


Could we just have a single drain timeout arg? (0 means immediate)

Just thinking aloud but another possibility is to call it shutdown_grace_period
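
A tiny sketch of that alternative shape, mirroring the style of the quoted hunk (shutdown_grace_period is the name floated above, not the PR's current field):

```
shutdown_grace_period: float = 0.0
"""Seconds to wait for in-flight requests to finish on SIGTERM;
0 means shut down immediately (today's default behavior)."""
```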

