[Enhancement] Escalate stage timeout to error #1558
pi314ever wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2419589c8
vllm_omni/entrypoints/omni.py
Outdated

    self.close()
    raise TimeoutError
Ensure timeout path actually cleans up started workers
This timeout path runs inside OmniBase.__init__ (which calls _start_stages then _wait_for_stages_ready), but _weak_finalizer is only set later in Omni.__init__/AsyncOmni.__init__ after super().__init__ returns (vllm_omni/entrypoints/omni.py and vllm_omni/entrypoints/async_omni.py). In the timeout case, self.close() is therefore a no-op and the subsequent raise TimeoutError exits construction with already-started stage processes/queues still alive, which can leak orphan workers and IPC resources whenever initialization times out.
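The ordering problem described above can be illustrated with a minimal, self-contained sketch (class and helper names are simplified stand-ins for the vllm_omni code, not the actual implementation):

```python
import weakref

cleaned = {"done": False}

def shutdown_workers():
    # Stand-in for tearing down stage processes and IPC queues.
    cleaned["done"] = True

class OmniBase:
    # Simplified stand-in for vllm_omni's OmniBase.
    def __init__(self):
        self._weak_finalizer = None
        self._start_stages()
        self._wait_for_stages_ready()

    def _start_stages(self):
        pass  # would spawn stage worker processes here

    def _wait_for_stages_ready(self):
        # Timeout path: close() runs before the finalizer exists, so it is a no-op.
        self.close()
        raise TimeoutError

    def close(self):
        if self._weak_finalizer is not None:
            self._weak_finalizer()

class Omni(OmniBase):
    def __init__(self):
        super().__init__()  # TimeoutError propagates from here, so the
        # finalizer below is never registered:
        self._weak_finalizer = weakref.finalize(self, shutdown_workers)

try:
    Omni()
except TimeoutError:
    pass

print(cleaned["done"])  # False: nothing cleaned up the started workers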
@pi314ever @hsliuustc0106 I think codex bot's comment makes sense here. Raising here means _weak_finalizer loses its functionality, leaking sub-processes and other resources.
@wuhang2014 I can think of two solutions to this.
First is to delay raising the timeout error until after the weakref is set up. This is the least code change, but it is confusing because the place where the timeout is detected is not the place where the error is raised (loss of locality).
Second is to set up the weakref finalizer as early in the init pipeline as possible. This is in-line with vLLM's MPClient implementation where background resources are bundled together such that they can be finalized together. This would require some code change to generalize the background resources concept. The try-finally block that checks for proper init and calls finalizer if not successful is also a good pattern to introduce here.
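A minimal sketch of the second approach, registering the finalizer before any background resources start and pairing it with the try-finally pattern (hypothetical names; not the actual vllm_omni code):

```python
import weakref

cleaned = {"done": False}

def shutdown_workers():
    cleaned["done"] = True  # stand-in for real resource teardown

class OmniBase:
    def __init__(self):
        # Register the finalizer before any background resources exist,
        # in the spirit of vLLM's MPClient bundling of background resources.
        self._weak_finalizer = weakref.finalize(self, shutdown_workers)
        success = False
        try:
            self._start_stages()
            self._wait_for_stages_ready()
            success = True
        finally:
            if not success:
                self.close()  # now actually tears down started workers

    def _start_stages(self):
        pass  # would spawn stage worker processes here

    def _wait_for_stages_ready(self):
        raise TimeoutError("stages not ready")

    def close(self):
        self._weak_finalizer()

try:
    OmniBase()
except TimeoutError:
    pass

print(cleaned["done"])  # True: the timeout path cleaned up
```

Because `weakref.finalize` callbacks run at most once, calling `close()` explicitly on the failure path and again from the finalizer at garbage collection is safe.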
What are your thoughts on each approach?
The 2nd solution looks good to me. Aligning with vLLM is the least bad option.
vllm_omni/entrypoints/omni.py
Outdated

    return profiler is not None
    logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
    self.close()
    raise TimeoutError
@hsliuustc0106 , would like to check your feedback on this. Thanks
    formatted_suggestions = "\n".join(f" {i + 1}) {msg}" for i, msg in enumerate(suggestions))

-   logger.warning(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
+   logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
The suggestions list (line 481) still says "Ignore this warning if the model weight download / load from disk time is longer than {timeout}s." — that does not make sense now that the timeout is fatal. Drop it or reword to recommend increasing stage_init_timeout.
I changed the suggestion to increase timeout allowance if model weights download/load takes a long time.
lishunyang12 left a comment
Left a couple of comments. Also:
test_wait_for_stages_ready_timeout (and the diffusion variant) currently assert len(omni._stages_ready) == 0 after constructing Omni(...) — that will blow up now since __init__ raises TimeoutError before returning. Needs a pytest.raises(TimeoutError) wrapper.
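The required test change might look roughly like this (a sketch with a minimal stand-in `Omni`; the real test constructs the actual class with its model and stage arguments):

```python
import pytest

class Omni:
    # Minimal stand-in: the real Omni now raises TimeoutError from
    # __init__ when stages never become ready within stage_init_timeout.
    def __init__(self, stage_init_timeout=300):
        raise TimeoutError(f"0/1 stages ready after {stage_init_timeout}s")

def test_wait_for_stages_ready_timeout():
    # __init__ raises before returning, so the old post-construction
    # assert on omni._stages_ready can no longer run; wrap construction.
    with pytest.raises(TimeoutError):
        Omni(stage_init_timeout=0.1)

test_wait_for_stages_ready_timeout()
print("ok")
```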
+1 on the codex bot's point about self.close() being a no-op here — _weak_finalizer isn't registered until Omni.__init__ returns from super().__init__(), so the cleanup never fires.
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com> Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@lishunyang12 I addressed your comments and fixed pytests. Regarding
    self.close()
    raise TimeoutError(
        f"{self._name}: {len(self._stages_ready)}/{num_stages} stages ready after {timeout}s. Missing stages: {not_ready}"
    )
Resource leak: self.close() here is a no-op because _weak_finalizer is not registered until Omni.__init__ returns from super().__init__() (see omni.py:524 and async_omni.py). The timeout occurs inside super().__init__(), so cleanup never fires and orphan workers/IPC resources leak. Consider setting up finalizer earlier or using try-finally pattern as discussed in comments.
Purpose
Escalates a stage initialization timeout from a warning to an error. This prevents an invalid orchestrator state in which the orchestrator reports ready while individual stages are hanging or dead. Component of #1557, relating to issue #1346.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.