
[Enhancement] Escalate stage timeout to error #1558

Open
pi314ever wants to merge 4 commits into vllm-project:main from pi314ever:raise-stage-timeout

Conversation

@pi314ever
Contributor


Purpose

Escalates a stage initialization timeout to an error (previously a warning only). This prevents an invalid orchestrator state in which the orchestrator reports ready while individual stages are hanging or dead. Component of #1557, relating to issue #1346.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. State the reason if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


Signed-off-by: Daniel Huang <daniel1.huang@intel.com>

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2419589c8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +491 to +492
self.close()
raise TimeoutError


P1: Ensure timeout path actually cleans up started workers

This timeout path runs inside OmniBase.__init__ (which calls _start_stages then _wait_for_stages_ready), but _weak_finalizer is only set later in Omni.__init__/AsyncOmni.__init__ after super().__init__ returns (vllm_omni/entrypoints/omni.py and vllm_omni/entrypoints/async_omni.py). In the timeout case, self.close() is therefore a no-op and the subsequent raise TimeoutError exits construction with already-started stage processes/queues still alive, which can leak orphan workers and IPC resources whenever initialization times out.
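For illustration, a minimal self-contained sketch of this ordering problem (hypothetical names standing in for the real `OmniBase`/`Omni` and their stage processes, not the actual vllm_omni code):

```python
import weakref

LIVE_WORKERS = []  # stands in for real stage processes / IPC queues


class OmniBase:
    def __init__(self):
        self._weak_finalizer = None
        self._start_stages()           # workers are alive from here on
        self._wait_for_stages_ready()  # timeout path raises below

    def _start_stages(self):
        LIVE_WORKERS.append("stage-0")

    def _wait_for_stages_ready(self):
        # Timeout path: close() is a no-op because _weak_finalizer
        # has not been registered yet.
        self.close()
        raise TimeoutError

    def close(self):
        if self._weak_finalizer is not None:
            self._weak_finalizer()


class Omni(OmniBase):
    def __init__(self):
        super().__init__()  # TimeoutError propagates; next line never runs
        self._weak_finalizer = weakref.finalize(self, LIVE_WORKERS.clear)


try:
    Omni()
except TimeoutError:
    pass

print(LIVE_WORKERS)  # ['stage-0'] -- the "worker" leaked
```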

Useful? React with 👍 / 👎.

Contributor


@pi314ever @hsliuustc0106 I think the codex bot's comment makes sense here. Raising here means _weak_finalizer never takes effect, leaking sub-processes and other resources.

Contributor Author


@wuhang2014 I can think of two solutions to this.

First is to delay raising the timeout error until after the weakref is set up. This is the smallest code change, but it would be confusing because the place where the timeout is recognized is not where it is raised (loss of locality).

Second is to set up the weakref finalizer as early in the init pipeline as possible. This is in line with vLLM's MPClient implementation, where background resources are bundled together so that they can be finalized together. It would require some code changes to generalize the background-resources concept. A try/except block that catches a failed init and calls the finalizer before re-raising is also a good pattern to introduce here.
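A rough sketch of the second option, with hypothetical names (`FakeWorker` and the `stages_ready` flag stand in for the real stage processes/queues and readiness check):

```python
import weakref

STARTED = []  # records every worker ever started, for inspection


class FakeWorker:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class OmniBase:
    def __init__(self, stages_ready=True):
        # Register the finalizer over a plain resource bundle *before*
        # anything that can fail, as vLLM's MPClient does.  A static
        # callback avoids a strong reference back to self.
        self._resources = []
        self._weak_finalizer = weakref.finalize(
            self, OmniBase._cleanup, self._resources
        )
        try:
            self._start_stages()
            if not stages_ready:  # stands in for _wait_for_stages_ready
                raise TimeoutError("stages not ready")
        except BaseException:
            self._weak_finalizer()  # init failed: tear down what started
            raise

    def _start_stages(self):
        worker = FakeWorker()
        STARTED.append(worker)
        self._resources.append(worker)

    @staticmethod
    def _cleanup(resources):
        for r in resources:
            r.close()


try:
    OmniBase(stages_ready=False)
except TimeoutError:
    pass

print(STARTED[0].closed)  # True -- the worker was closed despite the raise
```

Because the finalizer only captures the resource list, it also still fires at garbage collection or interpreter exit if `close()` is never called explicitly.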

What are your thoughts on each approach?

Contributor


The second solution looks good to me. Aligning with vLLM is the least bad option.

@pi314ever
Contributor Author

@xuechendi

return profiler is not None
logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
self.close()
raise TimeoutError
Contributor


@hsliuustc0106 , would like to check your feedback on this. Thanks

formatted_suggestions = "\n".join(f" {i + 1}) {msg}" for i, msg in enumerate(suggestions))

logger.warning(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
Contributor


The suggestions list (line 481) still says "Ignore this warning if the model weight download / load from disk time is longer than {timeout}s." — that does not make sense now that the timeout is fatal. Drop it or reword to recommend increasing stage_init_timeout.

Contributor Author


I changed the suggestion to recommend increasing the timeout allowance if model weight download/load takes a long time.

Contributor

@lishunyang12 left a comment


Left a couple of comments. Also:

test_wait_for_stages_ready_timeout (and the diffusion variant) currently assert len(omni._stages_ready) == 0 after constructing Omni(...) — that will blow up now since __init__ raises TimeoutError before returning. Needs a pytest.raises(TimeoutError) wrapper.
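A sketch of the wrapped test, using a stand-in Omni (the real class and its stage_init_timeout knob live in vllm_omni; names here are illustrative):

```python
import pytest


class Omni:
    """Stand-in for vllm_omni's Omni; the real __init__ starts stages
    and now raises TimeoutError when they are not ready in time."""

    def __init__(self, stage_init_timeout=600):
        self._stages_ready = set()
        raise TimeoutError(
            f"0/2 stages ready after {stage_init_timeout}s"
        )


def test_wait_for_stages_ready_timeout():
    # The old assertion `len(omni._stages_ready) == 0` ran after
    # construction; that is unreachable now, so assert the raise instead.
    with pytest.raises(TimeoutError, match="stages ready"):
        Omni(stage_init_timeout=1)
```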

+1 on the codex bot's point about self.close() being a no-op here — _weak_finalizer isn't registered until Omni.__init__ returns from super().__init__(), so the cleanup never fires.

pi314ever and others added 3 commits February 27, 2026 23:48
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com>
Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@pi314ever
Contributor Author

@lishunyang12 I addressed your comments and fixed pytests. Regarding _weak_finalizer, I proposed two ideas above. Are any of those ideas acceptable to you?

self.close()
raise TimeoutError(
f"{self._name}: {len(self._stages_ready)}/{num_stages} stages ready after {timeout}s. Missing stages: {not_ready}"
)
Collaborator


Resource leak: self.close() here is a no-op because _weak_finalizer is not registered until Omni.__init__ returns from super().__init__() (see omni.py:524 and async_omni.py). The timeout occurs inside super().__init__(), so cleanup never fires and orphan workers/IPC resources leak. Consider setting up finalizer earlier or using try-finally pattern as discussed in comments.

