[Enhancement] Escalate stage timeout to error #1558
pi314ever wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2419589c8
vllm_omni/entrypoints/omni.py
Outdated

    self.close()
    raise TimeoutError
Ensure timeout path actually cleans up started workers
This timeout path runs inside OmniBase.__init__ (which calls _start_stages then _wait_for_stages_ready), but _weak_finalizer is only set later in Omni.__init__/AsyncOmni.__init__ after super().__init__ returns (vllm_omni/entrypoints/omni.py and vllm_omni/entrypoints/async_omni.py). In the timeout case, self.close() is therefore a no-op and the subsequent raise TimeoutError exits construction with already-started stage processes/queues still alive, which can leak orphan workers and IPC resources whenever initialization times out.
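The ordering problem described above can be illustrated with a minimal, self-contained sketch (class and helper names are simplified stand-ins for the vllm_omni code, not the actual implementation):

```python
import weakref

cleaned = {"done": False}

def shutdown_workers():
    # Stand-in for tearing down stage processes and IPC queues.
    cleaned["done"] = True

class OmniBase:
    # Simplified stand-in for vllm_omni's OmniBase.
    def __init__(self):
        self._weak_finalizer = None
        self._start_stages()
        self._wait_for_stages_ready()

    def _start_stages(self):
        pass  # would spawn stage worker processes here

    def _wait_for_stages_ready(self):
        # Timeout path: close() runs before the finalizer exists, so it is a no-op.
        self.close()
        raise TimeoutError

    def close(self):
        if self._weak_finalizer is not None:
            self._weak_finalizer()

class Omni(OmniBase):
    def __init__(self):
        super().__init__()  # TimeoutError propagates from here, so the
        # finalizer below is never registered:
        self._weak_finalizer = weakref.finalize(self, shutdown_workers)

try:
    Omni()
except TimeoutError:
    pass

print(cleaned["done"])  # False: nothing cleaned up the started workers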
@pi314ever @hsliuustc0106 I think codex bot's comment makes sense here. Raising here means _weak_finalizer loses its functionality, leaking sub-processes and other resources.
@wuhang2014 I can think of two solutions to this.
First is to delay raising the timeout error until after the weakref is set up. This is the least code change, but it is confusing because the place where the timeout is detected is not the place where the error is raised (loss of locality).
Second is to set up the weakref finalizer as early in the init pipeline as possible. This is in-line with vLLM's MPClient implementation where background resources are bundled together such that they can be finalized together. This would require some code change to generalize the background resources concept. The try-finally block that checks for proper init and calls finalizer if not successful is also a good pattern to introduce here.
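A minimal sketch of the second approach, registering the finalizer before any background resources start and pairing it with the try-finally pattern (hypothetical names; not the actual vllm_omni code):

```python
import weakref

cleaned = {"done": False}

def shutdown_workers():
    cleaned["done"] = True  # stand-in for real resource teardown

class OmniBase:
    def __init__(self):
        # Register the finalizer before any background resources exist,
        # in the spirit of vLLM's MPClient bundling of background resources.
        self._weak_finalizer = weakref.finalize(self, shutdown_workers)
        success = False
        try:
            self._start_stages()
            self._wait_for_stages_ready()
            success = True
        finally:
            if not success:
                self.close()  # now actually tears down started workers

    def _start_stages(self):
        pass  # would spawn stage worker processes here

    def _wait_for_stages_ready(self):
        raise TimeoutError("stages not ready")

    def close(self):
        self._weak_finalizer()

try:
    OmniBase()
except TimeoutError:
    pass

print(cleaned["done"])  # True: the timeout path cleaned up
```

Because `weakref.finalize` callbacks run at most once, calling `close()` explicitly on the failure path and again from the finalizer at garbage collection is safe.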
What are your thoughts on each approach?
The 2nd solution looks good to me. Aligning with vLLM is the least bad option.
vllm_omni/entrypoints/omni.py
Outdated

    return profiler is not None
    logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
    self.close()
    raise TimeoutError
@hsliuustc0106 , would like to check your feedback on this. Thanks
    formatted_suggestions = "\n".join(f" {i + 1}) {msg}" for i, msg in enumerate(suggestions))

-   logger.warning(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
+   logger.error(f"[{self._name}] Stage initialization timeout. Troubleshooting Steps:\n{formatted_suggestions}")
The suggestions list (line 481) still says "Ignore this warning if the model weight download / load from disk time is longer than {timeout}s." — that does not make sense now that the timeout is fatal. Drop it or reword to recommend increasing stage_init_timeout.
I changed the suggestion to increase timeout allowance if model weights download/load takes a long time.
lishunyang12 left a comment
Left a couple of comments. Also:
test_wait_for_stages_ready_timeout (and the diffusion variant) currently assert len(omni._stages_ready) == 0 after constructing Omni(...) — that will blow up now since __init__ raises TimeoutError before returning. Needs a pytest.raises(TimeoutError) wrapper.
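The required test change might look roughly like this (a sketch with a minimal stand-in `Omni`; the real test constructs the actual class with its model and stage arguments):

```python
import pytest

class Omni:
    # Minimal stand-in: the real Omni now raises TimeoutError from
    # __init__ when stages never become ready within stage_init_timeout.
    def __init__(self, stage_init_timeout=300):
        raise TimeoutError(f"0/1 stages ready after {stage_init_timeout}s")

def test_wait_for_stages_ready_timeout():
    # __init__ raises before returning, so the old post-construction
    # assert on omni._stages_ready can no longer run; wrap construction.
    with pytest.raises(TimeoutError):
        Omni(stage_init_timeout=0.1)

test_wait_for_stages_ready_timeout()
print("ok")
```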
+1 on the codex bot's point about self.close() being a no-op here — _weak_finalizer isn't registered until Omni.__init__ returns from super().__init__(), so the cleanup never fires.
Co-authored-by: SYLAR <125541396+lishunyang12@users.noreply.github.com> Signed-off-by: Daniel Huang <pilotflyer824@gmail.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@lishunyang12 I addressed your comments and fixed pytests. Regarding
    self.close()
    raise TimeoutError(
        f"{self._name}: {len(self._stages_ready)}/{num_stages} stages ready after {timeout}s. Missing stages: {not_ready}"
    )
Resource leak: self.close() here is a no-op because _weak_finalizer is not registered until Omni.__init__ returns from super().__init__() (see omni.py:524 and async_omni.py). The timeout occurs inside super().__init__(), so cleanup never fires and orphan workers/IPC resources leak. Consider setting up finalizer earlier or using try-finally pattern as discussed in comments.
Purpose
Escalates a stage initialization timeout from a warning to an error. This prevents an invalid orchestrator state in which the orchestrator reports ready while individual stages are hanging or dead. Component of #1557, relating to issue #1346.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.