[Enhancement] Patch OmniStage.try_collect() with _proc alive checks #1560
pi314ever wants to merge 3 commits into vllm-project:main
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@ApsarasX PTAL
vllm_omni/entrypoints/omni_stage.py
Outdated
        request_id, engine_outputs (or engine_outputs_shm), and metrics.
    """
    assert self._out_q is not None
    if self._proc is not None and not self._proc.is_alive():
Race condition: is_alive() is checked before get_nowait(), so if the worker finishes its last batch and exits between these two calls, you'll throw away valid results still sitting in the queue. Flip the order — try the queue first, then check is_alive() only when the queue is empty:
- if self._proc is not None and not self._proc.is_alive():
+ try:
+     return self._out_q.get_nowait()
+ except queue.Empty:
+     pass
+ if self._proc is not None and not self._proc.is_alive():
+     raise RuntimeError("OmniStage Worker process died unexpectedly")
+ return None
Nice catch, fixed.
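For readers following the thread, the queue-first ordering suggested above can be exercised in isolation. This is only a sketch, not the merged code: `DeadWorker` is a hypothetical stand-in for an exited `multiprocessing.Process`, and the free function `try_collect(out_q, proc)` mirrors the method body from the suggestion so it runs without spawning a real worker.

```python
import queue
from typing import Any, Optional


class DeadWorker:
    """Hypothetical stand-in for a worker process that has already exited."""

    exitcode = 0

    def is_alive(self) -> bool:
        return False


def try_collect(out_q: Any, proc: Optional[Any]) -> Optional[Any]:
    """Read the queue first; treat a dead worker as an error only when it is empty."""
    try:
        return out_q.get_nowait()
    except queue.Empty:
        pass
    if proc is not None and not proc.is_alive():
        raise RuntimeError("OmniStage Worker process died unexpectedly")
    return None


# Simulate a worker that queued its last result and then exited normally.
out_q: Any = queue.Queue()
out_q.put({"request_id": "r1"})
worker = DeadWorker()

first = try_collect(out_q, worker)  # the queued result is still delivered
try:
    try_collect(out_q, worker)      # queue is empty and the worker is gone
    died = False
except RuntimeError:
    died = True
```

With the original ordering, the first call would have raised and the queued result would have been lost; here the error surfaces only on the second call, once nothing is left to read.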
vllm_omni/entrypoints/omni_stage.py
Outdated
    """
    assert self._out_q is not None
    if self._proc is not None and not self._proc.is_alive():
        raise RuntimeError("OmniStage Worker process died unexpectedly")
Nit: include self._proc.exitcode in the message — it is available once the process has terminated and makes debugging a lot easier.
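One possible shape for that message, as a sketch only: the helper name `worker_died_message` is hypothetical, and `SimpleNamespace` stands in for a terminated `multiprocessing.Process` so the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Any


def worker_died_message(proc: Any) -> str:
    """Build the error text, including the exit code of the terminated worker."""
    # For multiprocessing.Process, exitcode is None while the child is running,
    # 0 after a clean exit, and -N if the child was terminated by signal N.
    return f"OmniStage Worker process died unexpectedly (exitcode={proc.exitcode})"


# Stand-in for a worker killed by signal 9.
msg = worker_died_message(SimpleNamespace(exitcode=-9))
```

A negative value in the message points at a signal (for example an out-of-memory kill), which is usually the first thing worth knowing when a stage disappears.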
lishunyang12 left a comment
Left a couple comments. The main one is a race between the is_alive() check and the queue read — the current ordering can discard valid results that are already queued when the worker exits normally.
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
@lishunyang12 I resolved your comments with minor tweaks.
Purpose
Waiting for OmniStage involves checking the output queue for results. However, try_collect() does not check whether the worker process has died, so a caller polling it can hang indefinitely. This PR fixes the issue by explicitly checking that the process is alive before attempting to read the output queue. Component of #1557, relating to issue #1346.
Test Plan
Test Result