Commit 3a27b15

[bugfix] Fix Qwen3-30B-A3B dp parallel hung issue when running with the dp parallel example (#3287)
### What this PR does / why we need it?

Fix the hang that occurs when running Qwen3-30B-A3B with the data parallel (DP) example. For large models (Qwen3-30B and above), weight loading alone takes 4 to 5 minutes. The 5-minute timeout in the current example code is therefore too short: some DP instances are killed prematurely, and the run eventually gets stuck in the DP synchronization all-reduce operation.

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

NA

- vLLM version: v0.11.0rc3
- vLLM main: vllm-project/vllm@releases/v0.11.0

---------

Signed-off-by: leo-pony <[email protected]>
1 parent a486ff8 commit 3a27b15

1 file changed, +2 -2 lines changed

examples/offline_data_parallel.py

Lines changed: 2 additions & 2 deletions
@@ -244,10 +244,10 @@ def start(rank):
         procs.append(proc)
     exit_code = 0
     for proc in procs:
-        proc.join(timeout=300)
+        proc.join(timeout=900)
         if proc.exitcode is None:
             print(
-                f"Killing process {proc.pid} that didn't stop within 5 minutes."
+                f"Killing process {proc.pid} that didn't stop within 15 minutes."
             )
             proc.kill()
             exit_code = 1
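
The change above has to edit the timeout value and the message text together. As a minimal sketch (not part of this commit), the same join-then-kill loop can take the timeout as a parameter so the two cannot drift apart. The helper names `join_or_kill` and `join_timeout_from_env`, and the environment variable `DP_JOIN_TIMEOUT_S`, are hypothetical and are not vLLM settings; the loop relies only on `join(timeout=...)`, `exitcode`, `kill()`, and `pid` as found on `multiprocessing.Process`.

```python
import os


def join_timeout_from_env(default_s: int = 900) -> int:
    """Read the join timeout in seconds.

    DP_JOIN_TIMEOUT_S is a hypothetical variable name used for illustration;
    the default of 900 matches the value introduced by this commit.
    """
    raw = os.environ.get("DP_JOIN_TIMEOUT_S")
    return int(raw) if raw else default_s


def join_or_kill(procs, timeout_s: int) -> int:
    """Join each process; kill any that is still running after timeout_s.

    `procs` is an iterable of multiprocessing.Process-like objects.
    Returns 1 if at least one process had to be killed, otherwise 0.
    """
    exit_code = 0
    for proc in procs:
        proc.join(timeout=timeout_s)
        # exitcode stays None while the process has not terminated.
        if proc.exitcode is None:
            print(
                f"Killing process {proc.pid} that didn't stop "
                f"within {timeout_s} seconds."
            )
            proc.kill()
            exit_code = 1
    return exit_code
```

In the example script this would be called as `exit_code = join_or_kill(procs, join_timeout_from_env())` after all DP processes have been started. Because the processes are joined one after another, the timeout applies to each `join` call separately rather than to the run as a whole.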
