Hey,
I am trying to do LoRA-based finetuning of the qwen3vl model. What I have noticed is that on a single GPU it works completely fine, but with multiple GPUs it tends to hang and then hits the NCCL watchdog timeout of 30 minutes:
[rank6]:[E217 21:41:15.510799683 ProcessGroupNCCL.cpp:685] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81220, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800019 milliseconds before timing out.
[rank7]:[E217 21:41:15.512111839 ProcessGroupNCCL.cpp:685] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81220, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800020 milliseconds before timing out.
I wanted to ask: did you experience this by chance? And is there a way to solve it? Any help would be appreciated.
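For reference, a common first mitigation for this class of hang is to raise the collective timeout and turn on NCCL debug logging so the stuck rank surfaces an error instead of waiting silently. This is only a sketch, assuming a torchrun-style launch where the script itself calls `init_process_group`; the environment-variable names are standard PyTorch/NCCL knobs, not anything specific to this repo:

```python
# Sketch: raise the NCCL watchdog timeout and enable debug output.
# Assumes a torchrun-style launch (RANK/WORLD_SIZE set by the launcher).
import os
from datetime import timedelta

# Print NCCL activity so the hanging collective can be identified.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Propagate async NCCL errors to Python instead of hanging the process.
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

# Timeout raised from the default 30 min to 2 h.
NCCL_TIMEOUT = timedelta(hours=2)

if "RANK" in os.environ:
    # Only initialize when actually launched under a distributed launcher.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)
```

A longer timeout only papers over the hang, of course; the debug output is what usually pinpoints which rank diverged (e.g. one rank skipping a collective due to uneven data).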
Best,
Pulkit