LoRA-based multi-GPU training issue: Watchdog timeout #243

@pulkitkumar95

Description

Hey,

I am trying to do LoRA-based fine-tuning of the qwen3vl model. What I have noticed is that on a single GPU it works completely fine, but on multiple GPUs it tends to hang and then hits the 30-minute NCCL watchdog timeout:

[rank6]:[E217 21:41:15.510799683 ProcessGroupNCCL.cpp:685] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81220, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800019 milliseconds before timing out.
[rank7]:[E217 21:41:15.512111839 ProcessGroupNCCL.cpp:685] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81220, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800020 milliseconds before timing out.
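For reference, here is a minimal sketch of what I am trying while investigating: enabling NCCL debug logging and raising the collective timeout above the 30-minute default, via the `timeout` parameter of `torch.distributed.init_process_group`. The single-process `gloo` setup below is only illustrative (so it runs without GPUs); in the real run these values come from `torchrun` and the backend is `nccl`.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Illustrative single-process rendezvous; in a real launch, torchrun
# sets MASTER_ADDR/MASTER_PORT (and RANK/WORLD_SIZE) for us.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Surface per-collective NCCL logs instead of a silent hang.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Raise the watchdog timeout from the default 30 minutes while debugging,
# so slow-but-correct steps are not killed mid-collective.
dist.init_process_group(
    backend="gloo",  # "nccl" in the actual multi-GPU run
    rank=0,
    world_size=1,
    timeout=timedelta(hours=2),
)

assert dist.is_initialized()
dist.destroy_process_group()
```

Raising the timeout only buys time for diagnosis; since the hang is on an `ALLGATHER` with `NumelIn=1`, my working assumption is that the ranks diverge (e.g. one rank skips a collective or takes a different code path), so the debug logs are the more important part.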

Did you happen to experience this by any chance? And is there a way to solve it? Any help would be appreciated.

Best,
Pulkit
