LoRA-based multi-GPU training issue: Watchdog timeout #243

@pulkitkumar95

Description

Hey,

I am trying to do LoRA-based fine-tuning of the qwen3vl model. What I have noticed is that on a single GPU it works completely fine, but on multiple GPUs it tends to hang and then hits the 30-minute NCCL watchdog timeout:

[rank6]:[E217 21:41:15.510799683 ProcessGroupNCCL.cpp:685] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81220, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800019 milliseconds before timing out.
[rank7]:[E217 21:41:15.512111839 ProcessGroupNCCL.cpp:685] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=81220, OpType=ALLGATHER, NumelIn=1, NumelOut=8, Timeout(ms)=1800000) ran for 1800020 milliseconds before timing out.
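For reference, here is a minimal sketch of what I am trying while investigating: enabling NCCL debug logging and raising the collective timeout above the 30-minute default, via the `timeout` parameter of `torch.distributed.init_process_group`. The single-process `gloo` setup below is only illustrative (so it runs without GPUs); in the real run these values come from `torchrun` and the backend is `nccl`.

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Illustrative single-process rendezvous; in a real launch, torchrun
# sets MASTER_ADDR/MASTER_PORT (and RANK/WORLD_SIZE) for us.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Surface per-collective NCCL logs instead of a silent hang.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Raise the watchdog timeout from the default 30 minutes while debugging,
# so slow-but-correct steps are not killed mid-collective.
dist.init_process_group(
    backend="gloo",  # "nccl" in the actual multi-GPU run
    rank=0,
    world_size=1,
    timeout=timedelta(hours=2),
)

assert dist.is_initialized()
dist.destroy_process_group()
```

Raising the timeout only buys time for diagnosis; since the hang is on an `ALLGATHER` with `NumelIn=1`, my working assumption is that the ranks diverge (e.g. one rank skips a collective or takes a different code path), so the debug logs are the more important part.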

Did you happen to experience this by any chance? And is there a way to solve it? Any help would be appreciated.

Best,
Pulkit
