Open
Description
Is this project unable to use Hugging Face's Trainer? When using Trainer, I just get stuck on "Initializing global memory buffer." and then get the error below:
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=133, OpType=_ALLGATHER_BASE, NumelIn=86511616, NumelOut=173023232, Timeout(ms)=600000) ran for 600752 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.