Error when using Hugging Face Trainer #9

@linyubupa

Is this project incompatible with Hugging Face's Trainer? When I use the Trainer, training gets stuck at "Initializing global memoery buffer." and then fails with the error below:
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=133, OpType=_ALLGATHER_BASE, NumelIn=86511616, NumelOut=173023232, Timeout(ms)=600000) ran for 600752 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
