
The error happened when I did multi-node distributed training #180

@ShangWeiKuo

Description


πŸ› Describe the bug

Excuse me. When I run the command

```
colossalai run --nproc_per_node 4 --host [host1 ip addr],[host2 ip addr] --master_addr [host1 ip addr] train.py
```

I get this message:

```
Error: failed to run torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=[host1 ip addr]:29500 --rdzv_id=colossalai-default-job train.py on [host2 ip addr]
```
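For context, the printed command is what the launcher tried to execute on the second node: as far as I know, `colossalai run` dispatches the generated `torchrun` command to every host in `--host` over SSH, so this error usually means host2 could not run `torchrun` at all (no non-interactive SSH access, or `torchrun` missing from the remote login shell's PATH, e.g. a different conda/venv on that node). Below is a quick sanity-check sketch; the hostname is a placeholder, and the SSH-based dispatch is an assumption about the launcher rather than something confirmed in this report.

```python
# Hypothetical prerequisite check for the second node. Assumes the launcher
# reaches each host via non-interactive SSH (an assumption, not confirmed
# here) and that torchrun must be on the remote PATH.
import subprocess

REMOTE_HOST = "host2"  # placeholder: replace with [host2 ip addr]

# 1) Non-interactive SSH must succeed (BatchMode=yes forbids password prompts,
#    which a launcher cannot answer).
ssh_ok = subprocess.run(
    ["ssh", "-o", "BatchMode=yes", REMOTE_HOST, "true"],
).returncode == 0

# 2) torchrun must be resolvable in the remote login shell's PATH.
torchrun_ok = subprocess.run(
    ["ssh", "-o", "BatchMode=yes", REMOTE_HOST, "which torchrun"],
    capture_output=True,
).returncode == 0

print(f"ssh reachable: {ssh_ok}, torchrun found: {torchrun_ok}")
```

If either check fails, the same `colossalai run` invocation would fail in the way shown above before any training code in train.py is reached.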

What configurations do I have to set in the train.py you provided?
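On the script side: when launched through `torchrun`, each worker process receives the standard rendezvous environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), and train.py mainly has to consume them. Here is a minimal sketch using plain `torch.distributed`; the Colossal-AI examples of this era typically call `colossalai.launch_from_torch(...)` instead, which, as far as I know, performs equivalent initialization from the same variables, but the exact config contents depend on the example's config file.

```python
# Minimal sketch of a torchrun-compatible train.py, assuming the standard
# torchrun environment variables set by the launcher. The training logic
# itself is omitted; this only shows the distributed setup.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets these for every worker process it spawns.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # NCCL backend for multi-GPU training; with the default env:// init
    # method, MASTER_ADDR and MASTER_PORT are read from the environment.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    print(f"rank {rank}/{world_size} up on local GPU {local_rank}")

    # ... build model, wrap with DDP or Colossal-AI, train ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script shaped like this should start under the same `colossalai run` command quoted above once both nodes can actually execute `torchrun`.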

Environment

CUDA Version: 11.3
PyTorch Version: 1.12.0
CUDA Version in PyTorch Build: 11.3
PyTorch CUDA Version Match: ✓
CUDA Extension: ✓
