Why does distributed training get stuck here and not move? #15266

When I switched the communication backend from 'nccl' to 'gloo', it worked. I don't know what the underlying problem is, but I hope this helps you.

        from pytorch_lightning import Trainer
        from pytorch_lightning.strategies import DDPStrategy

        # Use the Gloo backend instead of NCCL for DDP communication.
        ddp = DDPStrategy(process_group_backend='gloo')
        trainer = Trainer(devices="auto", accelerator="auto", strategy=ddp,
                          logger=tb_logger, log_every_n_steps=50,
                          flush_logs_every_n_steps=50,
                          callbacks=[checkpoint_callback, early_stop_callback],
                          max_epochs=args.epochs)
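As a rough rule of thumb (not stated in this thread): NCCL is the usual backend when every rank has a CUDA device, while Gloo also runs on CPU and is a common fallback when NCCL initialization hangs. A minimal sketch of that decision — `pick_backend` is a hypothetical helper, not a Lightning or PyTorch API:

```python
def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Choose a torch.distributed backend name.

    Prefer NCCL when CUDA and the NCCL library are both available;
    otherwise fall back to Gloo, which also works on CPU-only hosts.
    """
    if cuda_available and nccl_available:
        return "nccl"
    return "gloo"


if __name__ == "__main__":
    # In real code the flags would come from torch.cuda.is_available()
    # and torch.distributed.is_nccl_available().
    print(pick_backend(cuda_available=True, nccl_available=True))
    print(pick_backend(cuda_available=False, nccl_available=False))
```

If you would rather keep NCCL and diagnose the hang instead, running the job with the environment variable `NCCL_DEBUG=INFO` set makes NCCL log its initialization steps, which often shows where communication stalls.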

Answer selected by Struggle-Forever