Distributed training is stuck #21184

syyxsxx · 2024-05-11T03:39:41Z

I use two 4090 hosts for data-parallel distributed training with jax.distributed, initialized like this:
jax.distributed.initialize(coordinator_address="[ip]:[port]",
                           num_processes=2,
                           process_id=[index])
Training gets stuck when it reaches the all_reduce ops.

How can I debug this problem?
Are there any examples of parallel distributed training?
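
For reference, here is a minimal sketch of the kind of standalone check that might help isolate the hang: it initializes jax.distributed and then runs a single all-reduce (psum) across all devices, with no training code involved. The environment variable names (JAX_DEMO_COORDINATOR, JAX_DEMO_NUM_PROCESSES, JAX_DEMO_PROCESS_ID) and the helper function names are hypothetical, chosen only for this illustration. When the variables are unset, the script skips initialization and runs as a single process.

```python
import os

import jax
import jax.numpy as jnp


def maybe_initialize():
    """Initialize jax.distributed only if the (hypothetical) env vars are set."""
    # Hypothetical variable names for this sketch; value would be "[ip]:[port]".
    addr = os.environ.get("JAX_DEMO_COORDINATOR")
    if addr is None:
        return False  # single-process run, nothing to initialize
    jax.distributed.initialize(
        coordinator_address=addr,
        num_processes=int(os.environ["JAX_DEMO_NUM_PROCESSES"]),
        process_id=int(os.environ["JAX_DEMO_PROCESS_ID"]),
    )
    return True


def all_reduce_check():
    """Run one psum over every device; each entry should equal jax.device_count()."""
    xs = jnp.ones(jax.local_device_count())
    return jax.pmap(lambda x: jax.lax.psum(x, "i"), axis_name="i")(xs)


if __name__ == "__main__":
    maybe_initialize()
    # If device_count() does not exceed local_device_count() on a two-host run,
    # the processes have not joined the same cluster.
    print("process", jax.process_index(),
          "local devices", jax.local_device_count(),
          "global devices", jax.device_count())
    print(all_reduce_check())
```

The same script would be started on both hosts with the same coordinator address and different process IDs. If this small check also hangs, the problem is probably in the cluster or network setup rather than in the training code. Setting NCCL_DEBUG=INFO in the environment makes NCCL log its communicator setup, which may show where the collective stalls.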

Replies: 0 comments
