Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait. #12422

@hhy5861

Description

Describe the bug

Training keeps getting stuck and I don't know what the problem is. The setup is a single node with 8× H200 GPUs. As the logs show, the hang starts right after a DeepSpeed checkpoint save begins; the pending all-reduce then hits the 600000 ms NCCL timeout and every rank is taken down.

Reproduction

accelerate launch --config_file "./accelerate_config.yaml" \
    --use_deepspeed \
    --deepspeed_config_file "./deepspeed_config.json" \
    "diffusers/examples/dreambooth/train_dreambooth_flux.py" \
    --pretrained_model_name_or_path /data/training-flux/models/FLUX.1-dev \
    --instance_data_dir /data/training-flux/dataset/cleaned_2048/images \
    --output_dir /data/training-flux/out/flux-full-1536 \
    --resolution 1536 --train_batch_size 4 \
    --gradient_accumulation_steps 2 --learning_rate 4.5e-5 \
    --lr_scheduler cosine --lr_warmup_steps 1000 --max_train_steps 30000 \
    --mixed_precision bf16 --gradient_checkpointing \
    --dataloader_num_workers 48 --validation_prompt TOK \
    --validation_epochs 1000 --instance_prompt TOK \
    --logging_dir /data/training-flux/logs/flux-full-1536 \
    --checkpointing_steps 1000
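
If the stall is simply a slow but legitimate step (for example a long DeepSpeed checkpoint save), one workaround sketch is to raise the process-group timeout above the default 600000 ms via Accelerate's InitProcessGroupKwargs. This assumes editing the Accelerator construction inside train_dreambooth_flux.py, and whether the timeout propagates on the DeepSpeed launch path may depend on the accelerate version, so treat it as a diagnostic sketch rather than a confirmed fix:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Sketch only: raise the default 10-minute collective timeout so a slow
# step does not trip the NCCL watchdog. The other Accelerator arguments
# used by the training script are omitted here.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(minutes=60))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])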

Logs

10/02/2025 18:06:38 - INFO - accelerate.accelerator - Saving current state to /data/training-flux/out/flux-full-1536/checkpoint-10
10/02/2025 18:06:38 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[rank1]:[E1002 18:16:40.662329700 ProcessGroupNCCL.cpp:632] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank1]:[E1002 18:16:40.662691402 ProcessGroupNCCL.cpp:756] [Rank 1] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank7]:[E1002 18:16:40.663099355 ProcessGroupNCCL.cpp:632] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank7]:[E1002 18:16:40.663182067 ProcessGroupNCCL.cpp:756] [Rank 7] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank5]:[E1002 18:16:40.692598448 ProcessGroupNCCL.cpp:632] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank5]:[E1002 18:16:40.692672352 ProcessGroupNCCL.cpp:756] [Rank 5] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank2]:[E1002 18:16:40.697781061 ProcessGroupNCCL.cpp:632] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank2]:[E1002 18:16:40.697884496 ProcessGroupNCCL.cpp:756] [Rank 2] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank6]:[E1002 18:16:40.705042541 ProcessGroupNCCL.cpp:632] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank6]:[E1002 18:16:40.705120576 ProcessGroupNCCL.cpp:756] [Rank 6] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank3]:[E1002 18:16:40.711664678 ProcessGroupNCCL.cpp:632] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank3]:[E1002 18:16:40.711756918 ProcessGroupNCCL.cpp:756] [Rank 3] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank4]:[E1002 18:16:40.716104108 ProcessGroupNCCL.cpp:632] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank4]:[E1002 18:16:40.716179665 ProcessGroupNCCL.cpp:756] [Rank 4] Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
[rank3]:[E1002 18:16:41.861574428 ProcessGroupNCCL.cpp:684] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E1002 18:16:41.861591468 ProcessGroupNCCL.cpp:698] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E1002 18:16:41.916781679 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1002 18:16:41.916802393 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E1002 18:16:41.475392232 ProcessGroupNCCL.cpp:684] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E1002 18:16:41.475410209 ProcessGroupNCCL.cpp:698] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E1002 18:16:41.477033790 ProcessGroupNCCL.cpp:684] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E1002 18:16:41.477051831 ProcessGroupNCCL.cpp:698] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E1002 18:16:41.488494157 ProcessGroupNCCL.cpp:684] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E1002 18:16:41.488510571 ProcessGroupNCCL.cpp:698] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E1002 18:16:41.500168421 ProcessGroupNCCL.cpp:684] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E1002 18:16:41.500187510 ProcessGroupNCCL.cpp:698] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1002 18:16:41.516382174 ProcessGroupNCCL.cpp:684] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1002 18:16:41.516400948 ProcessGroupNCCL.cpp:698] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/training-flux/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 1913, in <module>
[rank3]:     main(args)
[rank3]:   File "/data/training-flux/diffusers/examples/dreambooth/train_dreambooth_flux.py", line 1750, in main
[rank3]:     accelerator.backward(loss)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2726, in backward
[rank3]:     self.deepspeed_engine_wrapped.backward(loss, sync_gradients=self.sync_gradients, **kwargs)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 270, in backward
[rank3]:     self.engine.backward(loss, **kwargs)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2323, in backward
[rank3]:     self._do_optimizer_backward(loss, retain_graph)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2269, in _do_optimizer_backward
[rank3]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2246, in backward
[rank3]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 65, in backward
[rank3]:     scaled_loss.backward(retain_graph=retain_graph)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/torch/_tensor.py", line 648, in backward
[rank3]:     torch.autograd.backward(
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 353, in backward
[rank3]:     _engine_run_backward(
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
[rank3]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 985, in grad_handling_hook
[rank3]:     self.process_gradients(param, i)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1524, in process_gradients
[rank3]:     self.reduce_ready_partitions_and_remove_grads(param, i)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank3]:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1007, in reduce_independent_p_g_buckets_and_remove_grads
[rank3]:     self.reduce_ipg_grads()
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1476, in reduce_ipg_grads
[rank3]:     self.average_tensor(bucket.buffer[bucket.index].narrow(0, 0, bucket.elements), comm_dtype)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1224, in average_tensor
[rank3]:     self.allreduce_and_scatter(buckets[bucket_key],
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1129, in allreduce_and_scatter
[rank3]:     self.allreduce_and_copy_with_multiple_ranks(small_bucket,
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1091, in allreduce_and_copy_with_multiple_ranks
[rank3]:     allreduced = self.allreduce_bucket(small_bucket,
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1617, in allreduce_bucket
[rank3]:     dist.all_reduce(tensor_to_allreduce, group=process_group)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 118, in log_wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 654, in all_reduce
[rank3]:     return cdb.all_reduce(tensor, op, group, async_op)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 167, in all_reduce
[rank3]:     return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]:     return func(*args, **kwargs)
[rank3]:   File "/data/training-flux/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2815, in all_reduce
[rank3]:     work.wait()
[rank3]: torch.distributed.DistBackendError: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
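
For scale, a back-of-the-envelope calculation on the numbers in the log (my own arithmetic, not from the report): the timed-out all-reduce moves roughly 0.3 GiB per rank in bf16, which should finish in well under a second over NVLink inside one H200 node, so a 600-second timeout points to at least one rank never entering the collective (consistent with the hang starting around checkpoint saving) rather than to a slow link.

# Rough size of the timed-out all-reduce, using values from the log above.
numel = 160_672_064            # NumelIn / NumelOut from WorkNCCL
bytes_per_elem = 2             # bf16
msg_bytes = numel * bytes_per_elem

print(f"{msg_bytes / 2**30:.2f} GiB per rank")        # ~0.30 GiB
# Even at a very conservative 10 GB/s effective bandwidth this is ~0.03 s,
# nowhere near the 600 s Timeout(ms)=600000 in the error message.
print(f"{msg_bytes / 10e9:.3f} s at 10 GB/s")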

System Info

nvidia-smi:
NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8

nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Operating system (/etc/os-release):

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
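
The report does not include the PyTorch, NCCL, Accelerate, DeepSpeed, or diffusers versions; a small snippet along these lines would capture them for the maintainers (assuming all packages live in the same venv used for training):

import importlib.metadata as md

import torch

# Print the library versions relevant to this NCCL timeout report.
print("torch      ", torch.__version__)
print("torch CUDA ", torch.version.cuda)
print("NCCL       ", ".".join(str(v) for v in torch.cuda.nccl.version()))
for pkg in ("accelerate", "deepspeed", "diffusers"):
    print(f"{pkg:<11}", md.version(pkg))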

deepspeed_config.json (referenced by --deepspeed_config_file in the launch command above):

{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 2,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 200000000,
    "allgather_bucket_size": 200000000
  },
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "gradient_clipping": 1.0
}
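
The timed-out collective (160,672,064 elements) is just under the configured reduce_bucket_size of 200,000,000, i.e. a single ZeRO stage-2 gradient bucket. A common diagnostic variant, offered here as an assumption rather than a known fix, is to disable overlap_comm and shrink the buckets so that, if the hang is communication-related, it surfaces sooner and closer to its source; a sketch that writes such a variant config:

import json

# Diagnostic variant of the DeepSpeed config above: identical settings,
# but with comm/compute overlap disabled and smaller ZeRO-2 buckets.
# The output file name is illustrative.
variant = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,              # rule out overlapped reduction streams
        "contiguous_gradients": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 50_000_000,   # smaller buckets -> smaller collectives
        "allgather_bucket_size": 50_000_000,
    },
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
    "gradient_clipping": 1.0,
}

with open("deepspeed_config_debug.json", "w") as f:
    json.dump(variant, f, indent=2)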
