Conversation

@SrijanUpadhyay

This commit adds configuration files and a setup script to resolve NCCL timeout issues during DeepSpeed ZeRO-2 training on H200 GPUs. The changes include:

  • Extended NCCL and DeepSpeed timeouts
  • Optimized bucket sizes for gradient communication
  • CPU and dataloader optimizations
  • System shared memory improvements
  • Enhanced debugging capabilities (a sketch of these settings follows this list)
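
As an illustration of the environment-level changes above, here is a minimal sketch of what setup_training_env.sh might contain. The variables and values are assumptions chosen for demonstration, not the PR's actual contents:

```bash
#!/usr/bin/env bash
# Sketch of an NCCL/dataloader environment setup for multi-GPU training.
# All values are illustrative; tune them for your cluster.

# Verbose NCCL logging for diagnosing hangs and timeouts.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Surface NCCL errors promptly instead of hanging in the watchdog.
export NCCL_ASYNC_ERROR_HANDLING=1

# Raise the InfiniBand retry timeout (an exponent; valid range is 1-22).
export NCCL_IB_TIMEOUT=22

# Avoid CPU oversubscription from dataloader workers and BLAS threads.
export OMP_NUM_THREADS=8

# Shared memory for dataloader workers: remount /dev/shm larger (needs root);
# inside Docker, pass --shm-size to the container instead.
# mount -o remount,size=64G /dev/shm
```

Note that the Timeout(ms)=600000 shown in the linked issue is the torch.distributed process-group timeout set at initialization, not an NCCL environment variable; it is typically raised in code, for example via the timeout argument of init_process_group or Accelerate's InitProcessGroupKwargs.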

The implementation provides three files (the two configuration files are sketched after the list):

  1. DeepSpeed ZeRO-2 configuration (ds_config_zero2.json)
  2. Environment setup script (setup_training_env.sh)
  3. Accelerate configuration (accelerate_config.yaml)
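
A plausible shape for ds_config_zero2.json, assembled from DeepSpeed's documented ZeRO stage-2 options. The bucket sizes and the "auto" batch fields (which assume launch through the Hugging Face integration) are illustrative assumptions, not values taken from this PR:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_scatter": true,
    "allgather_partitions": true,
    "reduce_bucket_size": 2e8,
    "allgather_bucket_size": 2e8
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": false
}
```

Shrinking the bucket sizes splits gradient communication into smaller all-reduce calls, so no single collective approaches the watchdog limit; the trade-off is somewhat lower communication efficiency.

And a hypothetical accelerate_config.yaml wiring Accelerate to that DeepSpeed config (key names follow Accelerate's config schema; the machine and process counts are placeholders):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: ds_config_zero2.json
  zero3_init_flag: false
machine_rank: 0
num_machines: 1
num_processes: 8        # one per GPU; placeholder
mixed_precision: bf16
main_training_function: main
```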

These changes improve training stability on H200 GPUs with high-resolution data and aggressive configurations.
Fixes #12422

Development

Successfully merging this pull request may close these issues.

Work WorkNCCL(SeqNum=2521, OpType=ALLREDUCE, NumelIn=160672064, NumelOut=160672064, Timeout(ms)=600000) timed out in blocking wait.
