@SeasonPilot
#11

Implements Ray Train support for scalable, fault-tolerant distributed training while maintaining 100% backward compatibility with the existing Accelerate workflow.

Core Changes:

  • Add RayTrainConfig configuration class with YAML loading support
  • Implement DistributedContext abstraction layer for dual backend support
  • Create ray_train.py main training script with RayF2LLMTrainer wrapper
  • Add run_ray_train.py launcher script with CLI interface
  • Update model.py device handling for framework compatibility
  • Add unit tests for DistributedContext with both backends
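The `DistributedContext` abstraction in the Core Changes above could look roughly like the following. This is a minimal sketch, assuming a simple environment-variable heuristic for backend detection and lazy backend imports; the PR's actual internals (beyond the `gather`/`sync`/`prepare`/`unwrap` operations it names) may differ.

```python
# Hypothetical sketch of DistributedContext; internals are assumptions.
import os


class DistributedContext:
    """Single facade over the Ray Train and Accelerate backends."""

    def __init__(self, backend: str = "auto"):
        if backend == "auto":
            # Heuristic: Ray Train workers run with Ray env vars set.
            backend = "ray" if "RAY_ADDRESS" in os.environ else "accelerate"
        if backend not in ("ray", "accelerate"):
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend

    def prepare(self, model, optimizer):
        """Wrap model/optimizer for the active backend. Imports are
        deferred so the unused backend stays an optional dependency."""
        if self.backend == "ray":
            import ray.train.torch as ray_torch
            return ray_torch.prepare_model(model), optimizer
        from accelerate import Accelerator
        return Accelerator().prepare(model, optimizer)
```

Deferring the backend imports into `prepare` keeps either framework optional at import time, which matters for the backward-compatibility goal.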

Features:

  • Seamless scaling from single-node to multi-node clusters
  • Automatic fault tolerance and recovery mechanisms
  • DeepSpeed ZeRO-2 integration for memory optimization
  • HuggingFace checkpoint format compatibility
  • Unified distributed operations API
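The scaling and fault-tolerance features above would typically be driven by the `RayTrainConfig` class mentioned in Core Changes. A hypothetical sketch with YAML loading via pyyaml (a listed dependency); the field names here are illustrative, not the PR's exact schema:

```python
# Hypothetical RayTrainConfig; field names are illustrative only.
from dataclasses import dataclass


@dataclass
class RayTrainConfig:
    num_workers: int = 1
    use_gpu: bool = True
    max_failures: int = 3            # worker restarts before giving up
    storage_path: str = "/tmp/ray_results"

    @classmethod
    def from_yaml(cls, path: str) -> "RayTrainConfig":
        import yaml  # pyyaml is a dependency introduced by this PR
        with open(path) as f:
            data = yaml.safe_load(f) or {}
        return cls(**data)
```

Unknown YAML keys would raise a `TypeError` from the dataclass constructor, which gives early feedback on config typos.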

Technical Details:

  • Backend auto-detection (Ray Train vs Accelerate)
  • PyTorch DDP with NCCL backend for GPU communication
  • Flash Attention 2 support
  • Multi-dataset weighted sampling (50+ datasets)
  • TensorBoard logging integration
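The multi-dataset weighted sampling noted above can be sketched as follows. This assumes datasets are picked per step in proportion to configured weights and dropped once exhausted; the PR's exact sampling scheme is not specified here.

```python
# Sketch of weighted sampling across many datasets (scheme is an assumption).
import random


def weighted_dataset_sampler(datasets, weights, seed=0):
    """Yield (name, example) pairs, choosing a source dataset per step
    with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(datasets)
    iters = {n: iter(datasets[n]) for n in names}
    w = [weights[n] for n in names]
    while iters:
        name = rng.choices(names, weights=w, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            # Dataset exhausted: remove it from the rotation.
            i = names.index(name)
            del names[i]
            del w[i]
            del iters[name]
```

With 50+ datasets, the per-step `rng.choices` call stays O(number of live datasets), which is negligible next to a training step.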

Testing:

  • Unit tests for DistributedContext dual backend functionality
  • Test coverage for gather, sync, prepare, unwrap operations
  • Mock-based testing for both Ray and Accelerate backends
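An illustrative mock-based unit test in the style described above, using `unittest.mock` to stand in for either backend; the function and test names are assumptions, not the PR's actual test code.

```python
# Mock-based test sketch: verify a helper delegates gather to the backend.
from unittest import mock


def gather_metric(backend, value):
    """Gather a scalar from every rank via the backend and average it."""
    vals = backend.gather(value)
    return sum(vals) / len(vals)


def test_gather_metric_with_mocked_backend():
    # Mock stands in for either the Ray or the Accelerate backend.
    backend = mock.Mock()
    backend.gather.return_value = [1.0, 2.0, 3.0, 4.0]
    assert gather_metric(backend, 1.0) == 2.5
    backend.gather.assert_called_once_with(1.0)
```

Because the test only exercises the backend interface, the same test body covers both backends without needing a GPU or a Ray cluster.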

Dependencies:

  • ray[train]>=2.30.0
  • pyyaml>=6.0
  • transformers>=4.51.0
