@SeasonPilot
#11

Implements Ray Train support for scalable, fault-tolerant distributed training while maintaining 100% backward compatibility with the existing Accelerate workflow.

Core Changes:

  • Add RayTrainConfig configuration class with YAML loading support
  • Implement DistributedContext abstraction layer for dual backend support
  • Create ray_train.py main training script with RayF2LLMTrainer wrapper
  • Add run_ray_train.py launcher script with CLI interface
  • Update model.py device handling for framework compatibility
  • Add unit tests for DistributedContext with both backends
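The `DistributedContext` abstraction in the Core Changes above could look roughly like the following. This is a minimal sketch, assuming a simple environment-variable heuristic for backend detection and lazy backend imports; the PR's actual internals (beyond the `gather`/`sync`/`prepare`/`unwrap` operations it names) may differ.

```python
# Hypothetical sketch of DistributedContext; internals are assumptions.
import os


class DistributedContext:
    """Single facade over the Ray Train and Accelerate backends."""

    def __init__(self, backend: str = "auto"):
        if backend == "auto":
            # Heuristic: Ray Train workers run with Ray env vars set.
            backend = "ray" if "RAY_ADDRESS" in os.environ else "accelerate"
        if backend not in ("ray", "accelerate"):
            raise ValueError(f"unknown backend: {backend}")
        self.backend = backend

    def prepare(self, model, optimizer):
        """Wrap model/optimizer for the active backend. Imports are
        deferred so the unused backend stays an optional dependency."""
        if self.backend == "ray":
            import ray.train.torch as ray_torch
            return ray_torch.prepare_model(model), optimizer
        from accelerate import Accelerator
        return Accelerator().prepare(model, optimizer)
```

Deferring the backend imports into `prepare` keeps either framework optional at import time, which matters for the backward-compatibility goal.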

Features:

  • Seamless scaling from single-node to multi-node clusters
  • Automatic fault tolerance and recovery mechanisms
  • DeepSpeed ZeRO-2 integration for memory optimization
  • HuggingFace checkpoint format compatibility
  • Unified distributed operations API
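The scaling and fault-tolerance features above would typically be driven by the `RayTrainConfig` class mentioned in Core Changes. A hypothetical sketch with YAML loading via pyyaml (a listed dependency); the field names here are illustrative, not the PR's exact schema:

```python
# Hypothetical RayTrainConfig; field names are illustrative only.
from dataclasses import dataclass


@dataclass
class RayTrainConfig:
    num_workers: int = 1
    use_gpu: bool = True
    max_failures: int = 3            # worker restarts before giving up
    storage_path: str = "/tmp/ray_results"

    @classmethod
    def from_yaml(cls, path: str) -> "RayTrainConfig":
        import yaml  # pyyaml is a dependency introduced by this PR
        with open(path) as f:
            data = yaml.safe_load(f) or {}
        return cls(**data)
```

Unknown YAML keys would raise a `TypeError` from the dataclass constructor, which gives early feedback on config typos.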

Technical Details:

  • Backend auto-detection (Ray Train vs Accelerate)
  • PyTorch DDP with NCCL backend for GPU communication
  • Flash Attention 2 support
  • Multi-dataset weighted sampling (50+ datasets)
  • TensorBoard logging integration
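The multi-dataset weighted sampling noted above can be sketched as follows. This assumes datasets are picked per step in proportion to configured weights and dropped once exhausted; the PR's exact sampling scheme is not specified here.

```python
# Sketch of weighted sampling across many datasets (scheme is an assumption).
import random


def weighted_dataset_sampler(datasets, weights, seed=0):
    """Yield (name, example) pairs, choosing a source dataset per step
    with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(datasets)
    iters = {n: iter(datasets[n]) for n in names}
    w = [weights[n] for n in names]
    while iters:
        name = rng.choices(names, weights=w, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            # Dataset exhausted: remove it from the rotation.
            i = names.index(name)
            del names[i]
            del w[i]
            del iters[name]
```

With 50+ datasets, the per-step `rng.choices` call stays O(number of live datasets), which is negligible next to a training step.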

Testing:

  • Unit tests for DistributedContext dual backend functionality
  • Test coverage for gather, sync, prepare, unwrap operations
  • Mock-based testing for both Ray and Accelerate backends
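An illustrative mock-based unit test in the style described above, using `unittest.mock` to stand in for either backend; the function and test names are assumptions, not the PR's actual test code.

```python
# Mock-based test sketch: verify a helper delegates gather to the backend.
from unittest import mock


def gather_metric(backend, value):
    """Gather a scalar from every rank via the backend and average it."""
    vals = backend.gather(value)
    return sum(vals) / len(vals)


def test_gather_metric_with_mocked_backend():
    # Mock stands in for either the Ray or the Accelerate backend.
    backend = mock.Mock()
    backend.gather.return_value = [1.0, 2.0, 3.0, 4.0]
    assert gather_metric(backend, 1.0) == 2.5
    backend.gather.assert_called_once_with(1.0)
```

Because the test only exercises the backend interface, the same test body covers both backends without needing a GPU or a Ray cluster.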

Dependencies:

  • ray[train]>=2.30.0
  • pyyaml>=6.0
  • transformers>=4.51.0
