feat(train): add auto-scaling option for Multi-GPU Training #2254
base: main
Conversation
- Add `auto_scale` flag to `TrainPipelineConfig`
- Scale optimizer LR by world size and divide total steps accordingly when using Accelerate with multiple GPUs
- Apply scaling before optimizer/scheduler creation and before logging the config
- Update multi-GPU docs with `--auto_scale=true` usage and explanation
- Add multi-GPU test for auto-scale behavior
- Use `--auto_scale=true` consistently in examples and text
- Add note on checkpoint/eval cadence: suggest optionally scaling `save_freq`/`eval_freq` by world size; flagged as pending maintainer decision
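To make the first change above concrete, here is a minimal sketch of what the new flag could look like on the training config. Only the `auto_scale` field name, its default, and the `TrainPipelineConfig` class name come from this PR; the surrounding field and values are illustrative, and the real class has many more options.

```python
from dataclasses import dataclass


@dataclass
class TrainPipelineConfig:
    # ... existing training options elided ...
    steps: int = 100_000  # illustrative default, not taken from this PR
    # When True and training runs under Accelerate with more than one process,
    # the optimizer LR is multiplied by the world size and `steps` is divided by it.
    auto_scale: bool = False
```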
Pull Request Overview
This PR introduces an auto-scaling feature for multi-GPU training that automatically adjusts learning rates and training steps when using multiple processes. When enabled via `--auto_scale=true`, the system multiplies the learning rate by the number of GPUs and divides training steps proportionally to maintain a consistent total sample count and comparable training dynamics.
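A minimal sketch of that scaling step, assuming a helper applied before the optimizer and scheduler are built; the attribute names `cfg.policy.optimizer_lr` and `cfg.steps` are assumptions for illustration, not copied from the PR:

```python
def apply_auto_scale(cfg, num_processes: int) -> None:
    """Scale the LR up and the step budget down by the number of processes."""
    if not getattr(cfg, "auto_scale", False) or num_processes <= 1:
        return  # single-process runs are left untouched
    cfg.policy.optimizer_lr *= num_processes        # larger effective batch -> larger LR
    cfg.steps = max(1, cfg.steps // num_processes)  # keep total samples roughly constant
```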
Key changes:
- Added `auto_scale` boolean flag to the training configuration, with logic to scale LR and steps based on world size
- Created a test to verify auto-scaling behavior with multi-GPU setups
- Updated documentation to explain the new auto-scaling feature and its usage
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/lerobot/configs/train.py | Adds auto_scale configuration field with documentation |
| src/lerobot/scripts/lerobot_train.py | Implements auto-scaling logic for learning rates and training steps |
| tests/training/test_auto_scale.py | Adds test case for auto-scaling with 2 GPUs |
| docs/source/multi_gpu_training.mdx | Updates documentation to describe auto-scaling feature and usage |
Co-authored-by: Copilot <[email protected]> Signed-off-by: Hakjin Lee <[email protected]>
Hi, thank you for this PR, very nice addition. I am very busy this week, but I will take a look ASAP next week!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What this does
This PR adds an opt-in auto-scaling feature for distributed training with Accelerate. When enabled via `--auto_scale=true` and running with multiple processes (GPUs), LeRobot:
- multiplies the optimizer learning rate by the number of processes (world size), and
- divides the total number of training steps by the world size.

This keeps the total sample count roughly comparable and preserves training dynamics while improving throughput.
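As a concrete illustration of the arithmetic (the numbers below are made up for the example, not defaults from the repo):

```python
world_size = 4                           # e.g. launched with 4 processes
base_lr, base_steps = 1e-4, 100_000
scaled_lr = base_lr * world_size         # 4e-4
scaled_steps = base_steps // world_size  # 25_000 optimizer steps
# Each step now consumes world_size mini-batches in parallel, so the total number
# of samples seen stays roughly the same as in the single-GPU run.
```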
How it was tested
Examples:
Tested
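As a rough, self-contained illustration of the arithmetic the multi-GPU test is meant to verify (the helper function and values below are hypothetical and do not mirror `tests/training/test_auto_scale.py`, which launches the real training entry point on 2 GPUs):

```python
def scale_for_world_size(lr: float, steps: int, world_size: int) -> tuple[float, int]:
    """Reference arithmetic: LR scales up, steps scale down with world size."""
    if world_size <= 1:
        return lr, steps
    return lr * world_size, max(1, steps // world_size)


def test_auto_scale_two_gpus():
    lr, steps = scale_for_world_size(lr=1e-4, steps=100_000, world_size=2)
    assert lr == 2e-4
    assert steps == 50_000
```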