[Train] Increase scalability of XGBoostTrainer through external memory datasets #55550
Conversation
Signed-off-by: soffer-anyscale <[email protected]>
Code Review
This pull request introduces a significant and valuable feature to `XGBoostTrainer`, enabling it to handle datasets larger than memory by leveraging XGBoost's external memory capabilities. The implementation is well-structured, with new utility modules for system detection, parameter optimization, and data iteration. The high-level APIs in `train_loop_utils.py` make this advanced feature very easy to use.

My review includes a few suggestions to improve robustness and clarity. I've pointed out a potential memory issue in the custom iterator, suggested improvements to error handling and docstrings, and noted some minor code quality issues like unused imports. Overall, this is a great contribution that significantly enhances the scalability of XGBoost training in Ray.
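For context, XGBoost's external memory interface is driven by a `xgboost.DataIter` subclass that feeds data to the library batch-by-batch. Below is a minimal sketch (not the PR's actual code) of how such a custom iterator could stream batches from a Ray Dataset shard; the class name `RayDatasetIter` and parameters like `label_column` are illustrative assumptions.

```python
import xgboost


class RayDatasetIter(xgboost.DataIter):
    """Streams a Ray Dataset shard into XGBoost batch-by-batch."""

    def __init__(self, dataset_shard, label_column, cache_prefix):
        # `cache_prefix` is where XGBoost spills its external-memory cache.
        super().__init__(cache_prefix=cache_prefix)
        self._shard = dataset_shard
        self._label_column = label_column
        self._batches = None

    def next(self, input_data) -> bool:
        # XGBoost calls this repeatedly; return False when the pass is done.
        if self._batches is None:
            self._batches = iter(self._shard.iter_batches(batch_format="pandas"))
        try:
            batch = next(self._batches)
        except StopIteration:
            return False
        label = batch.pop(self._label_column)
        input_data(data=batch, label=label)
        return True

    def reset(self) -> None:
        # Called by XGBoost at the end of each pass over the data.
        self._batches = None
```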
Why are these changes needed?
The current `XGBoostTrainer` is limited in scalability because it must materialize the entire dataset in memory before training. This PR adds support for XGBoost's external memory feature, allowing the trainer to iterate over datasets larger than available memory.
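As a rough illustration of the intended workflow (the PR's actual entry points in `train_loop_utils.py` may differ), a worker's training loop could hand an iterator like the one sketched above to a `DMatrix`; passing a `DataIter` instead of an in-memory array puts the `DMatrix` into external-memory mode, so XGBoost streams batches and spills to an on-disk cache rather than holding the full dataset in RAM:

```python
import xgboost
import ray.train


def train_loop_per_worker(config):
    # Each worker streams only its own shard; nothing is fully materialized.
    shard = ray.train.get_dataset_shard("train")
    it = RayDatasetIter(  # illustrative iterator from the sketch above
        shard, label_column="target", cache_prefix="/tmp/xgb_cache"
    )
    # A DMatrix built from a DataIter reads data incrementally.
    dtrain = xgboost.DMatrix(it)
    booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
    ray.train.report({"training_finished": True})
```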
Related issue number
Checks
- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.