[Train] Increase scalability of XGBoostTrainer through external memory datasets #55550
Conversation
Signed-off-by: soffer-anyscale <[email protected]>
Code Review
This pull request introduces a significant and valuable feature to `XGBoostTrainer`, enabling it to handle datasets larger than memory by leveraging XGBoost's external memory capabilities. The implementation is well-structured, with new utility modules for system detection, parameter optimization, and data iteration. The high-level APIs in `train_loop_utils.py` make this advanced feature very easy to use.

My review includes a few suggestions to improve robustness and clarity. I've pointed out a potential memory issue in the custom iterator, suggested improvements to error handling and docstrings, and noted some minor code quality issues like unused imports. Overall, this is a great contribution that significantly enhances the scalability of XGBoost training in Ray.
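For context, XGBoost's external memory interface is driven by a `xgboost.DataIter` subclass that feeds data to the library batch-by-batch. Below is a minimal sketch (not the PR's actual code) of how such a custom iterator could stream batches from a Ray Dataset shard; the class name `RayDatasetIter` and parameters like `label_column` are illustrative assumptions.

```python
import xgboost


class RayDatasetIter(xgboost.DataIter):
    """Streams a Ray Dataset shard into XGBoost batch-by-batch."""

    def __init__(self, dataset_shard, label_column, cache_prefix):
        # `cache_prefix` is where XGBoost spills its external-memory cache.
        super().__init__(cache_prefix=cache_prefix)
        self._shard = dataset_shard
        self._label_column = label_column
        self._batches = None

    def next(self, input_data) -> bool:
        # XGBoost calls this repeatedly; return False when the pass is done.
        if self._batches is None:
            self._batches = iter(self._shard.iter_batches(batch_format="pandas"))
        try:
            batch = next(self._batches)
        except StopIteration:
            return False
        label = batch.pop(self._label_column)
        input_data(data=batch, label=label)
        return True

    def reset(self) -> None:
        # Called by XGBoost at the end of each pass over the data.
        self._batches = None
```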
Why are these changes needed?
The current `XGBoostTrainer` is limited in scalability because it must materialize the entire dataset in memory before training. This PR adds support for XGBoost's external memory feature, allowing the trainer to iterate over datasets larger than available memory.
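As a rough illustration of the intended workflow (the PR's actual entry points in `train_loop_utils.py` may differ), a worker's training loop could hand an iterator like the one sketched above to a `DMatrix`; passing a `DataIter` instead of an in-memory array puts the `DMatrix` into external-memory mode, so XGBoost streams batches and spills to an on-disk cache rather than holding the full dataset in RAM:

```python
import xgboost
import ray.train


def train_loop_per_worker(config):
    # Each worker streams only its own shard; nothing is fully materialized.
    shard = ray.train.get_dataset_shard("train")
    it = RayDatasetIter(  # illustrative iterator from the sketch above
        shard, label_column="target", cache_prefix="/tmp/xgb_cache"
    )
    # A DMatrix built from a DataIter reads data incrementally.
    dtrain = xgboost.DMatrix(it)
    booster = xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
    ray.train.report({"training_finished": True})
```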
Related issue number
Checks
- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.