feat(docs): KEP-2598 XGBoost Runtime for Trainer V2 #3118

Krishna-kg732 wants to merge 6 commits into kubeflow:master
Conversation
Pull request overview
This PR introduces a Kubeflow Enhancement Proposal (KEP-2598) for adding XGBoost Runtime support to Kubeflow Trainer V2. The proposal enables declarative distributed XGBoost training on Kubernetes using Rabit-based coordination, eliminating the need for manual environment variable configuration.
Changes:
- Adds comprehensive KEP documentation proposing XGBoost runtime integration with Trainer V2
- Proposes a new `XGBoostMLPolicySource` API addition to the existing MLPolicy framework (a sketch follows below)
- Defines the implementation approach for an XGBoost plugin with Rabit environment variable injection
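To make the proposed API shape concrete, here is a minimal Go sketch of how `XGBoostMLPolicySource` could hang off the existing `MLPolicySource`. The field name, JSON tag, and struct contents are assumptions for illustration, not the final API.

```go
// Sketch only: names and tags are assumptions based on the KEP text.
package trainer

// MLPolicySource selects the framework-specific policy (Torch, MPI, ...).
// Existing fields are elided; the KEP proposes adding an XGBoost variant.
type MLPolicySource struct {
	// XGBoost configures Rabit-based distributed XGBoost training.
	// +optional
	XGBoost *XGBoostMLPolicySource `json:"xgboost,omitempty"`
}

// XGBoostMLPolicySource holds XGBoost-specific settings. Per the review
// discussion below, workers per node are derived from container resources
// rather than exposed as an API field in the initial implementation.
type XGBoostMLPolicySource struct{}
```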
Pull Request Test Coverage Report for Build 21506714809

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
andreyvelich left a comment
Thanks @Krishna-kg732!
Overall looks good, I left a few suggestions.
Appreciate your work to add native support for XGBoost on Kubernetes!
/cc @tenzen-y @astefanutti @akshaychitneni @nqvuong1998 @siyuanfoundation @trivialfis
| Variable | Description | Example |
|---|---|---|
| `DMLC_TRACKER_URI` | Address of rank-0 pod (Rabit tracker) | `myjob-node-0-0.myjob` |
| `DMLC_TRACKER_PORT` | Tracker port | `9091` |
| `DMLC_TASK_ID` | Worker rank | `0`, `1`, `2`... |
| `DMLC_NUM_WORKER` | Total worker count | `4` |
Please explain how you are going to fill the DMLC_NUM_WORKER variable.
What if we have multiple GPUs/CPUs on a single node (e.g. multi-node, multi-gpu TrainJob)?
DMLC_NUM_WORKER = numNodes from TrainJob (1 worker per pod).
Multi-GPU: XGBoost's single process can use all GPUs on a node directly—no need for extra workers. Just request multiple GPUs via resourcesPerNode.
Will add a "Parallelism Model" section clarifying this
> Multi-GPU: XGBoost's single process can use all GPUs on a node directly—no need for extra workers. Just request multiple GPUs via resourcesPerNode.

Is that correct @trivialfis? I thought that we should indicate all available GPUs across all nodes in DMLC_NUM_WORKER.
For example, if we have 2 nodes which have 4 GPUs each: DMLC_NUM_WORKER=8
XGBoost uses one worker/process per GPU.

> For example, if we have 2 nodes which have 4 GPUs each: DMLC_NUM_WORKER=8

This statement is correct.

> XGBoost's single process can use all GPUs on a node directly—no need for extra workers.

This one is not.
Thank you for the correction @trivialfis, @andreyvelich. You're right: for multi-GPU training, XGBoost follows a "one worker per GPU" pattern.
So basically:
DMLC_NUM_WORKER = numNodes × numWorkersPerNode
For 2 nodes with 4 GPUs each: DMLC_NUM_WORKER = 8
I will update the environment variable injection logic accordingly, update the "Parallelism Model" section to clarify the GPU multi-worker requirements, and add numWorkersPerNode to XGBoostMLPolicySource.
My earlier understanding that "XGBoost's single process can use all GPUs" was based on the deprecated n_gpus parameter.
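For readers following along, here is a standalone sketch (not Trainer code) of the rank layout this implies for the 2-node × 4-GPU example above, assuming one worker process per GPU and a global rank used as `DMLC_TASK_ID`:

```go
package main

import "fmt"

func main() {
	numNodes := 2       // TrainJob numNodes
	workersPerNode := 4 // one worker process per GPU on each node
	numWorkers := numNodes * workersPerNode

	fmt.Println("DMLC_NUM_WORKER =", numWorkers) // 8 for this example

	for node := 0; node < numNodes; node++ {
		for local := 0; local < workersPerNode; local++ {
			rank := node*workersPerNode + local // global rank used as DMLC_TASK_ID
			fmt.Printf("node %d, local worker %d -> DMLC_TASK_ID=%d\n", node, local, rank)
		}
	}
}
```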
Looks right. Yeah, the n_gpus parameter was another relic from before modern distributed computing. We gave up on that parameter and decided distributed frameworks are much better places to manage GPUs than XGBoost.
> also add numWorkersPerNode to XGBoostMLPolicySource
Let's dynamically get the number of devices per node based on container resources.
Similar to how we do this in Torch if numProcPerNode is not set: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/torch/torch.go#L121
We don't have use-cases where users want to configure this parameter explicitly as of now, and we might want to consider deprecating this API for the Torch policy (cc @astefanutti @tenzen-y @akshaychitneni)
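A minimal sketch of the dynamic derivation suggested here, mirroring the Torch fallback: count the GPUs requested by the trainer container and default to a single worker for CPU-only jobs. The helper name `workersPerNode` and the `nvidia.com/gpu` resource key are assumptions for illustration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// workersPerNode returns how many XGBoost worker processes to run in one pod.
func workersPerNode(c corev1.Container) int64 {
	// One worker per requested GPU; the resource key is an assumption.
	if gpu, ok := c.Resources.Limits[corev1.ResourceName("nvidia.com/gpu")]; ok && gpu.Value() > 0 {
		return gpu.Value()
	}
	// CPU training: a single worker that multi-threads across the node's cores.
	return 1
}

func main() {
	trainer := corev1.Container{
		Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("4"),
			},
		},
	}
	fmt.Println("workers per node:", workersPerNode(trainer)) // 4
}
```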
I’ve already updated this in the docs to reflect the dynamic behavior. Happy to adjust or add more detail if we decide to formally deprecate the API for Torch policy.
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: trivialfis, nqvuong1998, siyuanfoundation. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks a lot, Andrey! I’ll follow up on the suggestions and update accordingly. Appreciate the feedback on the native XGBoost support direction — I’d also love to help work on the implementation going forward and iterate with the folks cc’d.
Thank you for the ping. Out of curiosity, how does the trainer load the training data? Asking, as it's quite flexible, particularly with the iterator. These are built to help scale up:
Hey @trivialfis, as per my current understanding, Trainer's role is limited to providing the distributed execution environment (pod orchestration and the DMLC_* / Rabit coordination environment variables). Data loading is fully handled in user code and happens independently on each worker, before Rabit is initialized. Since Trainer doesn't participate in dataset ingestion, all XGBoost data-loading patterns work unchanged, including in-memory DMatrix, iterator-based QuantileDMatrix, and external-memory ExtMemQuantileDMatrix. Please let me know if I'm missing anything or if there's any scope for improvement.
The DMatrix construction involves synchronization for data shape and quantization. It has to live under the Rabit context.
It depends on the user: they can use PersistentVolumes and the initializer, or just download data to disk as @Krishna-kg732 mentioned. The iterator is quite interesting!
@Krishna-kg732 Can you sign your commits please?
Thank you for sharing; this is really helpful. With either approach, please make very sure the construction of DMatrix lives under the collective context. One might get away with it when the data's shape is regular (dense, uniformly distributed across workers, etc.) and XGBoost doesn't raise an error, but the behaviour is undefined, and the quantization result is invalid.
andreyvelich left a comment
Thanks @Krishna-kg732!
Overall, lgtm
Would you be able to attend our next Trainer call (6am PST Wed) to give an overview of this KEP?
https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0#heading=h.7tnyayn3gyqa
```go
// - CPU training: defaults to 1
// DMLC_NUM_WORKER = numNodes × numWorkersPerNode
// +optional
NumWorkersPerNode *int32 `json:"numWorkersPerNode,omitempty"`
```
I would suggest we remove this from the initial implementation.
Let's dynamically calculate this value based on available GPU/CPU resources in the container spec.
- **CPU Training:** Typically 1 worker per node (XGBoost uses multi-threading within a process)
When we have multiple CPUs in a single node, can XGBoost run multiple workers per node?
As per my understanding, for vanilla XGBoost on CPU it's always 1 worker per node.
Even if a node has multiple CPUs/cores, XGBoost uses multi-threading within a single worker process rather than spawning multiple workers per node.
@trivialfis Is it correct that XGBoost cannot run multiple workers per node for CPU workloads?
It can if you want, maybe for multi-socket systems where pinning the process might yield better performance, or maybe for testing purposes.
In practice, we don't usually do that, except under very specific, weird, old cloud environments where virtual CPUs have really low performance with multi-threaded applications. I was told about that on GCP a very long time ago, but haven't seen it myself.
I see, so we can start with a single worker per node for CPU for now, and see if users have other use-cases moving forward.
I guess XGBoost workers will still consume all available CPU capacity, right? We might need to test it.
Yes, it should consume all CPU cores.
**Dockerfile example:**

```dockerfile
FROM python:3.11-slim
```
If we want an image that supports GPU workloads, we might want to use NVIDIA CUDA images as the base, like we do here: https://github.com/kubeflow/trainer/blob/master/cmd/runtimes/deepspeed/Dockerfile#L2
We need to see if we can have a single image for both CPU and GPU.
Yes, we can use a single image for both CPU and GPU, following the DeepSpeed pattern. I will update the KEP to use NVIDIA CUDA images as the base.
Yes, I can join at 6:30 AM PST if that works. Happy to give an overview of the KEP.
/milestone v2.2
What this PR does
This KEP proposes adding an XGBoost Runtime to Kubeflow Trainer V2 to support distributed XGBoost training on Kubernetes using the Rabit-based coordination model.
Why we need it
XGBoost is widely used for structured/tabular data. Currently, users must manually configure Rabit environment variables for distributed training. This KEP enables declarative XGBoost distributed training through Trainer V2's Runtime API.
Key Proposals
- Add `XGBoostMLPolicySource` to the existing `MLPolicySource` struct
- Inject the Rabit coordination environment variables (`DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`, `DMLC_TASK_ID`, `DMLC_NUM_WORKER`); a sketch of the injection follows below
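As a rough illustration of the injection described above, the sketch below builds the `DMLC_*` variables for a single worker. The helper name and the pod DNS pattern follow the example table in the KEP and are assumptions rather than final plugin code.

```go
package main

import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// rabitEnv builds the DMLC_* variables for one worker with the given global rank.
func rabitEnv(jobName string, numWorkers, rank int) []corev1.EnvVar {
	return []corev1.EnvVar{
		// The rank-0 pod acts as the Rabit tracker; its DNS name follows the
		// "<job>-node-0-0.<job>" pattern from the example table in the KEP.
		{Name: "DMLC_TRACKER_URI", Value: fmt.Sprintf("%s-node-0-0.%s", jobName, jobName)},
		{Name: "DMLC_TRACKER_PORT", Value: "9091"},
		{Name: "DMLC_TASK_ID", Value: strconv.Itoa(rank)},
		{Name: "DMLC_NUM_WORKER", Value: strconv.Itoa(numWorkers)},
	}
}

func main() {
	// Example: 8 workers total (2 nodes × 4 GPUs), printing env for global rank 3.
	for _, e := range rabitEnv("myjob", 8, 3) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```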
Related Issues

#2598
Checklist
kind/design
area/training-operator