
feat(docs): KEP-2598 XGBoost Runtime for Trainer V2 #3118

Open
Krishna-kg732 wants to merge 6 commits into kubeflow:master from Krishna-kg732:kep-2598-xgboost-runtime

Conversation

@Krishna-kg732

@Krishna-kg732 Krishna-kg732 commented Jan 22, 2026

What this PR does

This KEP proposes adding an XGBoost Runtime to Kubeflow Trainer V2 to support distributed XGBoost training on Kubernetes using the Rabit-based coordination model.

Why we need it

XGBoost is widely used for structured/tabular data. Currently, users must manually configure Rabit environment variables for distributed training. This KEP enables declarative XGBoost distributed training through Trainer V2's Runtime API.

Key Proposals

  • Add XGBoostMLPolicySource to existing MLPolicySource struct
  • Create XGBoost plugin implementing EnforceMLPolicyPlugin
  • Inject Rabit environment variables (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_TASK_ID, DMLC_NUM_WORKER)
  • Rank-0 pod serves as the Rabit tracker
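
For illustration, here is a rough Go sketch of the Rabit environment variables the plugin would inject per training node. The helper name (`buildRabitEnv`), its parameters, and the empty `XGBoostMLPolicySource` are placeholders for discussion, not the final design:

```go
// Illustrative sketch only; helper and field names are assumptions, not the final API.
package main

import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// XGBoostMLPolicySource would be added to the existing MLPolicySource struct;
// its exact fields are still under discussion in this KEP.
type XGBoostMLPolicySource struct{}

// buildRabitEnv returns the Rabit coordination variables injected into each
// training node. trackerHost points at the rank-0 pod acting as the Rabit
// tracker, e.g. "myjob-node-0-0.myjob".
func buildRabitEnv(trackerHost string, trackerPort, rank, numWorkers int32) []corev1.EnvVar {
	return []corev1.EnvVar{
		{Name: "DMLC_TRACKER_URI", Value: trackerHost},
		{Name: "DMLC_TRACKER_PORT", Value: strconv.Itoa(int(trackerPort))},
		{Name: "DMLC_TASK_ID", Value: strconv.Itoa(int(rank))},
		{Name: "DMLC_NUM_WORKER", Value: strconv.Itoa(int(numWorkers))},
	}
}

func main() {
	// Example: the rank-1 worker in a 4-worker TrainJob.
	for _, e := range buildRabitEnv("myjob-node-0-0.myjob", 9091, 1, 4) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```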

Related Issues

#2598

Checklist

  • KEP document created at docs/proposals/2598-XGboost-runtime-trainer-v2/README.md
  • Implementation (follow-up PRs)
  • Unit tests
  • E2E tests
  • Documentation

kind/design
area/training-operator

Copilot AI review requested due to automatic review settings January 22, 2026 08:14
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a Kubeflow Enhancement Proposal (KEP-2598) for adding XGBoost Runtime support to Kubeflow Trainer V2. The proposal enables declarative distributed XGBoost training on Kubernetes using Rabit-based coordination, eliminating the need for manual environment variable configuration.

Changes:

  • Adds comprehensive KEP documentation proposing XGBoost runtime integration with Trainer V2
  • Proposes new XGBoostMLPolicySource API addition to the existing MLPolicy framework
  • Defines implementation approach for XGBoost plugin with Rabit environment variable injection


@coveralls

coveralls commented Jan 22, 2026

Pull Request Test Coverage Report for Build 21506714809

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.217%

Totals
Change from base Build 21420370777: 0.0%
Covered Lines: 1241
Relevant Lines: 2423

💛 - Coveralls

Member

@andreyvelich andreyvelich left a comment

Thanks @Krishna-kg732!
Overall looks good, I left a few suggestions.
Appreciate your work to add native support for XGBoost on Kubernetes!

/cc @tenzen-y @astefanutti @akshaychitneni @nqvuong1998 @siyuanfoundation @trivialfis

| `DMLC_TRACKER_URI` | Address of rank-0 pod (Rabit tracker) | `myjob-node-0-0.myjob` |
| `DMLC_TRACKER_PORT` | Tracker port | `9091` |
| `DMLC_TASK_ID` | Worker rank | `0`, `1`, `2`... |
| `DMLC_NUM_WORKER` | Total worker count | `4` |
Member

Please explain how you are going to fill the DMLC_NUM_WORKER variable.
What if we have multiple GPUs/CPUs on a single node (e.g. a multi-node, multi-GPU TrainJob)?

Author

DMLC_NUM_WORKER = numNodes from TrainJob (1 worker per pod).

Multi-GPU: XGBoost's single process can use all GPUs on a node directly—no need for extra workers. Just request multiple GPUs via resourcesPerNode.

Will add a "Parallelism Model" section clarifying this

Member

@andreyvelich andreyvelich Jan 27, 2026

Multi-GPU: XGBoost's single process can use all GPUs on a node directly—no need for extra workers. Just request multiple GPUs via resourcesPerNode.

Is that correct @trivialfis? I thought that we should indicate all available GPUs across all nodes in DMLC_NUM_WORKER.
For example, if we have 2 nodes which have 4 GPUs each:

DMLC_NUM_WORKER=8


XGBoost uses one worker/process per GPU.

For example, if we have 2 nodes which have 4 GPUs each:

This statement is correct.

XGBoost's single process can use all GPUs on a node directly—no need for extra workers.

This one is not.

Author

@Krishna-kg732 Krishna-kg732 Jan 28, 2026

Thank you for the correction @trivialfis, @andreyvelich. You're right — for multi-GPU training, XGBoost follows a "one worker per GPU" pattern.

So basically:

DMLC_NUM_WORKER = numNodes × numWorkersPerNode
For 2 nodes with 4 GPUs each: DMLC_NUM_WORKER = 8

I will update the environment variable injection logic accordingly, expand the "Parallelism Model" section to clarify the GPU multi-worker requirement, and add numWorkersPerNode to XGBoostMLPolicySource.

My earlier understanding that "XGBoost's single process can use all GPUs" was based on the deprecated n_gpus parameter.


Looks right. Yeah, the n_gpus was another relic before modern distributed computing. We gave up on that parameter and decided distributed frameworks are much better places to manage GPUs than XGBoost.

Member

@andreyvelich andreyvelich Jan 28, 2026

also add numWorkersPerNode to XGBoostMLPolicySource

Let's dynamically get the number of devices per node based on container resources.
Similar to how we do this in Torch if numProcPerNode is not set: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/torch/torch.go#L121

We don't have use cases where users need to manually configure this parameter as of now, and we might want to consider deprecating this API for the Torch policy (cc @astefanutti @tenzen-y @akshaychitneni)
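
For reference, a minimal sketch of what that dynamic lookup could look like, assuming one Rabit worker per GPU and a fallback to one worker per node for CPU training (this is not the actual torch.go code; the resource name and helper are illustrative):

```go
// Sketch only: derive workers per node from the trainer container's resources.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func workersPerNode(res corev1.ResourceRequirements) int32 {
	// One XGBoost worker (process) per requested GPU.
	if gpus, ok := res.Limits[corev1.ResourceName("nvidia.com/gpu")]; ok && !gpus.IsZero() {
		return int32(gpus.Value())
	}
	// CPU training: a single multi-threaded worker per node.
	return 1
}

func main() {
	limits := corev1.ResourceList{
		corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("4"),
	}
	numNodes := int32(2)
	perNode := workersPerNode(corev1.ResourceRequirements{Limits: limits})
	// DMLC_NUM_WORKER = numNodes × workersPerNode → 8 for 2 nodes with 4 GPUs each.
	fmt.Println("DMLC_NUM_WORKER =", numNodes*perNode)
}
```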

Author

I’ve already updated this in the docs to reflect the dynamic behavior. Happy to adjust or add more detail if we decide to formally deprecate the API for Torch policy.

@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: trivialfis, nqvuong1998, siyuanfoundation.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

Thanks @Krishna-kg732!
Overall looks good, I left a few suggestions.
Appreciate your work to add native support for XGBoost on Kubernetes!

/cc @tenzen-y @astefanutti @akshaychitneni @nqvuong1998 @siyuanfoundation @trivialfis

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Krishna-kg732
Author

Thanks a lot, Andrey!

I’ll follow up on the suggestions and update accordingly. Appreciate the feedback on the native XGBoost support direction — I’d also love to help work on the implementation going forward and iterate with the folks cc’d.

@trivialfis

trivialfis commented Jan 27, 2026

Thank you for the ping. Out of curiosity, how does the trainer load the training data? Asking, as it's quite flexible, particularly with the iterator. These are built to help scale up:

@Krishna-kg732 Krishna-kg732 force-pushed the kep-2598-xgboost-runtime branch from 372d095 to 795b33c on January 27, 2026 08:36
@Krishna-kg732
Author

Hey @trivialfis,

As per my current understanding, Trainer's role is limited to providing the distributed execution environment (pod orchestration and the DMLC_* / Rabit coordination environment variables). Data loading is fully handled in user code and happens independently on each worker, before Rabit is initialized.
The proposed flow looks like this:

Trainer V2
└── Pod (container running user’s image)
    └── python train.py   ← user script
        ├── load data (DMatrix, iterator, etc.)   ← user code
        ├── xgb.rabit.init()
        └── xgb.train()

Since Trainer doesn't participate in dataset ingestion, all XGBoost data-loading patterns work unchanged — including in-memory DMatrix, iterator-based QuantileDMatrix, and external-memory ExtMemQuantileDMatrix.

Please let me know if I'm missing anything or if there's any scope for improvement.

@trivialfis

The DMatrix construction involves synchronization for data shape and quantization. It has to live under the rabit context.

@andreyvelich
Member

andreyvelich commented Jan 27, 2026

Thank you for the ping. Out of curiosity, how does the trainer load the training data? Asking, as it's quite flexible, particularly with the iterator. These are built to help scale up:

It depends on the user: they can use PersistentVolumes and the initializer, or just download data to disk as @Krishna-kg732 mentioned.
Alternatively, users can stream data.

Iterator is quite interesting!
We can potentially leverage Distributed Data Cache with XGBoost to stream tensors directly to GPU nodes: https://www.kubeflow.org/docs/components/trainer/user-guides/data-cache/

cc @akshaychitneni

@andreyvelich
Member

@Krishna-kg732 Can you sign your commits please?

@trivialfis

It depends on the user, they can

Thank you for sharing, this is really helpful. With either approach, please make very sure the construction of DMatrix lives under the collective context. One might get away with it when the data's shape is regular (dense, uniformly distributed across workers, etc.) and XGBoost doesn't raise an error, but the behaviour is undefined, and the quantization result is invalid.

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
@Krishna-kg732 Krishna-kg732 force-pushed the kep-2598-xgboost-runtime branch 2 times, most recently from 3d4fbee to 890c052 on January 28, 2026 11:08
Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
@Krishna-kg732 Krishna-kg732 force-pushed the kep-2598-xgboost-runtime branch from 890c052 to 78126c6 on January 28, 2026 11:18
… convention

Signed-off-by: Krishna-kg732 <2405732@kiit.ac.in>
Member

@andreyvelich andreyvelich left a comment

Thanks @Krishna-kg732!
Overall, lgtm
Would you be able to attend our next Trainer call (6am PST Wed) to give an overview of this KEP?
https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0#heading=h.7tnyayn3gyqa

// - CPU training: defaults to 1
// DMLC_NUM_WORKER = numNodes × numWorkersPerNode
// +optional
NumWorkersPerNode *int32 `json:"numWorkersPerNode,omitempty"`
Member

I would suggest we remove this from the initial implementation.
Let's dynamically calculate this value based on available GPU/CPU resources in the container spec.

Suggested change
NumWorkersPerNode *int32 `json:"numWorkersPerNode,omitempty"`

DMLC_NUM_WORKER = numNodes × numWorkersPerNode
```

- **CPU Training:** Typically 1 worker per node (XGBoost uses multi-threading within a process)
Member

When we have multiple CPUs in a single node, can XGBoost run multiple workers per node?

Author

As per my understanding, for vanilla XGBoost on CPU it's always 1 worker per node.
Even if a node has multiple CPUs/cores, XGBoost uses multi-threading within a single worker process rather than spawning multiple workers per node.

Member

@trivialfis Is it correct that XGBoost cannot run multiple workers per node for CPU workloads?

@trivialfis trivialfis Feb 3, 2026

It can if you want, maybe for multi-socket systems where pinning the process might yield better performance, or maybe for testing purposes.

In practice, we don't usually do that, except under a very specific, weird, old cloud environment where virtual CPUs have really low performance with multi-threaded applications. I was told about that on GCP a very long time ago, but haven't seen it myself.

Member

I see, so we can start with a single worker per node for CPU for now, and see if users have other use cases moving forward.
I guess XGBoost workers will still consume all available CPU capacity, right? We might need to test it.


Yes, it should consume all CPU cores.


**Dockerfile example:**
```dockerfile
FROM python:3.11-slim
Member

If we want an image that supports GPU workloads, we might want to use NVIDIA CUDA images as the base, like we do here: https://github.com/kubeflow/trainer/blob/master/cmd/runtimes/deepspeed/Dockerfile#L2
We need to see if we can have a single image for both CPU and GPU.

Author

Yes, we can use a single image for both CPU and GPU, following the DeepSpeed pattern. I will update the KEP to use NVIDIA CUDA images as the base.

@Krishna-kg732
Author

Would you be able to attend our next Trainer call (6am PST Wed) to give an overview of this KEP?

Yes, I can join at 6:30 AM PST if that works. Happy to give an overview of the KEP.

@andreyvelich
Member

/milestone v2.2

@google-oss-prow google-oss-prow bot added this to the v2.2 milestone Feb 4, 2026
