feat(docs): KEP-2598 XGBoost Runtime for Trainer V2 #3118

Krishna-kg732 wants to merge 6 commits into kubeflow:master
Conversation
Pull request overview
This PR introduces a Kubeflow Enhancement Proposal (KEP-2598) for adding XGBoost Runtime support to Kubeflow Trainer V2. The proposal enables declarative distributed XGBoost training on Kubernetes using Rabit-based coordination, eliminating the need for manual environment variable configuration.
Changes:
- Adds comprehensive KEP documentation proposing XGBoost runtime integration with Trainer V2
- Proposes a new `XGBoostMLPolicySource` API addition to the existing MLPolicy framework (a sketch follows below)
- Defines the implementation approach for an XGBoost plugin with Rabit environment variable injection
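To make the proposed API shape concrete, here is a minimal Go sketch of how `XGBoostMLPolicySource` could hang off the existing `MLPolicySource`. The field name, JSON tag, and struct contents are assumptions for illustration, not the final API.

```go
// Sketch only: names and tags are assumptions based on the KEP text.
package trainer

// MLPolicySource selects the framework-specific policy (Torch, MPI, ...).
// Existing fields are elided; the KEP proposes adding an XGBoost variant.
type MLPolicySource struct {
	// XGBoost configures Rabit-based distributed XGBoost training.
	// +optional
	XGBoost *XGBoostMLPolicySource `json:"xgboost,omitempty"`
}

// XGBoostMLPolicySource holds XGBoost-specific settings. Per the review
// discussion below, workers per node are derived from container resources
// rather than exposed as an API field in the initial implementation.
type XGBoostMLPolicySource struct{}
```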
Pull Request Test Coverage Report for Build 21506714809

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
andreyvelich left a comment
Thanks @Krishna-kg732!
Overall looks good, I left a few suggestions.
Appreciate your work to add native support for XGBoost on Kubernetes!
/cc @tenzen-y @astefanutti @akshaychitneni @nqvuong1998 @siyuanfoundation @trivialfis
| Variable | Description | Example |
|---|---|---|
| `DMLC_TRACKER_URI` | Address of rank-0 pod (Rabit tracker) | `myjob-node-0-0.myjob` |
| `DMLC_TRACKER_PORT` | Tracker port | `9091` |
| `DMLC_TASK_ID` | Worker rank | `0`, `1`, `2`... |
| `DMLC_NUM_WORKER` | Total worker count | `4` |
Please explain how you are going to fill the DMLC_NUM_WORKER variable.
What if we have multiple GPUs/CPUs on a single node (e.g. multi-node, multi-gpu TrainJob)?
DMLC_NUM_WORKER = numNodes from TrainJob (1 worker per pod).
Multi-GPU: XGBoost's single process can use all GPUs on a node directly—no need for extra workers. Just request multiple GPUs via resourcesPerNode.
Will add a "Parallelism Model" section clarifying this
> Multi-GPU: XGBoost's single process can use all GPUs on a node directly—no need for extra workers. Just request multiple GPUs via resourcesPerNode.

Is that correct @trivialfis? I thought that we should indicate all available GPUs across all nodes in DMLC_NUM_WORKER.
For example, if we have 2 nodes which have 4 GPUs each: DMLC_NUM_WORKER=8
XGBoost uses one worker/process per GPU.

> For example, if we have 2 nodes which have 4 GPUs each: DMLC_NUM_WORKER=8

This statement is correct.

> XGBoost's single process can use all GPUs on a node directly—no need for extra workers.

This one is not.
Thank you for the correction @trivialfis, @andreyvelich. You're right: for multi-GPU training, XGBoost follows a "one worker per GPU" pattern.
So basically:
DMLC_NUM_WORKER = numNodes × numWorkersPerNode
For 2 nodes with 4 GPUs each: DMLC_NUM_WORKER = 8
I will update the environment variable injection logic accordingly, update the "Parallelism Model" section to clarify the GPU multi-worker requirements, and add numWorkersPerNode to XGBoostMLPolicySource.
My earlier understanding that "XGBoost's single process can use all GPUs" was based on the deprecated n_gpus parameter.
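For readers following along, here is a standalone sketch (not Trainer code) of the rank layout this implies for the 2-node × 4-GPU example above, assuming one worker process per GPU and a global rank used as `DMLC_TASK_ID`:

```go
package main

import "fmt"

func main() {
	numNodes := 2       // TrainJob numNodes
	workersPerNode := 4 // one worker process per GPU on each node
	numWorkers := numNodes * workersPerNode

	fmt.Println("DMLC_NUM_WORKER =", numWorkers) // 8 for this example

	for node := 0; node < numNodes; node++ {
		for local := 0; local < workersPerNode; local++ {
			rank := node*workersPerNode + local // global rank used as DMLC_TASK_ID
			fmt.Printf("node %d, local worker %d -> DMLC_TASK_ID=%d\n", node, local, rank)
		}
	}
}
```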
Looks right. Yeah, the n_gpus parameter was another relic from before modern distributed computing. We gave up on that parameter and decided distributed frameworks are much better places to manage GPUs than XGBoost.
> also add numWorkersPerNode to XGBoostMLPolicySource
Let's dynamically get the number of devices per node based on container resources.
Similar to how we do this in Torch if numProcPerNode is not set: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/torch/torch.go#L121
We don't have use-cases where users want to configure this parameter explicitly as of now, and we might want to consider deprecating this API for the Torch policy (cc @astefanutti @tenzen-y @akshaychitneni)
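A minimal sketch of the dynamic derivation suggested here, mirroring the Torch fallback: count the GPUs requested by the trainer container and default to a single worker for CPU-only jobs. The helper name `workersPerNode` and the `nvidia.com/gpu` resource key are assumptions for illustration.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// workersPerNode returns how many XGBoost worker processes to run in one pod.
func workersPerNode(c corev1.Container) int64 {
	// One worker per requested GPU; the resource key is an assumption.
	if gpu, ok := c.Resources.Limits[corev1.ResourceName("nvidia.com/gpu")]; ok && gpu.Value() > 0 {
		return gpu.Value()
	}
	// CPU training: a single worker that multi-threads across the node's cores.
	return 1
}

func main() {
	trainer := corev1.Container{
		Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceName("nvidia.com/gpu"): resource.MustParse("4"),
			},
		},
	}
	fmt.Println("workers per node:", workersPerNode(trainer)) // 4
}
```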
I’ve already updated this in the docs to reflect the dynamic behavior. Happy to adjust or add more detail if we decide to formally deprecate the API for Torch policy.
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: trivialfis, nqvuong1998, siyuanfoundation. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks a lot, Andrey! I’ll follow up on the suggestions and update accordingly. Appreciate the feedback on the native XGBoost support direction — I’d also love to help work on the implementation going forward and iterate with the folks cc’d.
Thank you for the ping. Out of curiosity, how does the trainer load the training data? Asking, as it's quite flexible, particularly with the iterator. These are built to help scale up:
Hey @trivialfis, as per my current understanding, Trainer's role is limited to providing the distributed execution environment (pod orchestration and the DMLC_* / Rabit coordination environment variables). Data loading is fully handled in user code and happens independently on each worker, before Rabit is initialized. Since Trainer doesn't participate in dataset ingestion, all XGBoost data-loading patterns work unchanged, including in-memory DMatrix, iterator-based QuantileDMatrix, and external-memory ExtMemQuantileDMatrix. Please let me know if I'm missing anything or if there's any scope for improvement.
The DMatrix construction involves synchronization for data shape and quantization. It has to live under the Rabit context.
It depends on the user: they can use PersistentVolumes and the initializer, or just download data to disk as @Krishna-kg732 mentioned. The iterator is quite interesting!
@Krishna-kg732 Can you sign your commits please?
Thank you for sharing; this is really helpful. With either approach, please make very sure the construction of DMatrix lives under the collective context. One might get away with it when the data's shape is regular (dense, uniformly distributed across workers, etc.) and XGBoost doesn't raise an error, but the behaviour is undefined, and the quantization result is invalid.
andreyvelich left a comment
Thanks @Krishna-kg732!
Overall, lgtm
Would you be able to attend our next Trainer call (6am PST Wed) to give an overview of this KEP?
https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0#heading=h.7tnyayn3gyqa
```go
// - CPU training: defaults to 1
// DMLC_NUM_WORKER = numNodes × numWorkersPerNode
// +optional
NumWorkersPerNode *int32 `json:"numWorkersPerNode,omitempty"`
```
I would suggest we remove this from the initial implementation.
Let's dynamically calculate this value based on available GPU/CPU resources in the container spec.
- **CPU Training:** Typically 1 worker per node (XGBoost uses multi-threading within a process)
When we have multiple CPUs in a single node, can XGBoost run multiple workers per node?
As per my understanding, for vanilla XGBoost on CPU it's always 1 worker per node.
Even if a node has multiple CPUs/cores, XGBoost uses multi-threading within a single worker process rather than spawning multiple workers per node.
@trivialfis Is it correct that XGBoost cannot run multiple workers per node for CPU workloads?
It can if you want, maybe for multi-socket systems where pinning the process might yield better performance, or maybe for testing purposes.
In practice, we don't usually do that, except under very specific, weird, old cloud environments where virtual CPUs have really low performance with multi-threaded applications. I was told about that on GCP a very long time ago, but haven't seen it myself.
I see, so we can start with a single worker per node for CPU for now, and see if users have other use-cases moving forward.
I guess XGBoost workers will still consume all available CPU capacity, right? We might need to test it.
Yes, it should consume all CPU cores.
**Dockerfile example:**

```dockerfile
FROM python:3.11-slim
```
If we want an image that supports GPU workloads, we might want to use NVIDIA CUDA images as the base, like we do here: https://github.com/kubeflow/trainer/blob/master/cmd/runtimes/deepspeed/Dockerfile#L2
We need to see if we can have a single image for both CPU and GPU.
Yes, we can use a single image for both CPU and GPU, following the DeepSpeed pattern. I will update the KEP to use NVIDIA CUDA images as the base.
Yes, I can join at 6:30 AM PST if that works. Happy to give an overview of the KEP.
/milestone v2.2
What this PR does
This KEP proposes adding an XGBoost Runtime to Kubeflow Trainer V2 to support distributed XGBoost training on Kubernetes using the Rabit-based coordination model.
Why we need it
XGBoost is widely used for structured/tabular data. Currently, users must manually configure Rabit environment variables for distributed training. This KEP enables declarative XGBoost distributed training through Trainer V2's Runtime API.
Key Proposals
- Add `XGBoostMLPolicySource` to the existing `MLPolicySource` struct
- Inject the Rabit coordination environment variables (`DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`, `DMLC_TASK_ID`, `DMLC_NUM_WORKER`); a sketch of the injection follows below
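As a rough illustration of the injection described above, the sketch below builds the `DMLC_*` variables for a single worker. The helper name and the pod DNS pattern follow the example table in the KEP and are assumptions rather than final plugin code.

```go
package main

import (
	"fmt"
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// rabitEnv builds the DMLC_* variables for one worker with the given global rank.
func rabitEnv(jobName string, numWorkers, rank int) []corev1.EnvVar {
	return []corev1.EnvVar{
		// The rank-0 pod acts as the Rabit tracker; its DNS name follows the
		// "<job>-node-0-0.<job>" pattern from the example table in the KEP.
		{Name: "DMLC_TRACKER_URI", Value: fmt.Sprintf("%s-node-0-0.%s", jobName, jobName)},
		{Name: "DMLC_TRACKER_PORT", Value: "9091"},
		{Name: "DMLC_TASK_ID", Value: strconv.Itoa(rank)},
		{Name: "DMLC_NUM_WORKER", Value: strconv.Itoa(numWorkers)},
	}
}

func main() {
	// Example: 8 workers total (2 nodes × 4 GPUs), printing env for global rank 3.
	for _, e := range rabitEnv("myjob", 8, 3) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```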
Related Issues

#2598
Checklist
kind/design
area/training-operator