From 37d5fdc02230ad566c2e1d6bcb58fbf73f1bc91b Mon Sep 17 00:00:00 2001 From: krishna-kg732 Date: Fri, 13 Mar 2026 00:12:49 +0530 Subject: [PATCH 1/4] docs: add XGBoost distributed training user guide Add a user guide for distributed XGBoost training on Kubernetes via Kubeflow Trainer at content/en/docs/components/trainer/user-guides/xgboost.md. The guide provides: - An overview of the XGBoost Collective protocol and how Kubeflow Trainer integrates with it (DMLC_* env vars, JobSet, built-in runtime) - Worker count formula for CPU and GPU training - A redirect to the comprehensive XGBoost tutorial at https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html Signed-off-by: krishna-kg732 --- .../components/trainer/user-guides/xgboost.md | 63 +++++++++++++++++++ 1 file changed, 63 insertions(+) create mode 100644 content/en/docs/components/trainer/user-guides/xgboost.md diff --git a/content/en/docs/components/trainer/user-guides/xgboost.md b/content/en/docs/components/trainer/user-guides/xgboost.md new file mode 100644 index 0000000000..27eacb4fac --- /dev/null +++ b/content/en/docs/components/trainer/user-guides/xgboost.md @@ -0,0 +1,63 @@ ++++ +title = "XGBoost Guide" +description = "How to run distributed XGBoost on Kubernetes with Kubeflow Trainer" +weight = 20 ++++ + +This guide describes how to use TrainJob to run distributed +[XGBoost](https://xgboost.readthedocs.io/) training on Kubernetes. + +--- + +## Prerequisites + +Before exploring this guide, make sure to follow +[the Getting Started guide](/docs/components/trainer/getting-started/) +to understand the basics of Kubeflow Trainer. + +--- + +## XGBoost Distributed Overview + +XGBoost supports distributed training through the +[Collective](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html) +communication protocol (historically known as Rabit). 
In a distributed setting, +multiple worker processes each operate on a shard of the data and synchronize +histogram bin statistics via AllReduce to agree on the best tree splits. + +Kubeflow Trainer integrates with XGBoost by: + +- Deploying worker pods as a [JobSet](https://github.com/kubernetes-sigs/jobset). +- Automatically injecting the `DMLC_*` environment variables required by XGBoost's + Collective communication layer (`DMLC_TRACKER_URI`, `DMLC_TRACKER_PORT`, + `DMLC_TASK_ID`, `DMLC_NUM_WORKER`). +- Providing the rank-0 pod with the tracker address so user code can start a + `RabitTracker` for worker coordination. +- Supporting both CPU and GPU training workloads. + +The built-in runtime is called `xgboost-distributed` and uses the container image +`ghcr.io/kubeflow/trainer/xgboost-runtime:latest`, which includes XGBoost with +CUDA 12 support, NumPy, and scikit-learn. + +### Worker Count + +The total number of XGBoost workers is calculated as: + +```text +DMLC_NUM_WORKER = numNodes × workersPerNode +``` + +- **CPU training**: 1 worker per node. Each worker uses OpenMP to parallelize + across all available CPU cores. +- **GPU training**: 1 worker per GPU. The GPU count is derived from + `resourcesPerNode` limits in the TrainJob. 
+ +--- + +## Further Information + +For comprehensive documentation including complete training examples (Python SDK +and kubectl YAML), best practices (`QuantileDMatrix`, early stopping, +checkpointing, logging), and common issues, see the XGBoost documentation: + +**[Distributed XGBoost on Kubernetes — XGBoost Tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)** From 8b0c9f7980b576a38f23541ae4dadb1503a4828c Mon Sep 17 00:00:00 2001 From: Krishna Gupta Date: Fri, 13 Mar 2026 06:18:03 +0530 Subject: [PATCH 2/4] Apply suggestions from code review Co-authored-by: Andrey Velichkevich Signed-off-by: Krishna Gupta --- content/en/docs/components/trainer/user-guides/xgboost.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/content/en/docs/components/trainer/user-guides/xgboost.md b/content/en/docs/components/trainer/user-guides/xgboost.md index 27eacb4fac..7783c2012d 100644 --- a/content/en/docs/components/trainer/user-guides/xgboost.md +++ b/content/en/docs/components/trainer/user-guides/xgboost.md @@ -1,14 +1,12 @@ +++ title = "XGBoost Guide" -description = "How to run distributed XGBoost on Kubernetes with Kubeflow Trainer" +description = "How to run XGBoost on Kubernetes with Kubeflow Trainer" weight = 20 +++ This guide describes how to use TrainJob to run distributed [XGBoost](https://xgboost.readthedocs.io/) training on Kubernetes. 
---- - ## Prerequisites Before exploring this guide, make sure to follow From 3bf20f532229f5f3f1be12781976d595881a3b10 Mon Sep 17 00:00:00 2001 From: krishna-kg732 Date: Fri, 13 Mar 2026 06:27:02 +0530 Subject: [PATCH 3/4] added notebook example Signed-off-by: krishna-kg732 --- content/en/docs/components/trainer/user-guides/xgboost.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/content/en/docs/components/trainer/user-guides/xgboost.md b/content/en/docs/components/trainer/user-guides/xgboost.md index 7783c2012d..3eca3c456d 100644 --- a/content/en/docs/components/trainer/user-guides/xgboost.md +++ b/content/en/docs/components/trainer/user-guides/xgboost.md @@ -59,3 +59,7 @@ and kubectl YAML), best practices (`QuantileDMatrix`, early stopping, checkpointing, logging), and common issues, see the XGBoost documentation: **[Distributed XGBoost on Kubernetes — XGBoost Tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)** + +You can also use the Kubeflow Trainer distributed XGBoost notebook example: + +**[xgboost-distributed.ipynb](https://github.com/kubeflow/trainer/blob/master/examples/xgboost/distributed-training/xgboost-distributed.ipynb)** From 7965b2f742787b9fb2784766b5a26d11c98900b1 Mon Sep 17 00:00:00 2001 From: krishna-kg732 Date: Fri, 13 Mar 2026 07:23:13 +0530 Subject: [PATCH 4/4] added : next steps section Signed-off-by: krishna-kg732 --- .../components/trainer/user-guides/xgboost.md | 20 +++++-------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/content/en/docs/components/trainer/user-guides/xgboost.md b/content/en/docs/components/trainer/user-guides/xgboost.md index 3eca3c456d..eddd751381 100644 --- a/content/en/docs/components/trainer/user-guides/xgboost.md +++ b/content/en/docs/components/trainer/user-guides/xgboost.md @@ -13,9 +13,8 @@ Before exploring this guide, make sure to follow [the Getting Started guide](/docs/components/trainer/getting-started/) to understand the basics of Kubeflow Trainer. 
---- -## XGBoost Distributed Overview +## XGBoost Overview XGBoost supports distributed training through the [Collective](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html) communication protocol (historically known as Rabit). @@ -50,16 +49,7 @@ DMLC_NUM_WORKER = numNodes × workersPerNode - **GPU training**: 1 worker per GPU. The GPU count is derived from `resourcesPerNode` limits in the TrainJob. ---- - -## Further Information - -For comprehensive documentation including complete training examples (Python SDK -and kubectl YAML), best practices (`QuantileDMatrix`, early stopping, -checkpointing, logging), and common issues, see the XGBoost documentation: - -**[Distributed XGBoost on Kubernetes — XGBoost Tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)** - -You can also use the Kubeflow Trainer distributed XGBoost notebook example: - -**[xgboost-distributed.ipynb](https://github.com/kubeflow/trainer/blob/master/examples/xgboost/distributed-training/xgboost-distributed.ipynb)** +## Next Steps +- Check out the [XGBoost example](https://github.com/kubeflow/trainer/blob/master/examples/xgboost/distributed-training/xgboost-distributed.ipynb). +- Learn more about the `TrainerClient()` APIs in the [Kubeflow SDK](https://github.com/kubeflow/sdk/blob/main/kubeflow/trainer/api/trainer_client.py). +- Explore the **[XGBoost documentation](https://xgboost.readthedocs.io/en/latest/tutorials/kubernetes.html)** for advanced configuration options. \ No newline at end of file