trainer: add JAX trainer guide for TPU #4343

Merged

google-oss-prow[bot] merged 2 commits into kubeflow:master from siyuanfoundation:tpu

Mar 27, 2026

Contributor

siyuanfoundation commented Mar 16, 2026

Description of Changes

This PR adds a JAX user guide describing how to run distributed JAX
training jobs with Kubeflow Trainer on TPUs.

Related Issues

Related: kubeflow/trainer#3183

Checklist

You have signed off your commits
Ensure you follow best practices from our contributing guide.
(for big changes) I will post screenshots of the changes in a PR comment

google-oss-prow bot added the needs-ok-to-test label

google-oss-prow bot commented Mar 16, 2026

Hi @siyuanfoundation. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow bot added size/L area/trainer labels

google-oss-prow bot requested review from ChanYiLin and gaocegege

March 16, 2026 20:48

github-actions bot commented Mar 16, 2026

🚫 This command cannot be processed. Only organization members or owners can use the commands.


[trainer] add Jax trainer guide for TPU

7e03585

Signed-off-by: siyuanfoundation <sizhang@google.com>

siyuanfoundation force-pushed the tpu branch from e926df2 to 7e03585 Compare

March 16, 2026 20:58

Contributor Author

siyuanfoundation commented Mar 16, 2026

/cc @andreyvelich

google-oss-prow bot requested a review from andreyvelich

March 16, 2026 20:59

andreyvelich reviewed

Member

andreyvelich left a comment

Looks great, overall lgtm, left a few comments.
Thank you for this @siyuanfoundation!
/assign @akshaychitneni @kubeflow/kubeflow-trainer-team

content/en/docs/components/trainer/user-guides/jax-tpu.md Outdated

content/en/docs/components/trainer/user-guides/jax-tpu.md

+## JAX on TPU Overview
+JAX on TPU requires a different runtime environment than GPU. Specifically:
+- **Image**: You must use a JAX image compatible with TPUs (e.g., `us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu`).

Member

andreyvelich Mar 16, 2026

Maybe you can add an example of what a ClusterTrainingRuntime might look like?
Do you know whether users want to set node selectors per job, or is this something cluster admins can configure when they create a reusable ClusterTrainingRuntime?

As @kaisoz mentioned in this PR, our default ClusterTrainingRuntime's image doesn't support TPUs: kubeflow/trainer#3151 (comment)
cc @kubeflow/kubeflow-trainer-team

content/en/docs/components/trainer/user-guides/jax-tpu.md

+JAX on TPU requires a different runtime environment than GPU. Specifically:
+- **Image**: You must use a JAX image compatible with TPUs (e.g., `us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu`).
+- **Resources**: You must request `google.com/tpu` resources.
+- **Node Selectors**: You must specify GKE-specific node selectors and topology for TPU nodes.
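Editorial sketch (not part of the PR): the three requirements quoted above can be gathered into one plain Python dict. This is a minimal illustration under stated assumptions, not the Kubeflow Trainer SDK API. The function name `tpu_pod_settings`, the output key `num_nodes`, and the accelerator, topology, and chips-per-node values are hypothetical; the two node-selector keys are assumed to be the GKE TPU node labels. The node count assumes the slice's chip total is the product of the topology dimensions.

```python
from math import prod


def tpu_pod_settings(image, accelerator, topology, chips_per_node):
    """Collect the image, TPU resource request, and node selectors in one dict.

    Hypothetical helper for illustration only; it does not call any
    Kubeflow or Kubernetes API.
    """
    # Assumption: total chips in a slice = product of the topology dimensions.
    chips_total = prod(int(dim) for dim in topology.split("x"))
    if chips_total % chips_per_node != 0:
        raise ValueError("topology is not divisible by chips_per_node")
    return {
        # Image: must be a TPU-compatible JAX image.
        "image": image,
        # Resources: each node requests its TPU chips via google.com/tpu.
        "resources": {"limits": {"google.com/tpu": chips_per_node}},
        # Node selectors: assumed GKE label keys for TPU type and topology.
        "node_selector": {
            "cloud.google.com/gke-tpu-accelerator": accelerator,
            "cloud.google.com/gke-tpu-topology": topology,
        },
        # Hypothetical convenience field: nodes needed to cover the slice.
        "num_nodes": chips_total // chips_per_node,
    }


settings = tpu_pod_settings(
    image="us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu",
    accelerator="tpu-v5-lite-podslice",  # illustrative value
    topology="2x4",                      # illustrative value
    chips_per_node=4,                    # illustrative value
)
print(settings["num_nodes"])
```

Whether these values belong in each job or in a reusable ClusterTrainingRuntime is the question raised in the review comment that follows.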

Member

andreyvelich Mar 16, 2026

I know that JobSet also supports Exclusive Topology for TPU workload placement:

alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool

Is there any interest from the GKE team in showcasing how this can be used with TrainJob too?

Additionally, TPU multi-slice examples: kubernetes-sigs/jobset#1168

cc @GiuseppeTT @imreddy13
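Editorial sketch (not part of the PR): the annotation mentioned above is a key-value pair on the JobSet metadata. The snippet below only shows how that pair could be merged into a metadata dict. The helper name `with_exclusive_topology` and the object name `jax-tpu-example` are hypothetical, and whether a TrainJob passes this annotation through to its underlying JobSet is exactly the open question in this comment, so nothing here should be read as confirmed behavior.

```python
EXCLUSIVE_TOPOLOGY_KEY = "alpha.jobset.sigs.k8s.io/exclusive-topology"


def with_exclusive_topology(metadata, topology_key="cloud.google.com/gke-nodepool"):
    """Return a copy of `metadata` with the exclusive-topology annotation set.

    Hypothetical helper for illustration; it only manipulates a dict and
    leaves the input untouched.
    """
    annotations = dict(metadata.get("annotations", {}))
    annotations[EXCLUSIVE_TOPOLOGY_KEY] = topology_key
    return {**metadata, "annotations": annotations}


meta = with_exclusive_topology({"name": "jax-tpu-example"})
print(meta["annotations"])
```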

Contributor Author

siyuanfoundation Mar 27, 2026

Multi-slice support will depend on kubeflow/trainer#2318.

content/en/docs/components/trainer/user-guides/jax-tpu.md Outdated


### Node Selectors and Topology

When running on GKE, TPUs are often managed via [Compute Classes](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus-compute-class). You must match the `node_selector` to your TPU node pool labels:

Member

andreyvelich Mar 16, 2026

This URL doesn't work.

Contributor Author

siyuanfoundation Mar 27, 2026

fixed.

content/en/docs/components/trainer/user-guides/jax-tpu.md

Comment on lines +18 to +19

apiVersion: cloud.google.com/v1
kind: ComputeClass

Member

andreyvelich Mar 16, 2026

Does it require the DRA driver to be installed? Should we mention this?

Contributor Author

siyuanfoundation Mar 27, 2026

No, it does not.

content/en/docs/components/trainer/user-guides/jax.md Outdated

Member

andreyvelich commented Mar 16, 2026

/ok-to-test

google-oss-prow bot added ok-to-test and removed needs-ok-to-test labels

github-actions bot commented Mar 16, 2026

Approvals successfully granted for pending runs.

siyuanfoundation changed the title ~~[trainer] add Jax trainer guide for TPU~~ trainer : add Jax trainer guide for TPU

siyuanfoundation force-pushed the tpu branch from 318df12 to a50f7e1 Compare

March 27, 2026 19:41


Update content/en/docs/components/trainer/user-guides/jax.md

51732bd

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Siyuan Zhang <10984162+siyuanfoundation@users.noreply.github.com>

siyuanfoundation force-pushed the tpu branch from a50f7e1 to 51732bd Compare

March 27, 2026 19:50

andreyvelich changed the title ~~trainer : add Jax trainer guide for TPU~~ trainer: add JAX trainer guide for TPU

andreyvelich reviewed

Member

andreyvelich left a comment

I think we should be in good shape to merge this.
We can address this in a follow-up if needed: #4343 (comment)
Thanks for this work @siyuanfoundation!
/lgtm
/approve

google-oss-prow bot assigned andreyvelich

google-oss-prow bot added the lgtm label

google-oss-prow bot commented Mar 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~content/en/docs/components/trainer/OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added the approved label

google-oss-prow bot merged commit 682bb9c into kubeflow:master

7 of 8 checks passed

Labels

approved area/trainer lgtm ok-to-test size/L