feat(runtimes): KEP-2442-jax-runtime #2878
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| FROM ghcr.io/nvidia/jax:jax-2026-01-04 as gpu-base | ||
| ENV DEBIAN_FRONTEND=noninteractive | ||
| | ||
| RUN apt update && apt install -y --no-install-recommends \ | ||
| build-essential \ | ||
| cmake \ | ||
| git \ | ||
| libgoogle-glog-dev \ | ||
| libgflags-dev \ | ||
| libprotobuf-dev \ | ||
| protobuf-compiler \ | ||
| python3-dev pip && rm -f /usr/bin/python && \ | ||
| ln -s /usr/bin/python3 /usr/bin/python && \ | ||
| rm -rf /var/lib/apt/lists/* | ||
| | ||
| RUN pip install --no-cache-dir --upgrade pip | ||
| | ||
| FROM gpu-base as tpu-base | ||
| | ||
| RUN pip install --no-cache-dir "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html || \ | ||
| echo "TPU support not available" && \ | ||
| pip install --no-cache-dir libtpu-nightly || \ | ||
| echo "libtpu-nightly not available" | ||
Comment on lines +20 to +23
| | ||
| FROM tpu-base as gloo-base | ||
| | ||
| RUN git clone https://github.com/facebookincubator/gloo.git \ | ||
| && cd gloo \ | ||
| && git checkout 43b7acbf372cdce14075f3526e39153b7e433b53 \ | ||
| && mkdir build \ | ||
| && cd build \ | ||
| && cmake ../ \ | ||
| && make \ | ||
| && make install | ||
Comment on lines +27 to +34
Member
Why do we need gloo?
Member
Author
I don't know what you mean. Shall I simplify the image to GPU only?
| | ||
| FROM gloo-base as production | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: ClusterTrainingRuntime | ||
| metadata: | ||
|   name: jax-distributed | ||
|   labels: | ||
|     trainer.kubeflow.org/framework: jax | ||
| spec: | ||
|   mlPolicy: | ||
|     numNodes: 1 | ||
|     jax: {} | ||
|   template: | ||
|     spec: | ||
|       replicatedJobs: | ||
|         - name: node | ||
|           template: | ||
|             metadata: | ||
|               labels: | ||
|                 trainer.kubeflow.org/trainjob-ancestor-step: trainer | ||
|             spec: | ||
|               template: | ||
|                 spec: | ||
|                   containers: | ||
|                     - name: node | ||
|                       image: ghcr.io/kubeflow/trainer/jax-runtime | ||
The TPU installation logic (lines 28-31) uses `||` with `echo` statements, which means that if the first pip install fails, the build echoes a message and then tries the second install. If the second install also fails, the build still continues without error, which masks installation failures. Consider using explicit error handling or removing the `|| echo` patterns so that build failures are visible.
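The masking described in this comment can be reproduced outside Docker. The sketch below is only an illustration: `install_jax_tpu` and `install_libtpu` are hypothetical stand-ins for the two `pip install` commands and are assumed to fail, and the `&&`-only chain is one possible stricter form, not necessarily the fix the author will choose.

```shell
#!/bin/sh
# Hypothetical stand-ins for the two pip commands; both are assumed to fail.
install_jax_tpu() { return 1; }
install_libtpu() { return 1; }

# Pattern from the diff. In sh, `||` and `&&` have equal precedence and
# associate left, so this parses as ((A || echo) && B) || echo.
masked_status() {
    install_jax_tpu || echo "TPU support not available" && \
        install_libtpu || echo "libtpu-nightly not available"
}

# Possible stricter form: chain with && only, so the first failure propagates.
strict_status() {
    install_jax_tpu && install_libtpu
}

masked_status >/dev/null; masked_rc=$?
strict_status >/dev/null; strict_rc=$?
echo "masked=$masked_rc strict=$strict_rc"
```

Because of that grouping, the second install is attempted whether or not the first one succeeds, and the list always ends on an `echo` when an install fails, so the `RUN` step reports success either way. With the `&&`-only chain, the non-zero status reaches Docker and the build stops.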