
Commit d4168ef

Merge pull request #116 from raushan2016/cleanup-changes
Add Kubernetes JobSet recipes and documentation for deepseek3-671b and llama3.1-70b MaxText pretraining on Ironwood TPUs
2 parents b005d56 + 0cb5c4b commit d4168ef

8 files changed: +440, -21 lines changed
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
# Pretrain deepseek3-671b workload on Ironwood GKE clusters with Kubernetes JobSet

This recipe outlines the steps for running a deepseek3-671b
[MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on
[Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine)
by applying a Kubernetes manifest to deploy a JobSet resource.

## Workload Details

This workload is configured with the following details:

- Sequence Length: 4096
- Precision: bf16
- Chips: 128 (4x4x8 topology)

## Prerequisites

This recipe assumes the following prerequisites are met:

- **GKE Cluster:** A GKE cluster with [JobSet](https://jobset.sigs.k8s.io/docs/installation/) installed and running.
- **Container Image:** A pre-built container image (such as
  `gcr.io/my-project/my-maxtext-runner:latest`) containing the MaxText
  workload, accessible by the GKE cluster.
- **Tools:** `gcloud`, `kubectl`, `gke-gcloud-auth-plugin`, and `envsubst`
  installed on your workstation (a quick check is shown after this list). If
  `envsubst` is missing, install it with
  `sudo apt-get update && sudo apt-get install -y gettext-base`.
- **Permissions:** You have permission to run `kubectl apply` on the target
  cluster, and the cluster has permission to pull the container image.
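Optionally, you can confirm the listed tools are available on your workstation with a short shell loop (an illustrative sketch, not part of the recipe's scripts):

```bash
# Optional: report any missing prerequisite CLI tools
for tool in gcloud kubectl gke-gcloud-auth-plugin envsubst; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```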

## Orchestration and deployment tools

For this recipe, the following setup is used:

- **Orchestration** -
  [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- **Pretraining job configuration and deployment** - A Kubernetes manifest
  (`k8s_manifest.yaml`) is used to define and deploy the
  [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the MaxText pretraining workload.

## Training dataset

This recipe uses a mock pretraining dataset provided by the MaxText framework.

## Run the recipe

This recipe uses a Kubernetes manifest (`k8s_manifest.yaml`) to deploy the
workload. The following commands set the required environment variables,
substitute them into `k8s_manifest.yaml`, and apply the resulting
configuration to your cluster.

### 1. Configure Environment Variables

Open a terminal and set the following environment variables to match your setup.
**Note:**
- `k8s_manifest.yaml` is in the same directory as this README.
- For `WORKLOAD_IMAGE`, see the [Docker container image](../xpk/README.md#docker-container-image) section.

```bash
# Set variables for your environment
export PROJECT_ID="" # Your GCP project name
export CLUSTER_NAME="" # The name of your GKE cluster
export ZONE="" # The zone of your GKE cluster
export BASE_OUTPUT_DIR="" # e.g., "gs://your-bucket-name/my-base-output-dir"
export WORKLOAD_IMAGE="" # e.g., "gcr.io/my-project/my-maxtext-runner:latest"

# Set the workload name (or modify as needed; make sure it's unique in the cluster)
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
```
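The `printf "%.26s"` truncation keeps the base name short because the JobSet name is reused in child Job and Pod names and in the `jobset.sigs.k8s.io/jobset-name` label, and Kubernetes label values are capped at 63 characters. A minimal, optional check of the generated name before deploying:

```bash
# Optional: inspect the generated workload name and its length
echo "${WORKLOAD_NAME}"
echo "length: ${#WORKLOAD_NAME}"
```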

### 2. Run deepseekv3-671b Pretraining Workload

Once the environment variables are set, run the following commands to fetch
cluster credentials and deploy the JobSet:

```bash
# Fetch cluster credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE} --project ${PROJECT_ID}

# Apply the manifest
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```
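Optionally, you can render and validate the substituted manifest without creating any resources by adding a client-side dry run to the same pipeline:

```bash
# Optional: validate the substituted manifest without creating resources
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default --dry-run=client -f -
```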

## Monitor the job

To monitor your job's progress, you can use kubectl to check the JobSet status
and logs:

```bash
# Check JobSet status
kubectl get jobset -n default ${WORKLOAD_NAME}

# Get the name of the first pod in the JobSet
POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -o jsonpath='{.items[0].metadata.name}')

# Follow the logs of that pod
kubectl logs -f -n default ${POD_NAME}
```
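If the JobSet stays pending or pods are not starting, `kubectl describe` and a full pod listing usually show the reason (for example, pods waiting for TPU capacity in the node pool):

```bash
# Optional: inspect JobSet conditions and recent events
kubectl describe jobset ${WORKLOAD_NAME} -n default

# Optional: list all pods created for this JobSet
kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default
```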

You can also monitor your cluster and TPU usage through the Google Cloud
Console:
`https://console.cloud.google.com/kubernetes/workload/overview?project={PROJECT_ID}`

## Delete resources

### Delete a specific workload

To delete the JobSet created by this recipe, run:

```bash
kubectl delete jobset ${WORKLOAD_NAME} -n default
```

## Check results

After the job completes, you can check the results by:

- Accessing output logs from your job using `kubectl logs`.
- Checking any data stored in the Google Cloud Storage bucket specified by the
  `${BASE_OUTPUT_DIR}` variable you set earlier.
- Reviewing metrics in Cloud Monitoring, if configured.
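For example, you can list what the run wrote under the base output directory. Because this recipe uses a synthetic dataset and sets `enable_checkpointing=False`, expect run logs and metadata rather than model checkpoints; the exact layout depends on the MaxText version:

```bash
# Optional: list the run's output in Cloud Storage (layout may vary by MaxText version)
gcloud storage ls -r "${BASE_OUTPUT_DIR}/${WORKLOAD_NAME}/"
```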

## Next steps: deeper exploration and customization

This recipe is designed to provide a simple, reproducible "0-to-1" experience
for running a MaxText pre-training workload. Its primary purpose is to help you
verify your environment and achieve a first success with TPUs quickly and
reliably.

For deeper exploration, including customizing model configurations, tuning
performance with different XLA flags, and running custom experiments, we
recommend using the benchmark_runner.py script directly from the MaxText
repository. This script offers the full range of MaxText's flexibility and is
the ideal tool for power users and researchers who want to move beyond the
initial benchmark and tailor the workload to their specific needs. To learn
more, see the
[MaxText Benchmark Runner Guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/Getting_Started_Benchmarking.md)
on using benchmark_runner.py for advanced benchmarking.
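For small adjustments, such as running more steps or a different per-device batch size, you can also edit the `python3 -m MaxText.train ...` arguments inside `k8s_manifest.yaml` and re-apply it under a new workload name. The flag names below already appear in the manifest; the replacement values are illustrative only:

```bash
# Illustrative only: change existing flags in the manifest's training command, e.g.
#   steps=30                  -> steps=100
#   per_device_batch_size=8.0 -> per_device_batch_size=4.0
# then pick a fresh ${WORKLOAD_NAME} and redeploy:
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```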
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: ${WORKLOAD_NAME}
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
spec:
  ttlSecondsAfterFinished: 43200
  failurePolicy:
    rules:
    - action: FailJobSet
      onJobFailureReasons:
      - PodFailurePolicy
    maxRestarts: 0
  replicatedJobs:
  - name: slice-job
    replicas: 1
    template:
      spec:
        parallelism: 32 # Equal to the number of VMs per slice (or sub-slice).
        completions: 32 # Same as the above.
        backoffLimit: 0 # When any pod fails, the job is failed
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: jax-tpu
              operator: NotIn
              values: [42,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu7x
              cloud.google.com/gke-tpu-topology: 4x4x8
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - name: jax-tpu
              image: ${WORKLOAD_IMAGE}
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                echo XPK Start: $(date);
                _sigterm() (kill -SIGTERM $! 2>/dev/null;);
                trap _sigterm SIGTERM;
                (export TPU_STDERR_LOG_LEVEL=0 && export TPU_MIN_LOG_LEVEL=0 && export TF_CPP_MIN_LOG_LEVEL=0 && export TPU_VMODULE=real_program_continuator=1 && set -e && export ENABLE_PATHWAYS_PERSISTENCE='1' && export LIBTPU_INIT_ARGS='--xla_tpu_dvfs_p_state=3 --xla_tpu_scoped_vmem_limit_kib=65536 --xla_tpu_bf16_emission_mode=NATIVE_EMISSION --xla_tpu_enable_sparse_core_reduce_scatter_v2=true --xla_tpu_enable_sparse_core_collective_offload_all_gather=true --xla_tpu_enable_sparse_core_collective_offload_2d_all_gather=true --xla_tpu_enable_all_gather_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=True --xla_sc_disable_megacore_partitioning=True --xla_tpu_enable_async_collective_fusion_fuse_all_gather=false --xla_enable_async_all_gather=true --xla_tpu_prefer_async_allgather_to_allreduce=true --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_sparse_core_collective_offload_reduce_scatter=true --xla_tpu_enable_sparse_core_collective_offload_3d_all_gather=true --xla_tpu_use_single_sparse_core_for_all_gather_offload=true --xla_tpu_enable_concurrent_sparse_core_offloading=true --xla_tpu_aggressive_opt_barrier_removal=true --xla_tpu_enable_offloading_gather_to_sparsecore=true --xla_tpu_sparse_core_all_gather_latency_multiplier=1 --xla_tpu_sparse_core_reduce_scatter_latency_multiplier=3 --xla_tpu_enable_sparse_core_collective_aggregator=true --xla_tpu_enable_latency_hiding_layer_scheduler=true --xla_tpu_scheduler_percent_shared_memory_limit=150 --xla_tpu_enable_layer_scheduler_for_dependent_collectives=true --xla_tpu_enable_sparse_core_collective_offload_nd_reduce_scatter=true --xla_tpu_pcie_bandwidth_multiplier=0.03 --xla_tpu_enable_sparse_core_offload_queuing_in_lhs=true --xla_tpu_enable_multi_compute_overlap_in_layer_scheduler=false --xla_tpu_enable_3d_reduce_scatter_decomposer=false' && export JAX_PLATFORMS='tpu,cpu' && export ENABLE_PJRT_COMPATIBILITY='true' && python3 -m MaxText.train MaxText/configs/base.yml model_name=deepseek3-671b per_device_batch_size=8.0 max_target_length=4096 dcn_pipeline_parallelism=1 dcn_data_parallelism=-1 ici_pipeline_parallelism=1 ici_fsdp_transpose_parallelism=1 ici_fsdp_parallelism=-1 allow_split_physical_axes=True use_iota_embed=True remat_policy=custom decoder_layer_input=offload opt_type=adamw mu_dtype=bfloat16 grad_dtype=bfloat16 megablox=True sparse_matmul=True use_custom_sort_vjp=True fsdp_shard_on_exp=True sa_use_fused_bwd_kernel=True sa_block_q=2048 sa_block_kv=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_kv_dq=2048 sa_block_q_dq=2048 attention=flash use_tokamax_splash=True use_max_logit_estimate=-1 cost_estimate_flops_fwd=5000000000000 cost_estimate_flops_bwd=5000000000000 float32_weight_sum=False tile_batch_seq=512 tile_embed_dim=1024 tile_mlp_dim=2048 use_tokamax_gmm=True tokenizer_path=assets/tokenizer.mistral-v3 dataset_type=synthetic dataset_path=gs://max-datasets-rogue enable_checkpointing=False steps=30 base_output_directory=${BASE_OUTPUT_DIR} run_name=${WORKLOAD_NAME}) & PID=$!;
                while kill -0 $PID 2>/dev/null;
                do sleep 5;
                done;
                wait $PID;
                EXIT_CODE=$?;
                echo XPK End: $(date);
                echo EXIT_CODE=$EXIT_CODE;
                exit $EXIT_CODE
              resources:
                limits:
                  google.com/tpu: 4
              volumeMounts:
              - mountPath: /dev/shm
                name: dshm-2
            tolerations:
            - operator: "Exists"
              key: google.com/tpu

            volumes:
            - emptyDir:
                medium: Memory
              name: dshm-2
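The container command above wraps the training invocation in a small shell harness: the `export ... && python3 -m MaxText.train ...` pipeline runs in a background subshell, SIGTERM from pod termination is forwarded to it, the script polls until the child exits, and the child's exit code is propagated so the `podFailurePolicy` above can act on it. A stripped-down sketch of the same pattern, with a placeholder in place of the real training command:

```bash
#!/usr/bin/env bash
# Minimal sketch of the wrapper pattern used in the manifest (placeholder workload).
_sigterm() (kill -SIGTERM $! 2>/dev/null;)   # forward SIGTERM to the background child
trap _sigterm SIGTERM

(sleep 30; exit 0) & PID=$!                  # placeholder for the real training command
while kill -0 "$PID" 2>/dev/null; do         # poll while the child is still running
  sleep 5
done
wait "$PID"; EXIT_CODE=$?                    # collect the child's exit status
echo "EXIT_CODE=$EXIT_CODE"
exit "$EXIT_CODE"
```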

training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/README.md renamed to training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md

Lines changed: 2 additions & 2 deletions
@@ -78,7 +78,7 @@ Install XPK and necessary tools:
 # Ensure to log in to your gcloud
 
 # Install latest xpk
-pip install xpk==0.14.3
+pip install xpk==0.16.0
 
 # Install xpk pre-reqs kubectl-kueue and kjob (if you installed xpk via pip)
 
@@ -197,7 +197,7 @@ The following software versions are used:
 - Jax version: 0.8.1
 - Maxtext version: maxtext-tutorial-v1.3.0
 - Python 3.11
-- XPK 0.14.3
+- XPK 0.16.0
 
 Docker Image Building Command:
 
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/run_recipe.sh renamed to training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/run_recipe.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ source "${UV_VENV_PATH}/bin/activate"
 # Check if xpk is installed in the venv
 if ! pip show xpk &> /dev/null; then
   echo "xpk not found in the virtual environment. Please install it by running:"
-  echo "pip install xpk==0.14.3"
+  echo "pip install xpk==0.16.0"
   exit 1
 fi
 # --- End Environment Setup ---
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Pretrain llama3.1-70b workload on Ironwood GKE clusters with Kubernetes JobSet

This recipe outlines the steps for running a llama3.1-70b
[MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on
[Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine)
by applying a Kubernetes manifest to deploy a JobSet resource.

## Workload Details

This workload is configured with the following details:

- Sequence Length: 8192
- Precision: bf16
- Chips: 64 (4x4x4 topology)

## Prerequisites

This recipe assumes the following prerequisites are met:

- **GKE Cluster:** A GKE cluster with [JobSet](https://jobset.sigs.k8s.io/docs/installation/) installed and running.
- **Container Image:** A pre-built container image (such as
  `gcr.io/my-project/my-maxtext-runner:latest`) containing the MaxText
  workload, accessible by the GKE cluster.
- **Tools:** `gcloud`, `kubectl`, `gke-gcloud-auth-plugin`, and `envsubst`
  installed on your workstation. If `envsubst` is missing, install it with
  `sudo apt-get update && sudo apt-get install -y gettext-base`.
- **Permissions:** You have permission to run `kubectl apply` on the target
  cluster, and the cluster has permission to pull the container image.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- **Orchestration** -
  [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- **Pretraining job configuration and deployment** - A Kubernetes manifest
  (`k8s_manifest.yaml`) is used to define and deploy the
  [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the MaxText pretraining workload.

## Training dataset

This recipe uses a mock pretraining dataset provided by the MaxText framework.

## Run the recipe

This recipe uses a Kubernetes manifest (`k8s_manifest.yaml`) to deploy the
workload. The following commands set the required environment variables,
substitute them into `k8s_manifest.yaml`, and apply the resulting
configuration to your cluster.

### 1. Configure Environment Variables

Open a terminal and set the following environment variables to match your setup.
**Note:**
- `k8s_manifest.yaml` is in the same directory as this README.
- For `WORKLOAD_IMAGE`, see the [Docker container image](../xpk/README.md#docker-container-image) section.

```bash
# Set variables for your environment
export PROJECT_ID="" # Your GCP project name
export CLUSTER_NAME="" # The name of your GKE cluster
export ZONE="" # The zone of your GKE cluster
export BASE_OUTPUT_DIR="" # e.g., "gs://your-bucket-name/my-base-output-dir"
export WORKLOAD_IMAGE="" # e.g., "gcr.io/my-project/my-maxtext-runner:latest"

# Set the workload name (or modify as needed; make sure it's unique in the cluster)
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-llama3-1-70b-8192-4x4x4")-$(date +%Y%m%d-%H%M)"
```

### 2. Run llama3.1-70b Pretraining Workload

Once the environment variables are set, run the following commands to fetch
cluster credentials and deploy the JobSet:

```bash
# Fetch cluster credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE} --project ${PROJECT_ID}

# Apply the manifest
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```
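Optionally, you can watch the pods for this JobSet until they are all scheduled and running:

```bash
# Optional: watch the JobSet's pods as they are scheduled onto the TPU node pool
kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -w
```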

## Monitor the job

To monitor your job's progress, you can use kubectl to check the JobSet status
and logs:

```bash
# Check JobSet status
kubectl get jobset -n default ${WORKLOAD_NAME}

# Get the name of the first pod in the JobSet
POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -o jsonpath='{.items[0].metadata.name}')

# Follow the logs of that pod
kubectl logs -f -n default ${POD_NAME}
```

You can also monitor your cluster and TPU usage through the Google Cloud
Console:
`https://console.cloud.google.com/kubernetes/workload/overview?project={PROJECT_ID}`

## Delete resources

### Delete a specific workload

To delete the JobSet created by this recipe, run:

```bash
kubectl delete jobset ${WORKLOAD_NAME} -n default
```

## Check results

After the job completes, you can check the results by:

- Accessing output logs from your job using `kubectl logs`.
- Checking any data stored in the Google Cloud Storage bucket specified by the
  `${BASE_OUTPUT_DIR}` variable you set earlier.
- Reviewing metrics in Cloud Monitoring, if configured.
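For example, per-step throughput and loss can be pulled from the pod log by filtering for MaxText's step summary lines; the exact wording of these lines may vary between MaxText versions:

```bash
# Optional: show per-step metrics from the first pod's log (log format may vary)
kubectl logs -n default ${POD_NAME} | grep -i "completed step"
```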

## Next steps: deeper exploration and customization

This recipe is designed to provide a simple, reproducible "0-to-1" experience
for running a MaxText pre-training workload. Its primary purpose is to help you
verify your environment and achieve a first success with TPUs quickly and
reliably.

For deeper exploration, including customizing model configurations, tuning
performance with different XLA flags, and running custom experiments, we
recommend using the benchmark_runner.py script directly from the MaxText
repository. This script offers the full range of MaxText's flexibility and is
the ideal tool for power users and researchers who want to move beyond the
initial benchmark and tailor the workload to their specific needs. To learn
more, see the
[MaxText Benchmark Runner Guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/Getting_Started_Benchmarking.md)
on using benchmark_runner.py for advanced benchmarking.
