
Commit d4168ef

Merge pull request #116 from raushan2016/cleanup-changes
Add Kubernetes JobSet recipes and documentation for deepseek3-671b and llama3.1-70b MaxText pretraining on Ironwood TPUs
2 parents b005d56 + 0cb5c4b commit d4168ef

8 files changed: +440, -21 lines changed
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
# Pretrain deepseek3-671b workload on Ironwood GKE clusters with Kubernetes JobSet

This recipe outlines the steps for running a deepseek3-671b
[MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on
[Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine)
by applying a Kubernetes manifest to deploy a JobSet resource.

## Workload Details

This workload is configured with the following details:

- Sequence Length: 4096
- Precision: bf16
- Chips: 128 (4x4x8 topology)

## Prerequisites

This recipe assumes the following prerequisites are met:

- **GKE Cluster:** A GKE cluster with [JobSet](https://jobset.sigs.k8s.io/docs/installation/) installed and running.
- **Container Image:** A pre-built container image (such as
  `gcr.io/my-project/my-maxtext-runner:latest`) containing the MaxText
  workload, accessible by the GKE cluster.
- **Tools:** `gcloud`, `kubectl`, `gke-gcloud-auth-plugin`, and `envsubst`
  installed on your workstation (a quick check is shown after this list). If
  `envsubst` is missing, install it with
  `sudo apt-get update && sudo apt-get install -y gettext-base`.
- **Permissions:** You have permission to run `kubectl apply` on the target
  cluster, and the cluster has permission to pull the container image.
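Optionally, you can confirm the listed tools are available on your workstation with a short shell loop (an illustrative sketch, not part of the recipe's scripts):

```bash
# Optional: report any missing prerequisite CLI tools
for tool in gcloud kubectl gke-gcloud-auth-plugin envsubst; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done
```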

## Orchestration and deployment tools

For this recipe, the following setup is used:

- **Orchestration** -
  [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- **Pretraining job configuration and deployment** - A Kubernetes manifest
  (`k8s_manifest.yaml`) is used to define and deploy the
  [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the MaxText pretraining workload.

## Training dataset

This recipe uses a mock pretraining dataset provided by the MaxText framework.

## Run the recipe

This recipe uses a Kubernetes manifest (`k8s_manifest.yaml`) to deploy the
workload. The following commands set the required environment variables,
substitute them into `k8s_manifest.yaml`, and apply the resulting
configuration to your cluster.

### 1. Configure Environment Variables

Open a terminal and set the following environment variables to match your setup.
**Note:**
- `k8s_manifest.yaml` is in the same directory as this README.
- For `WORKLOAD_IMAGE`, see the [Docker container image](../xpk/README.md#docker-container-image) section.

```bash
# Set variables for your environment
export PROJECT_ID="" # Your GCP project name
export CLUSTER_NAME="" # The name of your GKE cluster
export ZONE="" # The zone of your GKE cluster
export BASE_OUTPUT_DIR="" # e.g., "gs://your-bucket-name/my-base-output-dir"
export WORKLOAD_IMAGE="" # e.g., "gcr.io/my-project/my-maxtext-runner:latest"

# Set the workload name (or modify as needed; make sure it's unique in the cluster)
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-deepseekv3-671b-4096-fsdp")-$(date +%Y%m%d-%H%M)"
```
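The `printf "%.26s"` truncation keeps the base name short because the JobSet name is reused in child Job and Pod names and in the `jobset.sigs.k8s.io/jobset-name` label, and Kubernetes label values are capped at 63 characters. A minimal, optional check of the generated name before deploying:

```bash
# Optional: inspect the generated workload name and its length
echo "${WORKLOAD_NAME}"
echo "length: ${#WORKLOAD_NAME}"
```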

### 2. Run deepseekv3-671b Pretraining Workload

Once the environment variables are set, run the following commands to fetch
cluster credentials and deploy the JobSet:

```bash
# Fetch cluster credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE} --project ${PROJECT_ID}

# Apply the manifest
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```
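Optionally, you can render and validate the substituted manifest without creating any resources by adding a client-side dry run to the same pipeline:

```bash
# Optional: validate the substituted manifest without creating resources
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default --dry-run=client -f -
```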

## Monitor the job

To monitor your job's progress, you can use kubectl to check the JobSet status
and logs:

```bash
# Check JobSet status
kubectl get jobset -n default ${WORKLOAD_NAME}

# Get the name of the first pod in the JobSet
POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -o jsonpath='{.items[0].metadata.name}')

# Follow the logs of that pod
kubectl logs -f -n default ${POD_NAME}
```
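If the JobSet stays pending or pods are not starting, `kubectl describe` and a full pod listing usually show the reason (for example, pods waiting for TPU capacity in the node pool):

```bash
# Optional: inspect JobSet conditions and recent events
kubectl describe jobset ${WORKLOAD_NAME} -n default

# Optional: list all pods created for this JobSet
kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default
```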

You can also monitor your cluster and TPU usage through the Google Cloud
Console:
`https://console.cloud.google.com/kubernetes/workload/overview?project={PROJECT_ID}`

## Delete resources

### Delete a specific workload

To delete the JobSet created by this recipe, run:

```bash
kubectl delete jobset ${WORKLOAD_NAME} -n default
```

## Check results

After the job completes, you can check the results by:

- Accessing output logs from your job using `kubectl logs`.
- Checking any data stored in the Google Cloud Storage bucket specified by the
  `${BASE_OUTPUT_DIR}` variable you set earlier.
- Reviewing metrics in Cloud Monitoring, if configured.
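For example, you can list what the run wrote under the base output directory. Because this recipe uses a synthetic dataset and sets `enable_checkpointing=False`, expect run logs and metadata rather than model checkpoints; the exact layout depends on the MaxText version:

```bash
# Optional: list the run's output in Cloud Storage (layout may vary by MaxText version)
gcloud storage ls -r "${BASE_OUTPUT_DIR}/${WORKLOAD_NAME}/"
```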

## Next steps: deeper exploration and customization

This recipe is designed to provide a simple, reproducible "0-to-1" experience
for running a MaxText pre-training workload. Its primary purpose is to help you
verify your environment and achieve a first success with TPUs quickly and
reliably.

For deeper exploration, including customizing model configurations, tuning
performance with different XLA flags, and running custom experiments, we
recommend using the benchmark_runner.py script directly from the MaxText
repository. This script offers the full range of MaxText's flexibility and is
the ideal tool for power users and researchers who want to move beyond the
initial benchmark and tailor the workload to their specific needs. To learn
more, see the
[MaxText Benchmark Runner Guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/Getting_Started_Benchmarking.md)
on using benchmark_runner.py for advanced benchmarking.
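For small adjustments, such as running more steps or a different per-device batch size, you can also edit the `python3 -m MaxText.train ...` arguments inside `k8s_manifest.yaml` and re-apply it under a new workload name. The flag names below already appear in the manifest; the replacement values are illustrative only:

```bash
# Illustrative only: change existing flags in the manifest's training command, e.g.
#   steps=30                  -> steps=100
#   per_device_batch_size=8.0 -> per_device_batch_size=4.0
# then pick a fresh ${WORKLOAD_NAME} and redeploy:
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```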
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: ${WORKLOAD_NAME}
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
spec:
  ttlSecondsAfterFinished: 43200
  failurePolicy:
    rules:
    - action: FailJobSet
      onJobFailureReasons:
      - PodFailurePolicy
    maxRestarts: 0
  replicatedJobs:
  - name: slice-job
    replicas: 1
    template:
      spec:
        parallelism: 32 # Equal to the number of VMs per slice (or sub-slice).
        completions: 32 # Same as the above.
        backoffLimit: 0 # When any pod fails, the job is failed
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: jax-tpu
              operator: NotIn
              values: [42,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu7x
              cloud.google.com/gke-tpu-topology: 4x4x8
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - name: jax-tpu
              image: ${WORKLOAD_IMAGE}
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                echo XPK Start: $(date);
                _sigterm() (kill -SIGTERM $! 2>/dev/null;);
                trap _sigterm SIGTERM;
                (export TPU_STDERR_LOG_LEVEL=0 && export TPU_MIN_LOG_LEVEL=0 && export TF_CPP_MIN_LOG_LEVEL=0 && export TPU_VMODULE=real_program_continuator=1 && set -e && export ENABLE_PATHWAYS_PERSISTENCE='1' && export LIBTPU_INIT_ARGS='--xla_tpu_dvfs_p_state=3 --xla_tpu_scoped_vmem_limit_kib=65536 --xla_tpu_bf16_emission_mode=NATIVE_EMISSION --xla_tpu_enable_sparse_core_reduce_scatter_v2=true --xla_tpu_enable_sparse_core_collective_offload_all_gather=true --xla_tpu_enable_sparse_core_collective_offload_2d_all_gather=true --xla_tpu_enable_all_gather_offload_tracing=true --xla_tpu_use_tc_device_shape_on_sc=True --xla_sc_disable_megacore_partitioning=True --xla_tpu_enable_async_collective_fusion_fuse_all_gather=false --xla_enable_async_all_gather=true --xla_tpu_prefer_async_allgather_to_allreduce=true --xla_tpu_enable_sparse_core_collective_offload_all_reduce=true --xla_tpu_enable_sparse_core_collective_offload_reduce_scatter=true --xla_tpu_enable_sparse_core_collective_offload_3d_all_gather=true --xla_tpu_use_single_sparse_core_for_all_gather_offload=true --xla_tpu_enable_concurrent_sparse_core_offloading=true --xla_tpu_aggressive_opt_barrier_removal=true --xla_tpu_enable_offloading_gather_to_sparsecore=true --xla_tpu_sparse_core_all_gather_latency_multiplier=1 --xla_tpu_sparse_core_reduce_scatter_latency_multiplier=3 --xla_tpu_enable_sparse_core_collective_aggregator=true --xla_tpu_enable_latency_hiding_layer_scheduler=true --xla_tpu_scheduler_percent_shared_memory_limit=150 --xla_tpu_enable_layer_scheduler_for_dependent_collectives=true --xla_tpu_enable_sparse_core_collective_offload_nd_reduce_scatter=true --xla_tpu_pcie_bandwidth_multiplier=0.03 --xla_tpu_enable_sparse_core_offload_queuing_in_lhs=true --xla_tpu_enable_multi_compute_overlap_in_layer_scheduler=false --xla_tpu_enable_3d_reduce_scatter_decomposer=false' && export JAX_PLATFORMS='tpu,cpu' && export ENABLE_PJRT_COMPATIBILITY='true' && python3 -m MaxText.train MaxText/configs/base.yml model_name=deepseek3-671b per_device_batch_size=8.0 max_target_length=4096 dcn_pipeline_parallelism=1 dcn_data_parallelism=-1 ici_pipeline_parallelism=1 ici_fsdp_transpose_parallelism=1 ici_fsdp_parallelism=-1 allow_split_physical_axes=True use_iota_embed=True remat_policy=custom decoder_layer_input=offload opt_type=adamw mu_dtype=bfloat16 grad_dtype=bfloat16 megablox=True sparse_matmul=True use_custom_sort_vjp=True fsdp_shard_on_exp=True sa_use_fused_bwd_kernel=True sa_block_q=2048 sa_block_kv=2048 sa_block_q_dkv=2048 sa_block_kv_dkv=2048 sa_block_kv_dkv_compute=2048 sa_block_kv_dq=2048 sa_block_q_dq=2048 attention=flash use_tokamax_splash=True use_max_logit_estimate=-1 cost_estimate_flops_fwd=5000000000000 cost_estimate_flops_bwd=5000000000000 float32_weight_sum=False tile_batch_seq=512 tile_embed_dim=1024 tile_mlp_dim=2048 use_tokamax_gmm=True tokenizer_path=assets/tokenizer.mistral-v3 dataset_type=synthetic dataset_path=gs://max-datasets-rogue enable_checkpointing=False steps=30 base_output_directory=${BASE_OUTPUT_DIR} run_name=${WORKLOAD_NAME}) & PID=$!;
                while kill -0 $PID 2>/dev/null;
                do sleep 5;
                done;
                wait $PID;
                EXIT_CODE=$?;
                echo XPK End: $(date);
                echo EXIT_CODE=$EXIT_CODE;
                exit $EXIT_CODE
              resources:
                limits:
                  google.com/tpu: 4
              volumeMounts:
              - mountPath: /dev/shm
                name: dshm-2
            tolerations:
            - operator: "Exists"
              key: google.com/tpu

            volumes:
            - emptyDir:
                medium: Memory
              name: dshm-2
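The container command above wraps the training invocation in a small shell harness: the `export ... && python3 -m MaxText.train ...` pipeline runs in a background subshell, SIGTERM from pod termination is forwarded to it, the script polls until the child exits, and the child's exit code is propagated so the `podFailurePolicy` above can act on it. A stripped-down sketch of the same pattern, with a placeholder in place of the real training command:

```bash
#!/usr/bin/env bash
# Minimal sketch of the wrapper pattern used in the manifest (placeholder workload).
_sigterm() (kill -SIGTERM $! 2>/dev/null;)   # forward SIGTERM to the background child
trap _sigterm SIGTERM

(sleep 30; exit 0) & PID=$!                  # placeholder for the real training command
while kill -0 "$PID" 2>/dev/null; do         # poll while the child is still running
  sleep 5
done
wait "$PID"; EXIT_CODE=$?                    # collect the child's exit status
echo "EXIT_CODE=$EXIT_CODE"
exit "$EXIT_CODE"
```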

training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/README.md renamed to training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/README.md

Lines changed: 2 additions & 2 deletions
@@ -78,7 +78,7 @@ Install XPK and necessary tools:
 # Ensure to log in to your gcloud
 
 # Install latest xpk
-pip install xpk==0.14.3
+pip install xpk==0.16.0
 
 # Install xpk pre-reqs kubectl-kueue and kjob (if you installed xpk via pip)
 
@@ -197,7 +197,7 @@ The following software versions are used:
 - Jax version: 0.8.1
 - Maxtext version: maxtext-tutorial-v1.3.0
 - Python 3.11
-- XPK 0.14.3
+- XPK 0.16.0
 
 Docker Image Building Command:
 
training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/run_recipe.sh renamed to training/ironwood/deepseek3-671b/4k-bf16-tpu7x-4x4x8/xpk/run_recipe.sh

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ source "${UV_VENV_PATH}/bin/activate"
 # Check if xpk is installed in the venv
 if ! pip show xpk &> /dev/null; then
   echo "xpk not found in the virtual environment. Please install it by running:"
-  echo "pip install xpk==0.14.3"
+  echo "pip install xpk==0.16.0"
   exit 1
 fi
 # --- End Environment Setup ---
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Pretrain llama3.1-70b workload on Ironwood GKE clusters with Kubernetes JobSet

This recipe outlines the steps for running a llama3.1-70b
[MaxText](https://github.com/AI-Hypercomputer/maxtext) pretraining workload on
[Ironwood GKE clusters](https://cloud.google.com/kubernetes-engine)
by applying a Kubernetes manifest to deploy a JobSet resource.

## Workload Details

This workload is configured with the following details:

- Sequence Length: 8192
- Precision: bf16
- Chips: 64 (4x4x4 topology)

## Prerequisites

This recipe assumes the following prerequisites are met:

- **GKE Cluster:** A GKE cluster with [JobSet](https://jobset.sigs.k8s.io/docs/installation/) installed and running.
- **Container Image:** A pre-built container image (such as
  `gcr.io/my-project/my-maxtext-runner:latest`) containing the MaxText
  workload, accessible by the GKE cluster.
- **Tools:** `gcloud`, `kubectl`, `gke-gcloud-auth-plugin`, and `envsubst`
  installed on your workstation. If `envsubst` is missing, install it with
  `sudo apt-get update && sudo apt-get install -y gettext-base`.
- **Permissions:** You have permission to run `kubectl apply` on the target
  cluster, and the cluster has permission to pull the container image.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- **Orchestration** -
  [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- **Pretraining job configuration and deployment** - A Kubernetes manifest
  (`k8s_manifest.yaml`) is used to define and deploy the
  [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the MaxText pretraining workload.

## Training dataset

This recipe uses a mock pretraining dataset provided by the MaxText framework.

## Run the recipe

This recipe uses a Kubernetes manifest (`k8s_manifest.yaml`) to deploy the
workload. The following commands set the required environment variables,
substitute them into `k8s_manifest.yaml`, and apply the resulting
configuration to your cluster.

### 1. Configure Environment Variables

Open a terminal and set the following environment variables to match your setup.
**Note:**
- `k8s_manifest.yaml` is in the same directory as this README.
- For `WORKLOAD_IMAGE`, see the [Docker container image](../xpk/README.md#docker-container-image) section.

```bash
# Set variables for your environment
export PROJECT_ID="" # Your GCP project name
export CLUSTER_NAME="" # The name of your GKE cluster
export ZONE="" # The zone of your GKE cluster
export BASE_OUTPUT_DIR="" # e.g., "gs://your-bucket-name/my-base-output-dir"
export WORKLOAD_IMAGE="" # e.g., "gcr.io/my-project/my-maxtext-runner:latest"

# Set the workload name (or modify as needed; make sure it's unique in the cluster)
export WORKLOAD_NAME="$(printf "%.26s" "${USER//_/-}-llama3-1-70b-8192-4x4x4")-$(date +%Y%m%d-%H%M)"
```

### 2. Run llama3.1-70b Pretraining Workload

Once the environment variables are set, run the following commands to fetch
cluster credentials and deploy the JobSet:

```bash
# Fetch cluster credentials
gcloud container clusters get-credentials ${CLUSTER_NAME} --zone ${ZONE} --project ${PROJECT_ID}

# Apply the manifest
envsubst '${BASE_OUTPUT_DIR} ${WORKLOAD_NAME} ${WORKLOAD_IMAGE}' < k8s_manifest.yaml | kubectl apply -n default -f -
```
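Optionally, you can watch the pods for this JobSet until they are all scheduled and running:

```bash
# Optional: watch the JobSet's pods as they are scheduled onto the TPU node pool
kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -w
```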

## Monitor the job

To monitor your job's progress, you can use kubectl to check the JobSet status
and logs:

```bash
# Check JobSet status
kubectl get jobset -n default ${WORKLOAD_NAME}

# Get the name of the first pod in the JobSet
POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=${WORKLOAD_NAME} -n default -o jsonpath='{.items[0].metadata.name}')

# Follow the logs of that pod
kubectl logs -f -n default ${POD_NAME}
```

You can also monitor your cluster and TPU usage through the Google Cloud
Console:
`https://console.cloud.google.com/kubernetes/workload/overview?project={PROJECT_ID}`

## Delete resources

### Delete a specific workload

To delete the JobSet created by this recipe, run:

```bash
kubectl delete jobset ${WORKLOAD_NAME} -n default
```

## Check results

After the job completes, you can check the results by:

- Accessing output logs from your job using `kubectl logs`.
- Checking any data stored in the Google Cloud Storage bucket specified by the
  `${BASE_OUTPUT_DIR}` variable you set earlier.
- Reviewing metrics in Cloud Monitoring, if configured.
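For example, per-step throughput and loss can be pulled from the pod log by filtering for MaxText's step summary lines; the exact wording of these lines may vary between MaxText versions:

```bash
# Optional: show per-step metrics from the first pod's log (log format may vary)
kubectl logs -n default ${POD_NAME} | grep -i "completed step"
```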

## Next steps: deeper exploration and customization

This recipe is designed to provide a simple, reproducible "0-to-1" experience
for running a MaxText pre-training workload. Its primary purpose is to help you
verify your environment and achieve a first success with TPUs quickly and
reliably.

For deeper exploration, including customizing model configurations, tuning
performance with different XLA flags, and running custom experiments, we
recommend using the benchmark_runner.py script directly from the MaxText
repository. This script offers the full range of MaxText's flexibility and is
the ideal tool for power users and researchers who want to move beyond the
initial benchmark and tailor the workload to their specific needs. To learn
more, see the
[MaxText Benchmark Runner Guide](https://github.com/AI-Hypercomputer/maxtext/blob/main/benchmarks/Getting_Started_Benchmarking.md)
on using benchmark_runner.py for advanced benchmarking.
