Commit 93a1ba5

Merge pull request #62 from AI-Hypercomputer/bvandermoon-recipes

Add bash setup.sh to MaxText installation

2 parents: 3989e9d + ceb0a32

File tree: 2 files changed (+40, −25 lines)


training/trillium/MAXTEXT_README.md

Lines changed: 27 additions & 10 deletions
````diff
@@ -1,33 +1,50 @@
 # Prep for Maxtext workloads on GKE
 1. Clone [Maxtext](https://github.com/google/maxtext) repo and move to its directory
-```
+```shell
 git clone https://github.com/google/maxtext.git
 cd maxtext
 # Checkout either the commit id or MaxText tag.
-# Example: `git checkout tpu-recipes-v0.1.0`
+# Example: `git checkout tpu-recipes-v0.1.1`
 git checkout ${MAXTEXT_COMMIT_ID_OR_TAG}
 ```

-2. Run the following commands to build the docker image
+2. Install MaxText dependencies
+```shell
+bash setup.sh
+```
+
+Optional: Use a virtual environment to set up and run your workloads. This can help with errors
+like `This environment is externally managed`.
+```shell
+## One-time step of creating the venv
+VENV_DIR=~/venvp3
+python3 -m venv $VENV_DIR
+## Enter your venv.
+source $VENV_DIR/bin/activate
+## Install dependencies
+bash setup.sh
 ```
-# Example BASE_IMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.4.35-rev1
+
+3. Run the following commands to build the docker image
+```shell
+# Example BASE_IMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.5.2-rev1
 BASE_IMAGE=<stable_stack_image_with_desired_jax_version>
 bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE=${BASE_IMAGE}
 ```

-3. Upload your docker image to Container Registry
-```
+4. Upload your docker image to Container Registry
+```shell
 bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_runner
 ```

-4. Create your GCS bucket
-```
+5. Create your GCS bucket
+```shell
 OUTPUT_DIR=gs://v6e-demo-run #<your_GCS_folder_for_results>
 gcloud storage buckets create ${OUTPUT_DIR} --project ${PROJECT}
 ```

-5. Specify your workload configs
-```
+6. Specify your workload configs
+```shell
 export PROJECT=#<your_compute_project>
 export ZONE=#<your_compute_zone>
 export CLUSTER_NAME=v6e-demo #<your_cluster_name>
````
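Taken together, the MaxText prep flow after this commit separates dependency installation (`bash setup.sh`, the step this PR adds) from the Docker image build. Below is a minimal end-to-end sketch assembled from the commands in the hunk above; the tag and base image are the diff's examples, not requirements:

```shell
# Clone MaxText and pin the example tag from the diff.
git clone https://github.com/google/maxtext.git
cd maxtext
git checkout tpu-recipes-v0.1.1

# Optional: a venv avoids "This environment is externally managed" errors.
VENV_DIR=~/venvp3
python3 -m venv $VENV_DIR
source $VENV_DIR/bin/activate

# New step added by this commit: install MaxText dependencies.
bash setup.sh

# Build and upload the dependency image (example base image from the diff).
BASE_IMAGE=us-docker.pkg.dev/cloud-tpu-images/jax-stable-stack/tpu:jax0.5.2-rev1
bash docker_build_dependency_image.sh DEVICE=tpu MODE=stable_stack BASEIMAGE=${BASE_IMAGE}
bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_runner
```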

training/trillium/XPK_README.md

Lines changed: 13 additions & 15 deletions
````diff
@@ -1,6 +1,6 @@
 ## Initialization
 1. Run the following commands to initialize the project and zone.
-```
+```shell
 export PROJECT=#<your_project_id>
 export ZONE=#<zone>
 gcloud config set project $PROJECT
````
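The hunk cuts off after the project setting; its zone counterpart, `gcloud config set compute/zone $ZONE`, is visible as context in the next hunk header. A short sketch of the full initialization with a verification step; the example values are placeholders, and `gcloud config list` is standard gcloud rather than anything this commit adds:

```shell
export PROJECT=my-gcp-project  # placeholder project id
export ZONE=us-east5-b         # placeholder zone
gcloud config set project $PROJECT
gcloud config set compute/zone $ZONE

# Confirm both values took effect before creating resources.
gcloud config list
```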
````diff
@@ -11,19 +11,19 @@ gcloud config set compute/zone $ZONE
 instructions. Also ensure you have the proper [GCP permissions](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#installation).

 * In order to run the tpu-recipes as-is, run the `git clone` command from your home directory:
-```
+```shell
 git clone https://github.com/google/xpk.git
 ```

 3. Run the rest of these commands from the cloned XPK directory:

-```
+```shell
 cd xpk # Should be equivalent to cd ~/xpk
 ```

 ## GKE Cluster Creation
 1. Specify your TPU GKE cluster configs.
-```
+```shell
 export CLUSTER_NAME=v6e-demo #<your_cluster_name>
 export NETWORK_NAME=${CLUSTER_NAME}-only-mtu9k
 export NETWORK_FW_NAME=${NETWORK_NAME}-only-fw
````
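Note that these exports chain parameter expansions, so every derived name tracks `CLUSTER_NAME`. With the example value, the names expand as follows:

```shell
export CLUSTER_NAME=v6e-demo
export NETWORK_NAME=${CLUSTER_NAME}-only-mtu9k
export NETWORK_FW_NAME=${NETWORK_NAME}-only-fw

echo $NETWORK_NAME     # v6e-demo-only-mtu9k
echo $NETWORK_FW_NAME  # v6e-demo-only-mtu9k-only-fw
```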
````diff
@@ -35,7 +35,7 @@ export REGION=<compute_region>
 ```

 2. Create the network and firewall for this cluster if they don’t exist yet.
-```
+```shell
 NETWORK_NAME_1=${CLUSTER_NAME}-mtu9k-1-${ZONE}
 NETWORK_FW_NAME_1=${NETWORK_NAME_1}-fw-1-${ZONE}

````
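The gcloud calls that actually create these resources fall in the elided middle of this hunk (only the NAT line surfaces later, in the next hunk header). Purely as an illustration of the likely shape, and not the repo's exact invocation, a hedged sketch; the MTU value and flags are assumptions:

```shell
# Hypothetical sketch only; the real commands are elided from this hunk.
NETWORK_NAME_1=${CLUSTER_NAME}-mtu9k-1-${ZONE}
NETWORK_FW_NAME_1=${NETWORK_NAME_1}-fw-1-${ZONE}

# "mtu9k" suggests a jumbo-frame network; 8896 is an assumed MTU.
gcloud compute networks create ${NETWORK_NAME_1} --mtu=8896 --subnet-mode=auto
gcloud compute firewall-rules create ${NETWORK_FW_NAME_1} \
  --network=${NETWORK_NAME_1} --allow=tcp,udp,icmp
```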
````diff
@@ -67,7 +67,7 @@ gcloud compute routers nats create "${NAT_CONFIG}" \
 ```

 3. Create GKE cluster with TPU node-pools
-```
+```shell
 export CLUSTER_ARGUMENTS="--enable-dataplane-v2 --enable-ip-alias --enable-multi-networking --network=${NETWORK_NAME_1} --subnetwork=${NETWORK_NAME_1}"

 export NODE_POOL_ARGUMENTS="--additional-node-network network=${NETWORK_NAME_2},subnetwork=${SUBNET_NAME_2}"
````
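These two variables feed the `xpk.py cluster create` call whose first line shows up in the next hunk header. A sketch of how they are typically passed, assuming the `--custom-cluster-arguments`/`--custom-nodepool-arguments` flags from the public xpk README; the TPU type and slice count are placeholders:

```shell
# Sketch based on the xpk README; flags may differ by xpk release.
python3 xpk.py cluster create \
  --cluster ${CLUSTER_NAME} \
  --tpu-type=v6e-256 \
  --num-slices=1 \
  --custom-cluster-arguments="${CLUSTER_ARGUMENTS}" \
  --custom-nodepool-arguments="${NODE_POOL_ARGUMENTS}"
```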
````diff
@@ -80,12 +80,12 @@ python3 xpk.py cluster create --cluster $CLUSTER_NAME --cluster-cpu-machine-type
 * You should be able to see your GKE cluster similar to this once it is created successfully: ![image](https://github.com/user-attachments/assets/60743411-5ee5-4391-bb0e-7ffba4d91c1d)

 4. Performance Daemonset
-```
+```shell
 kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/9ff340f07f70be0130454f9e7238551587242b75/scripts/network-setup/v6e-network-optimization.yaml
 ```

 5. Test your GKE cluster to make sure it is usable
-```
+```shell
 python3 xpk.py workload create \
 --cluster ${CLUSTER_NAME} \
 --workload hello-world-test \
````
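The `workload create` command is truncated at the hunk boundary; its remaining flags are elided. A hedged completion modeled on xpk's hello-world example, where the `--command`, `--tpu-type`, and `--num-slices` values are assumptions:

```shell
# Hypothetical completion of the truncated command above.
python3 xpk.py workload create \
  --cluster ${CLUSTER_NAME} \
  --workload hello-world-test \
  --command "echo Hello World" \
  --tpu-type=v6e-256 \
  --num-slices=1
```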
````diff
@@ -96,17 +96,15 @@ python3 xpk.py workload create \
 * You should be able to see results like this: ![image](https://github.com/user-attachments/assets/c33010a6-e109-411e-8fb5-afb4edb3fa72)

 6. You can also check your workload status with the following command:
-```
-python3 xpk.py workload list \
---cluster ${CLUSTER_NAME}
-```
+```shell
+python3 xpk.py workload list --cluster ${CLUSTER_NAME}
+```
 7. For more information about XPK, please refer to this [link](https://github.com/google/xpk).

 ## GKE Cluster Deletion
 You can use the following command to delete the GKE cluster:
-```
+```shell
 export CLUSTER_NAME=v6e-demo #<your_cluster_name>

-python3 xpk.py cluster delete \
---cluster $CLUSTER_NAME
+python3 xpk.py cluster delete --cluster $CLUSTER_NAME
 ```
````
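With both commands collapsed to one line by this commit, a typical teardown sequence looks like the sketch below; `workload delete` comes from the xpk CLI rather than this diff, and the workload name is the hello-world example from earlier:

```shell
# List what is running, remove the test workload, then delete the cluster.
python3 xpk.py workload list --cluster ${CLUSTER_NAME}
python3 xpk.py workload delete --workload hello-world-test --cluster ${CLUSTER_NAME}
python3 xpk.py cluster delete --cluster ${CLUSTER_NAME}
```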
