@@ -33,89 +33,8 @@ steps, you must use the same one to run the steps in the [MAXTEXT_README](MAXTEX
as well as your relevant tpu-recipe workloads.

## GKE Cluster Creation
-1. Specify your TPU GKE cluster configs.
-   ```shell
-   export CLUSTER_NAME=v6e-demo # <your_cluster_name>
-   export NETWORK_NAME=${CLUSTER_NAME}-only-mtu9k
-   export NETWORK_FW_NAME=${NETWORK_NAME}-only-fw
-   export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"
-   export TPU_TYPE=v6e-256 # <your_tpu_type>
-   export NUM_SLICES=1 # <number of TPU node pools you want to create>
-   export ZONE=<compute_zone>
-   export REGION=<compute_region>
-   ```
-
-2. Create the network and firewall for this cluster if they don't exist yet.
-   ```shell
-   NETWORK_NAME_1=${CLUSTER_NAME}-mtu9k-1-${ZONE}
-   NETWORK_FW_NAME_1=${NETWORK_NAME_1}-fw-1-${ZONE}
-
-   # Use a custom network for better performance and to avoid overloading the default network.
-   gcloud compute networks create ${NETWORK_NAME_1} --mtu=8896 --project=${PROJECT} --subnet-mode=auto --bgp-routing-mode=regional
-   gcloud compute firewall-rules create ${NETWORK_FW_NAME_1} --network ${NETWORK_NAME_1} --allow tcp,icmp,udp --project=${PROJECT}
-
-   # Secondary subnet for multi-NIC support. Its IP range must differ from the first network's subnet.
-   export NETWORK_NAME_2=${CLUSTER_NAME}-privatenetwork-2-${ZONE}
-   export SUBNET_NAME_2=${CLUSTER_NAME}-privatesubnet-2-${ZONE}
-   export FIREWALL_RULE_NAME=${CLUSTER_NAME}-privatefirewall-2-${ZONE}
-   export ROUTER_NAME=${CLUSTER_NAME}-network-2-${ZONE}
-   export NAT_CONFIG=${CLUSTER_NAME}-natconfig-2-${ZONE}
-
-   gcloud compute networks create "${NETWORK_NAME_2}" --mtu=8896 --bgp-routing-mode=regional --subnet-mode=custom --project=$PROJECT
-   gcloud compute networks subnets create "${SUBNET_NAME_2}" --network="${NETWORK_NAME_2}" --range=10.10.0.0/18 --region="${REGION}" --project=$PROJECT
-   gcloud compute firewall-rules create "${FIREWALL_RULE_NAME}" --network "${NETWORK_NAME_2}" --allow tcp,icmp,udp --project="${PROJECT}"
-   gcloud compute routers create "${ROUTER_NAME}" \
-     --project="${PROJECT}" \
-     --network="${NETWORK_NAME_2}" \
-     --region="${REGION}"
-   gcloud compute routers nats create "${NAT_CONFIG}" \
-     --router="${ROUTER_NAME}" \
-     --region="${REGION}" \
-     --auto-allocate-nat-external-ips \
-     --nat-all-subnet-ip-ranges \
-     --project="${PROJECT}" \
-     --enable-logging
-   ```
-
-3. Create the GKE cluster with TPU node pools.
-   ```shell
-   export CLUSTER_ARGUMENTS="--enable-dataplane-v2 --enable-ip-alias --enable-multi-networking --network=${NETWORK_NAME_1} --subnetwork=${NETWORK_NAME_1}"
-
-   export NODE_POOL_ARGUMENTS="--additional-node-network network=${NETWORK_NAME_2},subnetwork=${SUBNET_NAME_2}"
-
-   python3 xpk.py cluster create --cluster $CLUSTER_NAME --cluster-cpu-machine-type=n1-standard-8 --num-slices=$NUM_SLICES --tpu-type=$TPU_TYPE --zone=$ZONE --project=$PROJECT --on-demand --custom-cluster-arguments="${CLUSTER_ARGUMENTS}" --custom-nodepool-arguments="${NODE_POOL_ARGUMENTS}" --create-vertex-tensorboard
-   ```
-
-   * Note: TPU quota can be `reserved`, `on-demand`, or `spot`. This example uses the `on-demand` quota. If you have reserved or spot quota, please refer to this [link](https://github.com/google/xpk?tab=readme-ov-file#cluster-create).
-   * If you want to check what quota you have, please refer to this [link](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#ensure-quota).
-   * Once the cluster is created successfully, you should see your GKE cluster similar to this: ![image](https://github.com/user-attachments/assets/60743411-5ee5-4391-bb0e-7ffba4d91c1d)
-
-4. Apply the network-optimization performance DaemonSet.
-   ```shell
-   kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/9ff340f07f70be0130454f9e7238551587242b75/scripts/network-setup/v6e-network-optimization.yaml
-   ```
-
-5. Test your GKE cluster to make sure it is usable.
-   ```shell
-   python3 xpk.py workload create \
-     --cluster ${CLUSTER_NAME} \
-     --workload hello-world-test \
-     --tpu-type=${TPU_TYPE} \
-     --num-slices=${NUM_SLICES} \
-     --command "echo Hello World"
-   ```
-   * You should be able to see results like this: ![image](https://github.com/user-attachments/assets/c33010a6-e109-411e-8fb5-afb4edb3fa72)
-
-6. You can also check your workload status with the following command:
-   ```shell
-   python3 xpk.py workload list --cluster ${CLUSTER_NAME}
-   ```
-7. For more information about XPK, please refer to this [link](https://github.com/google/xpk).
-
-## GKE Cluster Deletion
-You can use the following command to delete a GKE cluster:
-```shell
-export CLUSTER_NAME=v6e-demo # <your_cluster_name>
+Trillium GKE clusters can be [created](https://cloud.google.com/tpu/docs/v6e-intro#create_an_xpk_cluster_with_multi-nic_support) and
+[deleted](https://cloud.google.com/tpu/docs/v6e-intro#delete_xpk_cluster) by following the public GCP documentation.

-python3 xpk.py cluster delete --cluster $CLUSTER_NAME
-```
+> Note: in order to run the training and microbenchmark tpu-recipes, you should not need to run sections outside of
+> `Create an XPK cluster with multi-NIC support` when creating your cluster. You can skip subsequent sections such as `Framework setup`.
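
For quick reference, the cluster creation flow in the linked documentation looks roughly like the sketch below. This is a minimal sketch, assuming placeholder values (`v6e-demo`, `v6e-256`, on-demand quota) and assuming the primary and secondary networks (`NETWORK_NAME_1`, `NETWORK_NAME_2`, `SUBNET_NAME_2`) have already been created as described there; the flags in the linked docs are authoritative if they differ.

```shell
# Minimal sketch: create an XPK cluster with multi-NIC support (placeholder values).
export PROJECT=<your_project>
export CLUSTER_NAME=v6e-demo   # <your_cluster_name>
export TPU_TYPE=v6e-256        # <your_tpu_type>
export NUM_SLICES=1            # <number of TPU node pools>
export ZONE=<compute_zone>

# Assumes NETWORK_NAME_1, NETWORK_NAME_2, and SUBNET_NAME_2 already exist,
# created per the multi-NIC instructions in the linked documentation.
export CLUSTER_ARGUMENTS="--enable-dataplane-v2 --enable-ip-alias --enable-multi-networking --network=${NETWORK_NAME_1} --subnetwork=${NETWORK_NAME_1}"
export NODE_POOL_ARGUMENTS="--additional-node-network network=${NETWORK_NAME_2},subnetwork=${SUBNET_NAME_2}"

python3 xpk.py cluster create \
  --cluster $CLUSTER_NAME \
  --cluster-cpu-machine-type=n1-standard-8 \
  --num-slices=$NUM_SLICES \
  --tpu-type=$TPU_TYPE \
  --zone=$ZONE \
  --project=$PROJECT \
  --on-demand \
  --custom-cluster-arguments="${CLUSTER_ARGUMENTS}" \
  --custom-nodepool-arguments="${NODE_POOL_ARGUMENTS}"
```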
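Similarly, a quick usability check and the eventual teardown can be sketched as follows; `hello-world-test` is just an illustrative workload name, and `xpk.py workload list` reports each workload's status.

```shell
# Smoke-test the cluster with a trivial workload, then check its status.
python3 xpk.py workload create \
  --cluster ${CLUSTER_NAME} \
  --workload hello-world-test \
  --tpu-type=${TPU_TYPE} \
  --num-slices=${NUM_SLICES} \
  --command "echo Hello World"
python3 xpk.py workload list --cluster ${CLUSTER_NAME}

# Delete the cluster when you are done with it.
python3 xpk.py cluster delete --cluster $CLUSTER_NAME
```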