
Commit e0a003b

Use public GCP documentation to create v6e clusters
1 parent: e3d3560

File tree: 1 file changed (+4, -85 lines)


training/trillium/XPK_README.md

Lines changed: 4 additions & 85 deletions
````diff
@@ -33,89 +33,8 @@ steps, you must use the same one to run the steps in the [MAXTEXT_README](MAXTEX
 as well as your relevant tpu-recipe workloads.
 
 ## GKE Cluster Creation
-1. Specify your TPU GKE cluster configs.
-```shell
-export CLUSTER_NAME=v6e-demo #<your_cluster_name>
-export NETWORK_NAME=${CLUSTER_NAME}-only-mtu9k
-export NETWORK_FW_NAME=${NETWORK_NAME}-only-fw
-export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${NETWORK_NAME}"
-export TPU_TYPE=v6e-256 #<your TPU Type>
-export NUM_SLICES=1 #<number of TPU node-pools you want to create>
-export ZONE=<compute_zone>
-export REGION=<compute_region>
-```
-
-2. Create the network and firewall for this cluster if it doesn’t exist yet.
-```shell
-NETWORK_NAME_1=${CLUSTER_NAME}-mtu9k-1-${ZONE}
-NETWORK_FW_NAME_1=${NETWORK_NAME_1}-fw-1-${ZONE}
-
-# Use a custom network for better performance as well as avoid the default network to be overloaded.
-gcloud compute networks create ${NETWORK_NAME_1} --mtu=8896 --project=${PROJECT} --subnet-mode=auto --bgp-routing-mode=regional
-gcloud compute firewall-rules create ${NETWORK_FW_NAME_1} --network ${NETWORK_NAME_1} --allow tcp,icmp,udp --project=${PROJECT}
-
-# Secondary subnet for multinic experience. Need custom ip routing to be different from first network’s subnet.
-export NETWORK_NAME_2=${CLUSTER_NAME}-privatenetwork-2-${ZONE}
-export SUBNET_NAME_2=${CLUSTER_NAME}-privatesubnet-2-${ZONE}
-export FIREWALL_RULE_NAME=${CLUSTER_NAME}-privatefirewall-2-${ZONE}
-export ROUTER_NAME=${CLUSTER_NAME}-network-2-${ZONE}
-export NAT_CONFIG=${CLUSTER_NAME}-natconfig-2-${ZONE}
-
-gcloud compute networks create "${NETWORK_NAME_2}" --mtu=8896 --bgp-routing-mode=regional --subnet-mode=custom --project=$PROJECT
-gcloud compute networks subnets create "${SUBNET_NAME_2}" --network="${NETWORK_NAME_2}" --range=10.10.0.0/18 --region="${REGION}" --project=$PROJECT
-gcloud compute firewall-rules create "${FIREWALL_RULE_NAME}" --network "${NETWORK_NAME_2}" --allow tcp,icmp,udp --project="${PROJECT}"
-gcloud compute routers create "${ROUTER_NAME}" \
-  --project="${PROJECT}" \
-  --network="${NETWORK_NAME_2}" \
-  --region="${REGION}"
-gcloud compute routers nats create "${NAT_CONFIG}" \
-  --router="${ROUTER_NAME}" \
-  --region="${REGION}" \
-  --auto-allocate-nat-external-ips \
-  --nat-all-subnet-ip-ranges \
-  --project="${PROJECT}" \
-  --enable-logging
-```
-
-3. Create GKE cluster with TPU node-pools
-```shell
-export CLUSTER_ARGUMENTS="--enable-dataplane-v2 --enable-ip-alias --enable-multi-networking --network=${NETWORK_NAME_1} --subnetwork=${NETWORK_NAME_1}"
-
-export NODE_POOL_ARGUMENTS="--additional-node-network network=${NETWORK_NAME_2},subnetwork=${SUBNET_NAME_2}"
-
-python3 xpk.py cluster create --cluster $CLUSTER_NAME --cluster-cpu-machine-type=n1-standard-8 --num-slices=$NUM_SLICES --tpu-type=$TPU_TYPE --zone=$ZONE --project=$PROJECT --on-demand --custom-cluster-arguments="${CLUSTER_ARGUMENTS}" --custom-nodepool-arguments="${NODE_POOL_ARGUMENTS}" --create-vertex-tensorboard
-```
-
-* Noted: TPU has `reserved`, `on-demand`, `spot` quota. This example used the `on-demand` quota. If you have the reserved or spot quota, please refer to this [link](https://github.com/google/xpk?tab=readme-ov-file#cluster-create).
-* If you want to check what quota you have, please refer to this [link](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#ensure-quota).
-* You should be able to see your GKE cluster similar to this once it is created successfully: ![image](https://github.com/user-attachments/assets/60743411-5ee5-4391-bb0e-7ffba4d91c1d)
-
-4. Performance Daemonset
-```shell
-kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/9ff340f07f70be0130454f9e7238551587242b75/scripts/network-setup/v6e-network-optimization.yaml
-```
-
-5. Test your GKE cluster to make sure it is usable
-```shell
-python3 xpk.py workload create \
---cluster ${CLUSTER_NAME} \
---workload hello-world-test \
---tpu-type=${TPU_TYPE} \
---num-slices=${NUM_SLICES} \
---command "echo Hello World"
-```
-* You should be able to see results like this: ![image](https://github.com/user-attachments/assets/c33010a6-e109-411e-8fb5-afb4edb3fa72)
-
-6. You can also check your workload status with the following command:
-```shell
-python3 xpk.py workload list --cluster ${CLUSTER_NAME}
-```
-7. For more information about XPK, please refer to this [link](https://github.com/google/xpk).
-
-## GKE Cluster Deletion
-You can use the following command to delete GKE cluster:
-```shell
-export CLUSTER_NAME=v6e-demo #<your_cluster_name>
+Trillium GKE clusters can be [created](https://cloud.google.com/tpu/docs/v6e-intro#create_an_xpk_cluster_with_multi-nic_support) and
+[deleted](https://cloud.google.com/tpu/docs/v6e-intro#delete_xpk_cluster) by following the public GCP documentation.
 
-python3 xpk.py cluster delete --cluster $CLUSTER_NAME
-```
+> Note: in order to run the training and microbenchmarks tpu-recipes, you should not need to run sections outside of
+`Create an XPK cluster with multi-NIC support` when creating your cluster. You can skip the following sections like `Framework setup`.
````
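For readers landing on this commit without the old text, the naming scheme the removed steps used for the multi-NIC networks is plain shell parameter expansion on the cluster name and zone. A minimal sketch, assuming example values for `CLUSTER_NAME` and `ZONE` (the real values come from your own project; nothing here creates cloud resources):

```shell
# Sketch of the naming convention from the removed instructions.
# CLUSTER_NAME and ZONE are illustrative assumptions, not defaults.
CLUSTER_NAME=v6e-demo
ZONE=us-east5-b

# Primary MTU-8896 network and its firewall rule:
NETWORK_NAME_1=${CLUSTER_NAME}-mtu9k-1-${ZONE}
NETWORK_FW_NAME_1=${NETWORK_NAME_1}-fw-1-${ZONE}

# Secondary network/subnet used for the multi-NIC node pools:
NETWORK_NAME_2=${CLUSTER_NAME}-privatenetwork-2-${ZONE}
SUBNET_NAME_2=${CLUSTER_NAME}-privatesubnet-2-${ZONE}

echo "${NETWORK_NAME_1}"    # v6e-demo-mtu9k-1-us-east5-b
echo "${NETWORK_NAME_2}"    # v6e-demo-privatenetwork-2-us-east5-b
```

Deriving every resource name from `CLUSTER_NAME` and `ZONE` keeps the networks, firewall rules, router, and NAT config for one cluster easy to find and to delete together.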

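The removed step 5 remains a useful smoke test after creating a cluster via the public docs. A sketch that only assembles and prints the `xpk.py workload create` command (running it assumes `xpk.py` on your path and a live cluster; the variable values are example assumptions):

```shell
# Build the hello-world smoke-test command from the removed step 5.
# This prints the command rather than executing it.
CLUSTER_NAME=v6e-demo   # assumption: example cluster name
TPU_TYPE=v6e-256        # assumption: example TPU type
NUM_SLICES=1

cmd="python3 xpk.py workload create \
--cluster ${CLUSTER_NAME} \
--workload hello-world-test \
--tpu-type=${TPU_TYPE} \
--num-slices=${NUM_SLICES} \
--command 'echo Hello World'"

echo "$cmd"
```

If the workload schedules and `Hello World` appears in the logs, the node pools and networking are usable; `python3 xpk.py workload list --cluster ${CLUSTER_NAME}` then shows its status.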