Commit 97b8cff

Validate CPU and memory limits against the machine type. (#808)
* Validate CPU and memory limits against the machine type.
1 parent cde0286 commit 97b8cff
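
The golden output in this commit shows xpk reading the machine type's capacity with `gcloud compute machine-types describe ... --format='value(guestCpus,memoryMb)'`. As a rough sketch only, the two values could be parsed as below. `parse_machine_capacity` is a hypothetical helper, not a function from the xpk source, and it assumes the command prints the two numbers on one line separated by whitespace.

```python
def parse_machine_capacity(stdout: str) -> tuple[int, int]:
  """Parse `guestCpus` and `memoryMb` from the command's stdout.

  Assumption: both values arrive on a single line separated by whitespace
  (for example a tab). Hypothetical helper for illustration only.
  """
  fields = stdout.split()
  if len(fields) != 2:
    raise ValueError(f"expected two values, got: {stdout!r}")
  guest_cpus, memory_mb = (int(field) for field in fields)
  return guest_cpus, memory_mb
```

Splitting on any whitespace rather than on a literal tab keeps the sketch tolerant of a trailing newline or a different separator.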

10 files changed: +565 -36 lines changed

goldens.yaml

Lines changed: 4 additions & 0 deletions
@@ -5,6 +5,10 @@ goldens:
 command: python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --enable-autoprovisioning --cluster=golden-cluster --tpu-type=tpu7x-8 --on-demand --dry-run
 "Basic cluster create":
 command: python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --spot --dry-run
+"Cluster create with CPU and memory limits below capacity":
+command: python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --spot --cpu-limit=1 --memory-limit=1Mi --dry-run
+"Cluster create with CPU and memory limits above capacity":
+command: python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --spot --cpu-limit=20 --memory-limit=1Gi --dry-run
 "Cluster create with gb200-4":
 command: python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --device-type=gb200-4 --reservation=golden-reservation --dry-run
 "Cluster create private":
Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
+$ python3 xpk.py cluster create --project=golden-project --zone=us-central1-a --cluster=golden-cluster --tpu-type=tpu7x-8 --spot --cpu-limit=20 --memory-limit=1Gi --dry-run
+[XPK] Starting xpk v0.14.3
+[XPK] Starting cluster create for cluster golden-cluster:
+[XPK] Working on golden-project and us-central1-a
+[XPK] Task: `Determine server supported GKE versions for default rapid gke version` is implemented by the following command not running since it is a dry run.
+gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.defaultVersion)"
+[XPK] Task: `Determine server supported GKE versions for valid versions` is implemented by the following command not running since it is a dry run.
+gcloud container get-server-config --project=golden-project --region=us-central1 --flatten="channels" --filter="channels.channel=RAPID" --format="value(channels.validVersions)"
+[XPK] Task: `Find if Cluster Exists` is implemented by the following command not running since it is a dry run.
+gcloud container clusters list --project=golden-project --filter=location~"us-central1.*" --format="csv[no-heading](name)"
+[XPK] Task: `GKE Cluster Create` is implemented by the following command not running since it is a dry run.
+gcloud beta container clusters create golden-cluster --project=golden-project --region=us-central1 --node-locations=us-central1-a --cluster-version=0 --machine-type=e2-standard-16 --enable-autoscaling --total-min-nodes 1 --total-max-nodes 1000 --num-nodes 6 --enable-dns-access --autoscaling-profile=optimize-utilization --labels=gke_product_type=xpk --location-policy=BALANCED --scopes=storage-full,gke-default
+[XPK] Task: `Find cluster region or zone` is implemented by the following command not running since it is a dry run.
+gcloud container clusters list --project=golden-project --filter=name=golden-cluster --format="value(location)"
+[XPK] Task: `Check if Private Nodes is enabled in cluster.` is implemented by the following command not running since it is a dry run.
+gcloud container clusters describe golden-cluster --project=golden-project --location=us-central1 --format="value(privateClusterConfig.enablePrivateNodes)"
+[XPK] Private Nodes is not enabled on the cluster.
+[XPK] Cluster is public and no need to authorize networks.
+[XPK] Try 1: get-credentials-dns-endpoint to cluster golden-cluster
+[XPK] Task: `get-credentials-dns-endpoint to cluster golden-cluster` is implemented by the following command not running since it is a dry run.
+gcloud container clusters get-credentials golden-cluster --location=us-central1 --dns-endpoint --project=golden-project && kubectl config view && kubectl config set-context --current --namespace=default
+[XPK] Testing credentials with kubectl...
+[XPK] Task: `kubectl get pods` is implemented by the following command not running since it is a dry run.
+kubectl get pods
+[XPK] Credentials test succeeded.
+[XPK] Finished get-credentials and kubectl setup.
+[XPK] Task: 'Checking CoreDNS deployment existence' in progress for namespace: kube-system
+[XPK] Task: `Check CoreDNS deployment in kube-system` is implemented by the following command not running since it is a dry run.
+kubectl get deployment coredns -n kube-system
+[XPK] Now verifying CoreDNS readiness...
+[XPK] Task: `Waiting for kubeDNS to be checked.` is implemented by the following command not running since it is a dry run.
+kubectl get deployment kube-dns -n kube-system --ignore-not-found
+[XPK] kube-dns deployment not found.
+[XPK] Verifying if CoreDNS is available...
+[XPK] Task: `Wait for coredns available` is implemented by the following command not running since it is a dry run.
+kubectl wait deployment/coredns --for=condition=Available=true --namespace=kube-system --timeout=240s
+[XPK] CoreDNS has successfully started and passed verification.
+[XPK] CoreDNS deployment 'coredns' found in namespace 'kube-system'.
+[XPK] Skipping CoreDNS deployment since it already exists.
+[XPK] Task: `Determine current gke master version` is implemented by the following command not running since it is a dry run.
+gcloud beta container clusters describe golden-cluster --location us-central1 --project golden-project --format="value(currentMasterVersion)"
+[XPK] Creating 1 node pool or pools of tpu7x-8
+We assume that the underlying system is: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, requires_workload_policy=True)
+[XPK] Task: `Get All Node Pools` is implemented by the following command not running since it is a dry run.
+gcloud beta container node-pools list --cluster golden-cluster --project=golden-project --location=us-central1 --format="csv[no-heading](name)"
+[XPK] Creating 1 node pool or pools of tpu7x-8
+Underlyingly, we assume that means: SystemCharacteristics(topology='2x2x1', vms_per_slice=1, gke_accelerator='tpu7x', gce_machine_type='tpu7x-standard-4t', chips_per_vm=4, accelerator_type=TPU, device_type='tpu7x-8', supports_sub_slicing=False, requires_workload_policy=True)
+[XPK] Task: `Get Node Pool Zone` is implemented by the following command not running since it is a dry run.
+gcloud beta container node-pools describe 0 --cluster golden-cluster --project=golden-project --location=us-central1 --format="value(locations)"
+[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
+kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
+[XPK] Existing node pool names ['0']
+[XPK] Task: `Retrieve resource policy` is implemented by the following command not running since it is a dry run.
+gcloud compute resource-policies describe tpu7x-8-2x2x1-placement-policy --project=golden-project --region=us-central1
+[XPK] To complete NodepoolCreate-golden-cluster-np-0 we are executing gcloud beta container node-pools create golden-cluster-np-0 --location=us-central1 --cluster=golden-cluster --project=golden-project --node-locations=us-central1-a --machine-type=tpu7x-standard-4t --host-maintenance-interval=AS_NEEDED --spot --placement-policy=tpu7x-8-2x2x1-placement-policy --enable-gvnic --node-version=0 --num-nodes=1 --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform" --max-pods-per-node 15
+[XPK] Breaking up a total of 1 commands into 1 batches
+[XPK] Pretending all the jobs succeeded
+[XPK] Create or delete node pool request complete.
+[XPK] Creating ConfigMap for cluster
+[XPK] Breaking up a total of 2 commands into 1 batches
+[XPK] Pretending all the jobs succeeded
+[XPK] Enabling the jobset API on our cluster, to be deprecated when Jobset is globally available
+[XPK] Try 1: Install Jobset on golden-cluster
+[XPK] Task: `Install Jobset on golden-cluster` is implemented by the following command not running since it is a dry run.
+kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.8.0/manifests.yaml
+[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
+kubectl get node --no-headers | wc -l
+[XPK] Try 1: Updating jobset Controller Manager resources
+[XPK] Task: `Updating jobset Controller Manager resources` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 1b31e624e490f9c8c4ef4e369f08d3fa467990af5a261e4405bd045265d70e95
+[XPK] Try 1: Install PathwaysJob on golden-cluster
+[XPK] Task: `Install PathwaysJob on golden-cluster` is implemented by the following command not running since it is a dry run.
+kubectl apply --server-side -f https://github.com/google/pathways-job/releases/download/v0.1.4/install.yaml
+[XPK] Enabling Kueue on the cluster
+[XPK] Task: `Get kueue version on server` is implemented by the following command not running since it is a dry run.
+kubectl get deployment kueue-controller-manager -n kueue-system -o jsonpath='{.spec.template.spec.containers[0].image}'
+[XPK] Installing Kueue version v0.14.3...
+[XPK] Try 1: Install Kueue
+[XPK] Task: `Install Kueue` is implemented by the following command not running since it is a dry run.
+kubectl apply --server-side --force-conflicts -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.14.3/manifests.yaml
+[XPK] Task: `Wait for Kueue to be available` is implemented by the following command not running since it is a dry run.
+kubectl wait deploy/kueue-controller-manager -n kueue-system --for=condition=available --timeout=10m
+[XPK] Task: `Get vCPU and memory capacity for machine type` is implemented by the following command not running since it is a dry run.
+gcloud compute machine-types describe tpu7x-standard-4t --project=golden-project --zone=us-central1-a --format='value(guestCpus,memoryMb)'
+[XPK] The CPU limit is above the available capacity. We will set CPU limit to 10.
+[XPK] The memory limit is above the available capacity. We will set memory limit to 10Mi.
+[XPK] Applying following Kueue resources:
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ResourceFlavor
+metadata:
+name: "1xtpu7x-8"
+spec:
+nodeLabels: {"cloud.google.com/gke-tpu-accelerator": "tpu7x", "cloud.google.com/gke-tpu-topology": "2x2x1"}
+
+---
+
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: AdmissionCheck
+metadata:
+name: dws-prov
+spec:
+controllerName: kueue.x-k8s.io/provisioning-request
+parameters:
+apiGroup: kueue.x-k8s.io
+kind: ProvisioningRequestConfig
+name: dws-config
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ProvisioningRequestConfig
+metadata:
+name: dws-config
+spec:
+provisioningClassName: queued-provisioning.gke.io
+podSetUpdates:
+nodeSelector:
+- key: autoscaling.gke.io/provisioning-request
+valueFromProvisioningClassDetail: ResizeRequestName
+managedResources:
+- google.com/tpu
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+name: "cluster-queue"
+spec:
+preemption:
+reclaimWithinCohort: Never # Don't preempt other queues in the cohort.
+withinClusterQueue: LowerPriority
+namespaceSelector: {} # match all.
+resourceGroups: [{'coveredResources': ['google.com/tpu', 'cpu', 'memory'], 'flavors': [{'name': '1xtpu7x-8', 'resources': [{'name': 'google.com/tpu', 'nominalQuota': 4}, {'name': 'cpu', 'nominalQuota': 10}, {'name': 'memory', 'nominalQuota': '10Mi'}]}]}]
+
+---
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+namespace: default
+name: multislice-queue
+spec:
+clusterQueue: cluster-queue
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+name: very-low
+value: 100
+globalDefault: false
+description: "Very Low"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+name: low
+value: 250
+globalDefault: false
+description: "Low"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+name: medium
+value: 500
+globalDefault: false
+description: "Medium"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+name: high
+value: 750
+globalDefault: false
+description: "High"
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+name: very-high
+value: 1000
+globalDefault: false
+description: "Very High"
+[XPK] Task: `Applying Kueue Custom Resources` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 1ea1a0b1a0ec540d8320ef2a8378363e692a8439192a8f50c4b77fe545dd0a4c
+[XPK] Task: `Count total nodes` is implemented by the following command not running since it is a dry run.
+kubectl get node --no-headers | wc -l
+[XPK] Try 1: Updating Kueue Controller Manager resources
+[XPK] Task: `Updating Kueue Controller Manager resources` is implemented by the following command not running since it is a dry run.
+kubectl patch deployment kueue-controller-manager -n kueue-system --type='strategic' --patch='{"spec": {"template": {"spec": {"containers": [{"name": "manager", "resources": {"limits": {"memory": "4096Mi"}}}]}}}}'
+[XPK] Verifying kjob installation
+[XPK] Task: `Verify kjob installation ` is implemented by the following command not running since it is a dry run.
+kubectl-kjob help
+[XPK] kjob found
+[XPK] Applying kjob CDRs
+[XPK] Task: `Create kjob CRDs on cluster` is implemented by the following command not running since it is a dry run.
+kubectl kjob printcrds | kubectl apply --server-side -f -
+[XPK] Creating kjob CRDs succeeded
+[XPK] Task: `GKE Cluster Get ConfigMap` is implemented by the following command not running since it is a dry run.
+kubectl get configmap golden-cluster-resources-configmap -o=custom-columns="ConfigData:data" --no-headers=true
+[XPK] Task: `Creating JobTemplate` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 4abb796ed6e7c9d7256a51f13124efd989fc12ee83839bed432fcf7d64f68e61
+[XPK] Task: `Creating PodTemplate` is implemented by the following command not running since it is a dry run.
+kubectl apply -f a63aa3c4593c38ad90671fd8b067d1886f6313ad558379b364b51791aa50f4e8
+[XPK] Task: `Creating AppProfile` is implemented by the following command not running since it is a dry run.
+kubectl apply -f 1d13ddebae3c90a05ba26b312df088982dd0df0edc4f4013b88384e476c20486
+[XPK] GKE commands done! Resources are created.
+[XPK] See your GKE Cluster here: https://console.cloud.google.com/kubernetes/clusters/details/us-central1/golden-cluster/details?project=golden-project
+[XPK] Exiting XPK cleanly
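
In the golden output above, the `Get vCPU and memory capacity for machine type` task is followed by two messages saying the requested limits (`--cpu-limit=20`, `--memory-limit=1Gi`) were lowered to 10 and 10Mi, and the ClusterQueue then carries those values as `nominalQuota`. A minimal Python sketch of that clamping step is shown here. The names `parse_memory` and `clamp_limits` are hypothetical and not taken from the xpk source; the capacity of 10 vCPUs and 10 Mi is only what the dry-run output appears to use, and how xpk actually derives it is not shown in this diff.

```python
import re

# Binary suffixes as used in Kubernetes-style quantities such as "1Gi".
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
  """Convert a quantity like '1Gi' or '10Mi' to bytes (hypothetical helper)."""
  match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
  if match is None:
    raise ValueError(f"unsupported memory quantity: {quantity!r}")
  number, unit = match.groups()
  return int(number) * _UNITS.get(unit, 1)


def clamp_limits(cpu_limit, memory_limit, cpu_capacity, memory_capacity_mi):
  """Lower each limit to the capacity when it exceeds it.

  Returns the (possibly lowered) limits and the messages to print.
  The message wording mirrors the golden output in this commit.
  """
  messages = []
  if cpu_limit > cpu_capacity:
    cpu_limit = cpu_capacity
    messages.append(
        "The CPU limit is above the available capacity."
        f" We will set CPU limit to {cpu_limit}."
    )
  if parse_memory(memory_limit) > memory_capacity_mi * _UNITS["Mi"]:
    memory_limit = f"{memory_capacity_mi}Mi"
    messages.append(
        "The memory limit is above the available capacity."
        f" We will set memory limit to {memory_limit}."
    )
  return cpu_limit, memory_limit, messages
```

With the assumed capacity, `clamp_limits(20, "1Gi", 10, 10)` lowers both limits, while the "below capacity" golden case (`--cpu-limit=1 --memory-limit=1Mi`) would pass through unchanged with no messages.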
