
Commit 6e4d488

Merge pull request #1746 from geremyCohen/ollama_on_gke
Ollama on gke
2 parents 2d03b88 + 4758fe4

File tree

14 files changed: +763 −1 lines changed

.gitignore

Lines changed: 7 additions & 1 deletion
```diff
@@ -14,4 +14,10 @@ startup.sh
 nohup.out
 
 venv/
-z_local_saved/
+z_local_saved/
+/.idea/
+/tools/.python-version
+/.python-version
+*.iml
+*.xml
+
```
Lines changed: 93 additions & 0 deletions
---
title: Spin up the GKE Cluster
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Overview

Arm CPUs are widely used in traditional AI/ML use cases. In this Learning Path, you learn how to run [Ollama](https://ollama.com/) on Arm-based CPUs in a hybrid-architecture (amd64 and arm64) K8s cluster.

To demonstrate a real-life scenario, you're going to bring up an initial Kubernetes cluster (depicted as "*1. Initial Cluster (amd64)*" in the image below) with an amd64 node running an Ollama Deployment and Service.

Next, as depicted by "*2. Hybrid Cluster amd64/arm64*", you'll add an arm64 node, and apply an arm64 Deployment and Service to it, so that you can test both architectures together, and separately, to investigate performance.

When you're satisfied with the arm64 performance compared to amd64, it's easy to delete the amd64-specific node, Deployment, and Service to complete the migration, as depicted in "*3. Migrated Cluster (arm64)*".

![Project Overview](images/general_flow.png)

Once you've seen how easy it is to add an arm64 node to an existing cluster, you can apply this knowledge to experiment with the value arm64 brings to other workloads in your environment.

### Create the cluster

1. From within the GCP Console, navigate to [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes/list/overview) and click *Create*.

2. Select *Standard*->*Configure*.

![Select and Configure Cluster Type](images/select_standard.png)

The *Cluster basics* tab appears.

3. For *Name*, enter *ollama-on-multiarch*.
4. For *Region*, enter *us-central1*.

![Select and Configure Cluster Type](images/cluster_basics.png)

{{% notice Note %}}
Although this works in any region and zone where the C4 and C4a instance types are supported, this demo uses the *us-central1* region and the *us-central1-a* zone. In addition, with simplicity and cost savings in mind, only one node per architecture is used.
{{% /notice %}}
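If you'd like to confirm that the C4 (amd64) and C4a (arm64) series are actually offered in your chosen zone before creating the cluster, a quick query like the following can help. This is an optional sketch using the stock gcloud CLI; the regex in the filter is just one way to match both series:

```bash
# List C4 (amd64) and C4a (arm64) standard machine types offered in the zone
gcloud compute machine-types list \
  --zones=us-central1-a \
  --filter="name~'^c4a?-standard'"
```
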
5. Click on *NODE POOLS*->*default-pool*.
6. For *Name*, enter *amd64-pool*.
7. For *Size*, enter *1*.
8. Select *Specify node locations*, and select *us-central1-a*.

![Configure amd64 Node pool](images/x86-node-pool.png)

9. Click on *NODE POOLS*->*Nodes*.
10. For *Series*, select *C4*.
11. For *Machine Type*, select *c4-standard-4*.

{{% notice Note %}}
We've chosen node types that support one pod per node. If you wish to run multiple pods per node, assume each node should provide ~10GB of memory per pod.
{{% /notice %}}

![Configure amd64 node type](images/configure-x86-note-type.png)

12. Click the *Create* button at the bottom of the screen.

It takes a few moments, but when the green checkmark appears next to the *ollama-on-multiarch* cluster, you're ready to test your connection to the cluster.

### Connect to the cluster

{{% notice Note %}}
The following assumes you have gcloud and kubectl already installed. If not, follow the instructions on the first page under "Prerequisites".
{{% /notice %}}

You'll first set up credentials for your newly created K8s cluster using the gcloud utility. Enter the following at your command prompt (or in Cloud Shell), making sure to replace "YOUR_PROJECT_ID" with the ID of your GCP project:

```bash
export REGION=us-central1
export CLUSTER_NAME=ollama-on-multiarch
export PROJECT_ID=YOUR_PROJECT_ID
gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID
```

If you get the message:

```commandline
CRITICAL: ACTION REQUIRED: gke-gcloud-auth-plugin, which is needed for continued use of kubectl, was not found or is not executable. Install gke-gcloud-auth-plugin for use with kubectl by following https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin
```

This command should help resolve it:

```bash
gcloud components install gke-gcloud-auth-plugin
```

Finally, test the connection to the cluster with this command:

```commandline
kubectl cluster-info
```

If you receive a non-error response, you're successfully connected to the K8s cluster!
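
Since the rest of this Learning Path hinges on node architecture, it's also worth knowing how to display each node's architecture label. A small sketch using standard kubectl flags:

```bash
# Show nodes with an extra column for the kubernetes.io/arch label
kubectl get nodes -L kubernetes.io/arch
```

At this point, you should see only the single amd64 node from *amd64-pool*.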
Lines changed: 264 additions & 0 deletions
---
title: Deploy Ollama amd64 to the cluster
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

An easy way to experiment with arm64 nodes in your K8s cluster is to deploy arm64 nodes and pods alongside your existing amd64 nodes and pods. In this section of the tutorial, you'll bootstrap the cluster with Ollama on amd64, to simulate an "existing" K8s cluster running Ollama.

### Deployment and Service

1. Copy the following YAML, and save it to a file called *namespace.yaml*:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
```

When the above is applied, a new K8s namespace named *ollama* is created. This is where all the K8s objects created in this tutorial will live.

2. Copy the following YAML, and save it to a file called *amd64_ollama.yaml*:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-amd64-deployment
  labels:
    app: ollama-multiarch
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      arch: amd64
  template:
    metadata:
      labels:
        app: ollama-multiarch
        arch: amd64
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
      - image: ollama/ollama:0.6.1
        name: ollama-multiarch
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        volumeMounts:
        - mountPath: /root/.ollama
          name: ollama-data
      volumes:
      - emptyDir: {}
        name: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-amd64-svc
  namespace: ollama
spec:
  sessionAffinity: None
  ports:
  - nodePort: 30668
    port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    arch: amd64
  type: LoadBalancer
```

When the above is applied:

* A new Deployment called *ollama-amd64-deployment* is created. This Deployment pulls the multi-architecture (both amd64 and arm64) [Ollama image from Docker Hub](https://hub.docker.com/layers/ollama/ollama/0.6.1/images/sha256-28b909914d4e77c96b1c57dea199c60ec12c5050d08ed764d9c234ba2944be63).

Of particular interest is the *nodeSelector* *kubernetes.io/arch*, with the value *amd64*. This ensures that the Deployment only runs on amd64-based nodes, using the amd64 version of the Ollama container image.

* A new load balancer Service *ollama-amd64-svc* is created, which targets all pods with the *arch: amd64* label (the pods created by the amd64 Deployment).

*sessionAffinity: None* is set on this Service to avoid sticky connections, so consecutive requests are not pinned to the same pod.
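
After you apply these manifests (next step), you can verify both behaviors from the command line. A quick sketch using standard kubectl queries against the names defined above:

```bash
# Confirm the pod was scheduled onto an amd64 node by the nodeSelector
kubectl -nollama get pods -o wide

# Confirm the Service's session affinity setting (should print "None")
kubectl -nollama get svc ollama-amd64-svc -o jsonpath='{.spec.sessionAffinity}'
```
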
### Apply the amd64 Deployment and Service

1. Run the following command to apply the namespace, Deployment, and Service definitions:

```bash
kubectl apply -f namespace.yaml
kubectl apply -f amd64_ollama.yaml
```

You should get the following responses back:

```bash
namespace/ollama created
deployment.apps/ollama-amd64-deployment created
service/ollama-amd64-svc created
```

2. Optionally, set the *default Namespace* to *ollama* so you don't need to specify the namespace each time, by entering the following:

```bash
kubectl config set-context --current --namespace=ollama
```

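To confirm the default namespace took effect, one way is to inspect the active context; a minimal sketch:

```bash
# Print the namespace recorded in the current kubectl context
kubectl config view --minify --output 'jsonpath={..namespace}'
```
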
3. Get the status of the nodes, pods, and services by running the following:

```commandline
kubectl get nodes,pods,svc -nollama
```

Your output should be similar to the following, showing one node, one pod, and one service:

```commandline
NAME                                              STATUS   ROLES    AGE   VERSION
node/gke-ollama-on-arm-amd64-pool-62c0835c-93ht   Ready    <none>   77m   v1.31.6-gke.1020000

NAME                                          READY   STATUS    RESTARTS   AGE
pod/ollama-amd64-deployment-cbfc4b865-msftf   1/1     Running   0          16m

NAME                       TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
service/ollama-amd64-svc   LoadBalancer   1.2.2.3      1.2.3.4       80:30668/TCP   16m
```

When the pods show *Running* and the service shows a valid *External IP*, you're ready to test the Ollama amd64 service!
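
If you'd rather block until the pod is ready than poll manually, kubectl can wait on the pod condition. A short sketch, keyed to the *arch: amd64* label used by the Deployment above:

```bash
# Wait up to 5 minutes for the amd64 Ollama pod to report Ready
kubectl -nollama wait --for=condition=Ready pod -l arch=amd64 --timeout=300s
```
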
### Test the Ollama on amd64 web service

{{% notice Note %}}
The following utility, model_util.sh, is provided as a convenience to accompany this Learning Path. It's simply a shell wrapper for kubectl, utilizing the utilities [curl](https://curl.se/), [jq](https://jqlang.org/), [bc](https://www.gnu.org/software/bc/), and [stdbuf](https://www.gnu.org/software/coreutils/manual/html_node/stdbuf-invocation.html). Make sure you have these shell utilities installed before running it.
{{% /notice %}}
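
If you're not sure whether those dependencies are installed, a quick check like this flags anything missing from your PATH:

```bash
# Report any required utility that isn't on the PATH
for cmd in kubectl curl jq bc stdbuf; do
  command -v "$cmd" >/dev/null 2>&1 || echo "Missing: $cmd"
done
```
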
4. Copy the following shell script, and save it to a file called *model_util.sh*:

```bash
#!/bin/bash

echo

# Supported models: https://ollama-operator.ayaka.io/pages/en/guide/supported-models
model_name="llama3.2"
#model_name="mistral"
#model_name="dolphin-phi"

#prompt="Name the two closest stars to earth"
prompt="Create a sentence that makes sense in the English language, with as many palindromes in it as possible"

echo "Server response:"

# Look up the external IP (or hostname) of the load balancer Service for an architecture
get_service_ip() {
  arch=$1
  svc_name="ollama-${arch}-svc"
  kubectl -nollama get svc $svc_name -o jsonpath="{.status.loadBalancer.ingress[*]['ip', 'hostname']}"
}

# Stream an inference request, then compute tokens per second from the final stats
infer_request() {
  svc_ip=$1
  temp=$(mktemp)
  stdbuf -oL curl -s http://$svc_ip/api/generate -d '{
    "model": "'"$model_name"'",
    "prompt": "'"$prompt"'"
  }' | tee $temp

  duration=$(grep eval_count $temp | jq -r '.eval_duration')
  count=$(grep eval_count $temp | jq -r '.eval_count')

  if [[ -n "$duration" && -n "$count" ]]; then
    # eval_duration is in nanoseconds, so scale by 1e9 to get tokens per second
    quotient=$(echo "scale=2;1000000000*$count/$duration" | bc)
    echo "Tokens per second: $quotient"
  else
    echo "Error: eval_count or eval_duration not found in response."
  fi

  rm $temp
}

# Ask the Ollama server to pull the selected model
pull_model() {
  svc_ip=$1
  curl http://$svc_ip/api/pull -d '{
    "model": "'"$model_name"'"
  }'
}

# Simple liveness check against the Ollama HTTP root
hello_request() {
  svc_ip=$1
  curl http://$svc_ip/
}

run_action() {
  arch=$1
  action=$2

  svc_ip=$(get_service_ip $arch)
  echo "Using service endpoint $svc_ip for $action on $arch"

  case $action in
    infer)
      infer_request $svc_ip
      ;;
    pull)
      pull_model $svc_ip
      ;;
    hello)
      hello_request $svc_ip
      ;;
    *)
      echo "Invalid second argument. Use 'infer', 'pull', or 'hello'."
      exit 1
      ;;
  esac
}

case $1 in
  arm64|amd64|multiarch)
    run_action $1 $2
    ;;
  *)
    echo "Invalid first argument. Use 'arm64', 'amd64', or 'multiarch'."
    exit 1
    ;;
esac

echo -e "\n\nPod log output:"
echo;kubectl logs --timestamps -l app=ollama-multiarch -nollama --prefix | sort -k2 | cut -d " " -f 1,2 | tail -1
echo
```

5. Make it executable with the following command:

```bash
chmod 755 model_util.sh
```

This shell script bundles the test and logging commands used in this tutorial into a single place, making it easy to test, troubleshoot, and observe the services you expose.
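
The script expects an architecture (*amd64*, *arm64*, or *multiarch*) as its first argument and an action (*pull*, *infer*, or *hello*) as its second, so typical invocations look like this:

```bash
./model_util.sh amd64 hello   # check that the amd64 service is up
./model_util.sh amd64 pull    # pull the selected model into the amd64 pod
./model_util.sh amd64 infer   # run an inference and report tokens per second
```
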
6. Run the following to make an HTTP request to the amd64 Ollama service on port 80:

```commandline
./model_util.sh amd64 hello
```

You should get back the HTTP response, as well as the log line from the pod that served it:

```commandline
Server response:
Using service endpoint 34.55.25.101 for hello on amd64
Ollama is running

Pod log output:

[pod/ollama-amd64-deployment-cbfc4b865-msftf/ollama-multiarch] 2025-03-25T21:13:49.022522588Z
```

Success is indicated by the words "Ollama is running". If you see them in your output, congratulations: you've successfully bootstrapped your GKE cluster with an amd64 node running a Deployment with the multi-architecture Ollama container image!

Next, we'll do the same thing, but with an Arm node.
