---
title: Deploy Ollama amd64 to the cluster
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

An easy way to experiment with Arm64 nodes in your K8s cluster is to deploy Arm64 nodes and pods alongside your existing amd64 nodes and pods. In this section of the tutorial, you'll bootstrap the cluster with Ollama on amd64 to simulate an "existing" K8s cluster that is already running Ollama.

### Deployment and Service

1. Copy the following YAML, and save it to a file called *namespace.yaml*:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
```

When the above is applied, a new K8s namespace named *ollama* is created. This is where all the K8s objects created in this tutorial will live.

2. Copy the following YAML, and save it to a file called *amd64_ollama.yaml*:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-amd64-deployment
  labels:
    app: ollama-multiarch
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      arch: amd64
  template:
    metadata:
      labels:
        app: ollama-multiarch
        arch: amd64
    spec:
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
      - image: ollama/ollama:0.6.1
        name: ollama-multiarch
        ports:
        - containerPort: 11434
          name: http
          protocol: TCP
        volumeMounts:
        - mountPath: /root/.ollama
          name: ollama-data
      volumes:
      - emptyDir: {}
        name: ollama-data
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-amd64-svc
  namespace: ollama
spec:
  sessionAffinity: None
  ports:
  - nodePort: 30668
    port: 80
    protocol: TCP
    targetPort: 11434
  selector:
    arch: amd64
  type: LoadBalancer
```

When the above is applied:

* A new Deployment called *ollama-amd64-deployment* is created. This Deployment pulls the multi-architecture (amd64 and arm64) [Ollama image from Docker Hub](https://hub.docker.com/layers/ollama/ollama/0.6.1/images/sha256-28b909914d4e77c96b1c57dea199c60ec12c5050d08ed764d9c234ba2944be63).

Of particular interest is the *nodeSelector* *kubernetes.io/arch*, with a value of *amd64*. This ensures that the Deployment's pods are scheduled only onto amd64-based nodes, so the amd64 variant of the Ollama container image is used. You can verify your nodes' architecture labels (and the architectures the image supports) with the commands shown below.

* A new load balancer Service, *ollama-amd64-svc*, is created, which targets all pods carrying the *arch: amd64* label (the pods created by the amd64 Deployment above).

*sessionAffinity: None* is set explicitly on this Service to disable sticky sessions, so successive requests are not pinned to the same pod.

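If you want to confirm this before applying anything, the commands below are a minimal sketch: the first lists each node's architecture label (the label the *nodeSelector* matches on); the second is optional, assumes Docker is installed on your workstation, and lists the architectures published for the Ollama 0.6.1 image.

```bash
# Show each node's CPU architecture label
kubectl get nodes -L kubernetes.io/arch

# Optional: list the architectures published in the ollama/ollama:0.6.1 manifest
docker manifest inspect ollama/ollama:0.6.1 | grep architecture
```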

### Apply the amd64 Deployment and Service

1. Run the following commands to apply the namespace, deployment, and service definitions:

```bash
kubectl apply -f namespace.yaml
kubectl apply -f amd64_ollama.yaml
```

You should get the following responses back:

```bash
namespace/ollama created
deployment.apps/ollama-amd64-deployment created
service/ollama-amd64-svc created
```

2. Optionally, set the default namespace to *ollama* so you don't need to specify the namespace on each command, by entering the following:

```bash
kubectl config set-context --current --namespace=ollama
```
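If you do set the default namespace, you can confirm which namespace your current context now points at with a quick check of the kubectl configuration:

```bash
kubectl config view --minify --output 'jsonpath={..namespace}'; echo
```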

3. Get the status of the nodes, pods, and services by running the following:

```commandline
kubectl get nodes,pods,svc -nollama
```

Your output should be similar to the following, showing one node, one pod, and one service:

```commandline
NAME                                              STATUS   ROLES    AGE   VERSION
node/gke-ollama-on-arm-amd64-pool-62c0835c-93ht   Ready    <none>   77m   v1.31.6-gke.1020000

NAME                                          READY   STATUS    RESTARTS   AGE
pod/ollama-amd64-deployment-cbfc4b865-msftf   1/1     Running   0          16m

NAME                       TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
service/ollama-amd64-svc   LoadBalancer   1.2.2.3      1.2.3.4       80:30668/TCP   16m
```

When the pods show *Running* and the service shows a valid *EXTERNAL-IP*, you're ready to test the Ollama amd64 service!

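If the *EXTERNAL-IP* column still shows `<pending>`, the cloud load balancer is still being provisioned. You can watch the Service until an address appears (press Ctrl+C to stop watching):

```commandline
kubectl get svc ollama-amd64-svc -nollama --watch
```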

### Test the Ollama on amd64 web service

{{% notice Note %}}
The following utility, model_util.sh, is provided as a convenience to accompany this learning path. It's simply a shell wrapper around kubectl, using the utilities [curl](https://curl.se/), [jq](https://jqlang.org/), [bc](https://www.gnu.org/software/bc/), and [stdbuf](https://www.gnu.org/software/coreutils/manual/html_node/stdbuf-invocation.html). Make sure you have these shell utilities installed before running it.
{{% /notice %}}

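One quick way to verify that these utilities are available on your PATH is a short shell loop; this check is just a convenience and isn't part of the learning path itself:

```bash
# Report any required utility that isn't installed
for cmd in curl jq bc stdbuf kubectl; do
  command -v "$cmd" >/dev/null || echo "missing: $cmd"
done
```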
4. Copy the following shell script, and save it to a file called *model_util.sh*:

```bash
#!/bin/bash

echo

# https://ollama-operator.ayaka.io/pages/en/guide/supported-models
model_name="llama3.2"
#model_name="mistral"
#model_name="dolphin-phi"

#prompt="Name the two closest stars to earth"
prompt="Create a sentence that makes sense in the English language, with as many palindromes in it as possible"

echo "Server response:"

# Look up the external IP (or hostname) of the load balancer Service for the given architecture
get_service_ip() {
  arch=$1
  svc_name="ollama-${arch}-svc"
  kubectl -nollama get svc $svc_name -o jsonpath="{.status.loadBalancer.ingress[*]['ip', 'hostname']}"
}

# Stream an inference request to the service, then compute tokens per second
# from the eval_count and eval_duration fields in the final response line
infer_request() {
  svc_ip=$1
  temp=$(mktemp)
  stdbuf -oL curl -s http://$svc_ip/api/generate -d '{
    "model": "'"$model_name"'",
    "prompt": "'"$prompt"'"
  }' | tee $temp

  duration=$(grep eval_count $temp | jq -r '.eval_duration')
  count=$(grep eval_count $temp | jq -r '.eval_count')

  if [[ -n "$duration" && -n "$count" ]]; then
    quotient=$(echo "scale=2;1000000000*$count/$duration" | bc)
    echo "Tokens per second: $quotient"
  else
    echo "Error: eval_count or eval_duration not found in response."
  fi

  rm $temp
}

# Ask the Ollama server to pull the selected model
pull_model() {
  svc_ip=$1
  curl http://$svc_ip/api/pull -d '{
    "model": "'"$model_name"'"
  }'
}

# Simple reachability check against the service root endpoint
hello_request() {
  svc_ip=$1
  curl http://$svc_ip/
}

run_action() {
  arch=$1
  action=$2

  svc_ip=$(get_service_ip $arch)
  echo "Using service endpoint $svc_ip for $action on $arch"

  case $action in
    infer)
      infer_request $svc_ip
      ;;
    pull)
      pull_model $svc_ip
      ;;
    hello)
      hello_request $svc_ip
      ;;
    *)
      echo "Invalid second argument. Use 'infer', 'pull', or 'hello'."
      exit 1
      ;;
  esac
}

case $1 in
  arm64|amd64|multiarch)
    run_action $1 $2
    ;;
  *)
    echo "Invalid first argument. Use 'arm64', 'amd64', or 'multiarch'."
    exit 1
    ;;
esac

echo -e "\n\nPod log output:"
echo;kubectl logs --timestamps -l app=ollama-multiarch -nollama --prefix | sort -k2 | cut -d " " -f 1,2 | tail -1
echo
```

5. Make it executable with the following command:

```bash
chmod 755 model_util.sh
```

This shell script conveniently bundles the test and logging commands into a single place, making it easy to test, troubleshoot, and observe the services exposed in this tutorial.

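If you'd like to see what the script's *hello* action does under the hood, it simply looks up the Service's external address and sends a plain HTTP GET. A minimal manual equivalent, assuming the load balancer exposes an IP address (as on GKE), is:

```bash
# Look up the external IP of the amd64 Service, then make a plain HTTP request to it
SVC_IP=$(kubectl -nollama get svc ollama-amd64-svc -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl http://$SVC_IP/
```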

6. Run the following to make an HTTP request to the amd64 Ollama service on port 80:

```commandline
./model_util.sh amd64 hello
```

You should get back the HTTP response, as well as the log line from the pod that served it:

```commandline
Server response:
Using service endpoint 34.55.25.101 for hello on amd64
Ollama is running

Pod log output:

[pod/ollama-amd64-deployment-cbfc4b865-msftf/ollama-multiarch] 2025-03-25T21:13:49.022522588Z
```

Success is indicated by the words "Ollama is running". If you see this in your output, congratulations: you've successfully bootstrapped your GKE cluster with an amd64 node running a Deployment of the multi-architecture Ollama container image!

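The script also supports the *pull* and *infer* actions defined above, so you can optionally pull a model into the amd64 pod and run an inference against it right away. Note that the download can take several minutes, and because the Deployment uses an *emptyDir* volume, the downloaded model is lost if the pod is recreated:

```commandline
./model_util.sh amd64 pull
./model_util.sh amd64 infer
```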

Next, we'll do the same thing, but with an Arm64 node.