*Last Update: 15.08.2025*

<br><h1 align="center">GPU-Assisted Local LLM Serving<br>Using the Ollama Open Source Tool and Compute Cloud@Customer</h1>

<a id="toc"></a>
## Table of Contents
1. [Introduction](#intro)
2. [VM Creation and Configuration](#create)
3. [Proxy and Security Settings](#proxysec)
4. [Ollama Installation](#ollama)
5. [Testing](#test)
6. [Comparative Results](#results)
7. [Useful References](#ref)

<a id="intro"></a>
## Introduction

This article builds on the [Ollama Local LLM](cloud-infrastructure/private-cloud-and-edge/compute-cloud-at-customer/local-llm) article, which can be used as a reference for implementation and resources. It demonstrates the differences when building an Ollama installation using the Compute Cloud@Customer (C3) GPU expansion option. This platform is suited to AI, computational and multimedia workloads. It supports inferencing, Retrieval-Augmented Generation (RAG), training and fine-tuning for small to medium LLMs (approximately 70B parameters) in a 4-GPU VM configuration.

Please consult the [Compute Cloud@Customer Compute Expansion Options datasheet](https://www.oracle.com/uk/a/ocom/docs/[email protected]) for a full description and specifications of the GPU nodes.

The high-level differences between a general-purpose VM and a GPU-enabled VM are:
* For a GPU-enabled VM, High Performance storage is normally used for boot volumes, LLM and RAG stores, and vector databases. Balanced Performance storage may be used where appropriate, e.g. for object and file storage.
* An extended boot volume is required to accommodate the GPU drivers, SDKs, bulky system libraries, applications and development space.
* GPU-specific platform images must be used to incorporate GPUs into VMs.

VM shapes are available in:
| Shape | Resources |
|-------|-----------|
| C3.VM.GPU.L40S.1 | 1 GPU<br>27 OCPUs<br>200 GB memory |
| C3.VM.GPU.L40S.2 | 2 GPUs<br>54 OCPUs<br>400 GB memory |
| C3.VM.GPU.L40S.3 | 3 GPUs<br>81 OCPUs<br>600 GB memory |
| C3.VM.GPU.L40S.4 | 4 GPUs<br>108 OCPUs<br>800 GB memory |
<br>

Considerations:
* A firm grasp of C3 and OCI concepts and administration is assumed.
* Familiarity with Linux, in particular Oracle Linux 9 for the server configuration, is assumed.
* The creation and integration of a development environment is outside the scope of this document.
* Oracle Linux 9 and macOS Sequoia 15.6 clients were used for testing, but Windows is also widely supported.

[Back to top](#toc)<br>
<br>

<a id="create"></a>

## VM Creation and Configuration
### *References:*
* *[Creating an Instance in the Oracle Cloud Infrastructure Documentation](https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/compute/creating-an-instance.htm#creating-an-instance)*
* *[Creating and Attaching Block Volumes in the Oracle Cloud Infrastructure Documentation](https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/block/creating-and-attaching-block-volumes.htm)*
* *[Compute Shapes (GPU Shapes) in the Oracle Cloud Infrastructure Documentation](https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/compute/compute-shapes.htm#compute-shapes__compute-shape-gpu)*
<br><br>

| Requirement | Specification | Remarks |
|----------|----------|----------|
| Shape | C3.VM.GPU.L40S.[1234] | Select either 1, 2, 3 or 4 GPUs |
| Image | Oracle Linux 8/9 | Note the difference in the NVIDIA driver installation instructions |
| Boot Volume | >= 150 GB on Performance Storage<br>(VPU = 20) | * Upwards of 150 GB is recommended to accommodate a large software footprint<br>* 250 GB was used for this article |
| Data Volume | >= 250 GB on Performance Storage<br>(VPU = 20) | * Depends on the size of LLMs that will be used<br>* 250 GB was used for this article |
| NVIDIA Drivers | Latest | Install as per NVIDIA CUDA Toolkit instructions in the above GPU Shapes documentation |
| Hostname | ol9-gpu ||
| Public IP | Yes ||
<br>
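
For repeatable builds, the same instance can also be created from the command line. Below is a minimal sketch using the OCI CLI (assuming it is configured against your C3 endpoint); the availability domain, all OCIDs and the SSH key path are placeholders that must be substituted with your own values:

```
# Sketch only: launch a 1-GPU instance with a 250 GB boot volume.
# The AD name, OCIDs and key path are placeholders, not real values.
oci compute instance launch \
  --availability-domain "<AD_name>" \
  --compartment-id "<compartment_OCID>" \
  --shape "C3.VM.GPU.L40S.1" \
  --display-name "ol9-gpu" \
  --image-id "<OL9_GPU_platform_image_OCID>" \
  --subnet-id "<subnet_OCID>" \
  --assign-public-ip true \
  --boot-volume-size-in-gbs 250 \
  --ssh-authorized-keys-file ~/.ssh/id_rsa.pub
```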

### *Boot Volume Creation Specification:*
<p><img src="./images/Screenshot 2025-08-13 at 09.33.08.png" title="Boot Volume Specification" width="75%" style="float:right"/></p>

### *Next Steps:*
[1] Connect to the VM<br>
[2] Extend the root filesystem:<br>
```sudo /usr/libexec/oci-growfs -y```
<p><img src="./images/Screenshot 2025-08-13 at 21.49.30.png" title="Before resize" width="75%" style="float:right"/></p>
<p><img src="./images/Screenshot 2025-08-13 at 21.49.56.png" title="After resize" width="75%" style="float:right"/></p>
[3] Attach and configure the data volume for automatic mounting at startup (see the sketch after these steps):
<p><img src="./images/Screenshot 2025-08-15 at 09.14.53.png" title="All filesystems" width="75%" style="float:right"/></p>
[4] Install the NVIDIA CUDA Toolkit as directed and verify that it is functioning:
<p><img src="./images/Screenshot 2025-08-14 at 08.03.11.png" title="Successful NVIDIA Driver Installation" width="75%" style="float:right"/></p>
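
For step [3], a minimal sketch of formatting and persistently mounting the data volume, assuming the volume has been attached with the iSCSI commands shown in the console and appears as `/dev/sdb` (the device name and mount point are assumptions; adjust to your environment):

```
# Format the data volume and create the mount point used later for OLLAMA_MODELS
sudo mkfs.xfs /dev/sdb
sudo mkdir -p /mnt/llm-repo

# Mount by UUID; _netdev and nofail keep the boot resilient if the
# iSCSI device is slow to appear
UUID=$(sudo blkid -s UUID -o value /dev/sdb)
echo "UUID=${UUID} /mnt/llm-repo xfs defaults,_netdev,nofail 0 2" | sudo tee -a /etc/fstab

sudo mount -a
df -h /mnt/llm-repo
```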

[Back to top](#toc)<br>
<br>

<a id="proxysec"></a>
## Proxy and Security Settings
### *Proxy*

If the network is behind a proxy, add the following to the `/etc/profile.d/proxy.sh` file to set the proxy environment variables system-wide:

```
http_proxy=http://<proxy_server>:80
https_proxy=http://<proxy_server>:80
no_proxy="127.0.0.1,localhost"
export http_proxy
export https_proxy
export no_proxy
```

>[!TIP]
>The `no_proxy` environment variable can be expanded to include your internal domains. It is not required to list IP addresses in internal subnets of the C3.

Edit the `/etc/yum.conf` file to include the following line:
```
proxy=http://<proxy_server>:80
```
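
To confirm that dnf honours the proxy setting, a quick check is to rebuild the repository metadata cache; failures here usually point to an incorrect `proxy=` line or blocked egress:

```
sudo dnf clean all
sudo dnf makecache
```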

### *Security*

>[!NOTE]
>Refer to the article [Why You Should Trust Meta AI's Ollama for Data Security](https://myscale.com/blog/trust-meta-ai-ollama-data-security) for further information on the benefits of running LLMs locally.

#### Open the Firewall for the Ollama Listening Port

```
sudo firewall-cmd --set-default-zone=public
```
```
sudo firewall-cmd --add-port=11434/tcp --add-service=http --zone=public
```
```
sudo firewall-cmd --runtime-to-permanent
```
```
sudo firewall-cmd --reload
```
```
sudo firewall-cmd --info-zone=public
```

<p><img src="./images/Screenshot 2025-08-14 at 08.10.55.png" title="Firewall service listing" width="75%" style="float:right"/></p>

#### Grant VCN Access through Security List

Edit your VCN's default security list to reflect the following:

<p><img src="./images/security-list.png" title="Ollama server port access" width="100%" style="float:right"/></p>

Should you want to limit access to a specific IP address, the source should be:

<p><img src="./images/security-list-individual.png" title="Access limited to a single IP address" width="100%" style="float:right"/></p>

>[!TIP]
>To avoid continuous changes to the security list, obtain a reserved IP address for your client machine from the network administrator.

[Back to top](#toc)<br>
<br>

<a id="ollama"></a>
## Ollama Installation

### *General*

The installation comprises the following components:

| Server | Client<sup>1</sup> |
|----------|----------|
| Ollama | GUI Tools<sup>2</sup><br>Character based tools<br>API Development kits<sup>3</sup> |

<sup>1</sup> *Optional*<br>
<sup>2</sup> Examples of GUIs: [Msty](https://msty.app/), [OpenWebUI](https://openwebui.com/), [ollama-chats](https://github.com/drazdra/ollama-chats)<br>
<sup>3</sup> See [Ollama documentation](https://github.com/ollama/ollama/tree/main/docs)
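
For character-based access from a client that already has the `ollama` binary installed, the CLI can simply be pointed at the remote server through the `OLLAMA_HOST` environment variable; a short sketch using this article's hostname:

```
# Point the local ollama CLI at the C3 server instead of a local instance
export OLLAMA_HOST=http://ol9-gpu:11434
ollama list            # models available on the server
ollama run gemma3      # interactive session served by the remote GPU
```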

### *Server Installation*

```
cd /tmp
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
sudo chmod +x /usr/bin/ollama
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
```
```
sudo tee /usr/lib/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HTTPS_PROXY=http://<IP_address>:<port>"
Environment="OLLAMA_MODELS=/mnt/llm-repo"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

[Install]
WantedBy=default.target
EOF
```

The `Environment="HTTPS_PROXY=http://<IP_address>:<port>"` line should be omitted if a proxy is not applicable.<br>

Enable and start Ollama:
```
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```

Ollama will be accessible at `http://127.0.0.1:11434` or `http://<your_server_IP>:11434`.

Grant the `ollama` user ownership of the model repository on the data volume:
```
sudo chown ollama:ollama /mnt/llm-repo
sudo chmod 755 /mnt/llm-repo
```
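
A quick health check of the service at this point (a sketch; the journal filter matches the unit created above):

```
sudo systemctl status ollama --no-pager
sudo journalctl -u ollama --no-pager | tail -n 50
```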

[Back to top](#toc)<br>
<br>

<a id="test"></a>
## Testing

From your client machine, test the accessibility of the port and the availability of the Ollama server:

```
nc -zv ol9-gpu 11434
curl http://ol9-gpu:11434
curl -I http://ol9-gpu:11434
```
<p><img src="./images/Screenshot 2025-08-14 at 09.08.30.png" title="Ollama remote test results" width="75%" style="float:right"/></p>

Log in to `ol9-gpu` and note the command line options that are available:

```
ollama
```

<p><img src="./images/ollama-syntax.png" title="Ollama syntax" width="75%" style="float:right"/></p>

Also note the environment variable options that are available (these can be set in `/usr/lib/systemd/system/ollama.service`):

```
ollama help serve
```

<p><img src="./images/ollama-env-var.png" title="Ollama environment variables" width="75%" style="float:right"/></p>

Download and test your first LLM. You will notice that `/mnt/llm-repo` is populated with model data, which you can confirm by running `ls -lR /mnt/llm-repo`:

<p><img src="./images/Screenshot 2025-08-15 at 09.15.15.png" title="Ollama pull/test gemma3" width="75%" style="float:right"/></p>
<p><img src="./images/Screenshot 2025-08-15 at 09.22.42.png" title="Ollama pull/test gemma3" width="75%" style="float:right"/></p>
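
The screenshots above follow this general pattern; a sketch using the `gemma3` model from the Ollama library:

```
# Download the model into the repository defined by OLLAMA_MODELS (/mnt/llm-repo)
ollama pull gemma3

# One-shot test prompt (omit the quoted prompt for an interactive session)
ollama run gemma3 "Give a one-sentence definition of a large language model."

# Confirm that the model files have landed in the repository
ls -lR /mnt/llm-repo | head
```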

Ensure that the GPU is enabled:

```
nvidia-smi
sudo journalctl -u ollama
ollama show gemma3
```
In the `nvidia-smi` output, look for the Ollama process `/usr/bin/ollama`:
<p><img src="./images/Screenshot 2025-08-14 at 08.51.44.png" title="GPU present" width="75%" style="float:right"/></p>
In the journal, look for the `L40S` entry:
<p><img src="./images/Screenshot 2025-08-14 at 08.33.50.png" title="GPU detected in logs" width="75%" style="float:right"/></p>
Look for 100% GPU enablement under the PROCESSOR column:
<p><img src="./images/Screenshot 2025-08-14 at 09.04.51.png" title="Model info" width="75%" style="float:right"/></p>
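
To watch GPU utilisation while a prompt is being generated, a simple monitoring sketch (run it in a second terminal):

```
# Sample GPU utilisation and memory once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1
```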

Run some more tests from your client to exercise the APIs:

```
curl http://ol9-gpu:11434/api/tags

curl -X POST http://ol9-gpu:11434/api/generate -d '{
  "model": "gemma3",
  "prompt":"Hello Gemma3!",
  "stream": false
  }'

curl http://ol9-gpu:11434/api/ps
```

<p><img src="./images/Screenshot 2025-08-15 at 09.33.32.png" title="Ollama additional tests" width="75%" style="float:right"/></p>

1. `curl http://ol9-gpu:11434/api/tags` returns a list of installed LLMs
2. `curl http://ol9-gpu:11434/api/ps` returns a list of LLMs currently loaded into memory

>[!TIP]
>The duration that an LLM stays loaded in memory can be adjusted by changing the `OLLAMA_KEEP_ALIVE` environment variable (default = 5 minutes).
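
The keep-alive period can be set server-wide in the systemd unit, or per request through the API's `keep_alive` field; a sketch (the 30-minute value is only an example):

```
# Server-wide: add to the [Service] section of ollama.service, then
# run systemctl daemon-reload and restart the service:
#   Environment="OLLAMA_KEEP_ALIVE=30m"

# Per request:
curl http://ol9-gpu:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Hello again!",
  "keep_alive": "30m",
  "stream": false
}'
```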

[Back to top](#toc)<br>
<br>

<a id="results"></a>
## Comparative Results
Since benchmarking is out of scope for this article, a cost:benefit comparison was made between a GPU-enabled VM and a "normal" OCPU-only VM. The OCPU VM was sized so that, run over a period of one month, its cost matches that of a 1-GPU VM.<br><br>
The following script was used:<br>

```
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "What is a cyanobacterial bloom?",
  "stream":false
}'
```
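
The durations in the `/api/generate` response are reported in nanoseconds. A sketch for pulling out the metrics of interest and converting them to seconds, assuming `jq` is installed on the client:

```
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "What is a cyanobacterial bloom?",
  "stream": false
}' | jq '{
  total_duration_s:       (.total_duration / 1e9),
  load_duration_s:        (.load_duration / 1e9),
  prompt_eval_count:      .prompt_eval_count,
  prompt_eval_duration_s: (.prompt_eval_duration / 1e9),
  eval_count:             .eval_count,
  eval_duration_s:        (.eval_duration / 1e9),
  tokens_per_second:      (.eval_count / (.eval_duration / 1e9))
}'
```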
<br>
The comparison is summarised in the following table:
<br><br>

| Metric | OCPU<br>Duration (s) | GPU<br>Duration (s) | Difference<br>Factor |
|----|----|----|----|
| total_duration | 23.46 | 6.14 | 3.8x |
| load_duration | 0.07 | 0.06 | 1.2x |
| prompt_eval_count | 16 | 16 | |
| prompt_eval_duration | 0.03 | 0.02 | 1.8x |
| eval_count | 799 | 819 | |
| eval_duration | 23.35 | 6.06 | 3.8x |

Where:<br><br>
```total_duration```: total time taken for the request<br>
```load_duration```: time spent loading the model<br>
```prompt_eval_count```: number of tokens in the prompt<br>
```prompt_eval_duration```: time spent evaluating the prompt<br>
```eval_count```: number of tokens in the response<br>
```eval_duration```: time spent generating the response<br>

[Back to top](#toc)<br>
<br>

<a id="ref"></a>
## Useful References

* *[Ollama documentation](https://github.com/ollama/ollama/tree/main/docs)*
* *[Pre-trained Ollama models](https://ollama.com/library)*
* *[Msty GUI client](https://msty.app/)*
* *[OpenWebUI](https://github.com/open-webui/open-webui)*
* *[Ollama-chats](https://github.com/drazdra/ollama-chats)*
* *[Ollama Python library](https://github.com/ollama/ollama-python)*
* *[Getting started with Ollama for Python](https://github.com/RamiKrispin/ollama-poc)*
* *[Ollama and Oracle Database 23ai vector search](https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/generate-summary-using-ollama.html)*
* *[Ollama API Usage Examples](https://www.gpu-mart.com/blog/ollama-api-usage-examples)*
* *[Ollama API](https://github.com/ollama/ollama/blob/main/docs/api.md)*
* *[Ollama Structured Outputs](https://ollama.com/blog/structured-outputs)*
* *[Ollama RAG: Reading Assistant Using OLLama for Text Chatting](https://github.com/mtayyab2/RAG)*
* *[RAG with LLaMA Using Ollama: A Deep Dive into Retrieval-Augmented Generation](https://medium.com/@danushidk507/rag-with-llama-using-ollama-a-deep-dive-into-retrieval-augmented-generation-c58b9a1cfcd3)*
* *[Ollama: Embedding Models](https://ollama.com/blog/embedding-models)*
* *[Ollama LLM RAG](https://github.com/digithree/ollama-rag)*
* *[Ollama's new engine for multimodal models](https://ollama.com/blog/multimodal-models)*

[Back to top](#toc)<br>
<br>

## License
Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](LICENSE) for more details.
