*Last Update: 15.08.2025*


<br><h1 align="center">GPU-Assisted Local LLM Serving<br>Using the Ollama Open Source Tool and Compute Cloud@Customer</h1>


<a id="toc"></a>
## Table of Contents
1. [Introduction](#intro)
2. [VM Creation and Configuration](#create)
3. [Proxy and Security Settings](#proxysec)
4. [Ollama Installation](#ollama)
5. [Testing](#test)
6. [Comparative Results](#results)
7. [Useful References](#ref)


<a id="intro"></a>
## Introduction

This article builds on the [Ollama Local LLM](cloud-infrastructure/private-cloud-and-edge/compute-cloud-at-customer/local-llm) article, which can be used as a reference for implementation and resources. It demonstrates the differences when building an Ollama installation using the Compute Cloud@Customer (C3) GPU expansion option. This platform is suited to AI, computational and multimedia workloads. It supports inferencing, Retrieval-Augmented Generation (RAG), training and fine-tuning for small to medium LLMs (approximately 70B parameters) in a 4-GPU VM configuration.

Please consult the [Compute Cloud@Customer Compute Expansion Options datasheet](https://www.oracle.com/uk/a/ocom/docs/[email protected]) for a full description and specifications of the GPU nodes.

The high-level differences between a general-purpose VM and a GPU-enabled VM are:
* For a GPU-enabled VM, High Performance storage is normally used for boot volumes, LLM and RAG stores, and vector databases. Balanced Performance storage may, however, be used where appropriate, e.g. for object and file storage.
* An extended boot volume is required to accommodate the required GPU drivers, SDKs, bulky system libraries, applications and development space.
* GPU-specific platform images must be used to incorporate GPUs into VMs.

VM shapes are available in:
| Shape | Resources |
|-------|-----------|
| C3.VM.GPU.L40S.1 | 1 GPU<br>27 OCPUs<br>200 GB memory |
| C3.VM.GPU.L40S.2 | 2 GPUs<br>54 OCPUs<br>400 GB memory |
| C3.VM.GPU.L40S.3 | 3 GPUs<br>81 OCPUs<br>600 GB memory |
| C3.VM.GPU.L40S.4 | 4 GPUs<br>108 OCPUs<br>800 GB memory |
<br>

Considerations:
* A firm grasp of C3 and OCI concepts and administration is assumed.
* Familiarity with Linux, in particular Oracle Linux 9 for the server configuration, is assumed.
* The creation and integration of a development environment is outside the scope of this document.
* Oracle Linux 9 and macOS Sequoia 15.6 clients were used for testing; Windows is also widely supported.

[Back to top](#toc)<br>
<br>

<a id="create"></a>

## VM Creation and Configuration
### *References:*
* *[Creating an Instance in the Oracle Cloud Infrastructure Documentation](https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/compute/creating-an-instance.htm#creating-an-instance)*
* *[Creating and Attaching Block Volumes in the Oracle Cloud Infrastructure Documentation](https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/block/creating-and-attaching-block-volumes.htm)*
* *[Compute Shapes (GPU Shapes) in the Oracle Cloud Infrastructure Documentation](https://docs.oracle.com/en-us/iaas/compute-cloud-at-customer/topics/compute/compute-shapes.htm#compute-shapes__compute-shape-gpu)*
<br><br>

| Requirement | Specification | Remarks |
|----------|----------|----------|
| Shape | C3.VM.GPU.L40S.[1234] | Select either 1, 2, 3 or 4 GPUs |
| Image | Oracle Linux 8/9 | Note the difference in the NVIDIA driver installation instructions |
| Boot Volume | >= 150 GB on Performance Storage<br>(VPU = 20) | * Upwards of 150 GB is recommended to accommodate a large software footprint<br>* 250 GB was used for this article |
| Data Volume | >= 250 GB on Performance Storage<br>(VPU = 20) | * Depends on the size of LLMs that will be used<br>* 250 GB was used for this article |
| NVIDIA Drivers | Latest | Install as per the NVIDIA CUDA Toolkit instructions in the GPU Shapes documentation above |
| Hostname | ol9-gpu ||
| Public IP | Yes ||
<br>

### *Boot Volume Creation Specification:*
<p><img src="./images/Screenshot 2025-08-13 at 09.33.08.png" title="Boot Volume Specification" width="75%" style="float:right"/></p>

### *Next Steps:*
[1] Connect to the VM<br>
[2] Extend the root filesystem:<br>
```sudo /usr/libexec/oci-growfs -y```
<p><img src="./images/Screenshot 2025-08-13 at 21.49.30.png" title="Before resize" width="75%" style="float:right"/></p>
<p><img src="./images/Screenshot 2025-08-13 at 21.49.56.png" title="After resize" width="75%" style="float:right"/></p>
[3] Attach and configure the data volume for automatic mounting at startup (a sketch of this step is shown after this list):
<p><img src="./images/Screenshot 2025-08-15 at 09.14.53.png" title="All filesystems" width="75%" style="float:right"/></p>
[4] Install the NVIDIA CUDA Toolkit as directed and verify that it is functioning:
<p><img src="./images/Screenshot 2025-08-14 at 08.03.11.png" title="Successful NVIDIA Driver Installation" width="75%" style="float:right"/></p>

[Back to top](#toc)<br>
<br>

<a id="proxysec"></a>
## Proxy and Security Settings
### *Proxy*

If the network is proxied, add the following to the `/etc/profile.d/proxy.sh` file to set the proxy environment variables system-wide:

```
http_proxy=http://<proxy_server>:80
https_proxy=http://<proxy_server>:80
no_proxy="127.0.0.1,localhost"
export http_proxy
export https_proxy
export no_proxy
```

>[!TIP]
>The `no_proxy` environment variable can be expanded to include your internal domains. It is not required to list IP addresses in internal subnets of the C3.

Edit the `/etc/yum.conf` file to include the following line:
```
proxy=http://<proxy_server>:80
```
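
To confirm that the proxy settings are effective in a new shell, the following quick checks can be used (the external URL is arbitrary):

```
# Load the new environment variables and confirm they are set
source /etc/profile.d/proxy.sh
env | grep -i proxy

# Confirm outbound HTTPS access through the proxy
curl -I https://ollama.com

# Confirm dnf can reach its repositories through the proxy
sudo dnf repolist
```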

### *Security*

>[!NOTE]
>Refer to the article [Why You Should Trust Meta AI's Ollama for Data Security](https://myscale.com/blog/trust-meta-ai-ollama-data-security) for further information on the benefits of running LLMs locally.

#### Open the Firewall for the Ollama Listening Port

```
sudo firewall-cmd --set-default-zone=public
```
```
sudo firewall-cmd --add-port=11434/tcp --add-service=http --zone=public
```
```
sudo firewall-cmd --runtime-to-permanent
```
```
sudo firewall-cmd --reload
```
```
sudo firewall-cmd --info-zone=public
```

<p><img src="./images/Screenshot 2025-08-14 at 08.10.55.png" title="Firewall service listing" width="75%" style="float:right"/></p>

#### Grant VCN Access through Security List

Edit your VCN's default security list to reflect the following:

<p><img src="./images/security-list.png" title="Ollama server port access" width="100%" style="float:right"/></p>

Should you want to limit access to a specific IP address, the source should be:

<p><img src="./images/security-list-individual.png" title="Access limited to a single IP address" width="100%" style="float:right"/></p>

>[!TIP]
>To avoid continuous changes to the security list, obtain a reserved IP address for your client machine from the network administrator.
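
If you prefer the OCI CLI to the console, a hedged sketch of adding the equivalent ingress rule follows. The security list OCID and source CIDR are placeholders, and `update` replaces the entire ingress rule list, so any existing rules must be included in the JSON file as well:

```
# rules.json must contain ALL ingress rules, existing ones included (values below are placeholders)
cat > rules.json <<'EOF'
[
  {
    "protocol": "6",
    "source": "203.0.113.10/32",
    "isStateless": false,
    "tcpOptions": { "destinationPortRange": { "min": 11434, "max": 11434 } }
  }
]
EOF

# Replace the ingress rules of the VCN's security list (OCID is a placeholder)
oci network security-list update \
  --security-list-id ocid1.securitylist.oc1..example \
  --ingress-security-rules file://rules.json
```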

[Back to top](#toc)<br>
<br>

<a id="ollama"></a>
## Ollama Installation

### *General*

The installation comprises the following components:

| Server | Client<sup>1</sup> |
|----------|----------|
| Ollama | GUI tools<sup>2</sup><br>Character-based tools<br>API development kits<sup>3</sup> |

<sup>1</sup> *Optional*<br>
<sup>2</sup> Examples of GUIs: [Msty](https://msty.app/), [OpenWebUI](https://openwebui.com/), [ollama-chats](https://github.com/drazdra/ollama-chats)<br>
<sup>3</sup> See the [Ollama documentation](https://github.com/ollama/ollama/tree/main/docs)

### *Server Installation*

```
cd /tmp
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
sudo chmod +x /usr/bin/ollama
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
```
```
sudo tee /usr/lib/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="HTTPS_PROXY=http://<IP_address>:<port>"
Environment="OLLAMA_MODELS=/mnt/llm-repo"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"

[Install]
WantedBy=default.target
EOF
```

The `Environment="HTTPS_PROXY=http://<IP_address>:<port>"` line should be omitted if a proxy is not applicable.<br>

Enable and start Ollama:
```
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
```

Ollama will be accessible at http://127.0.0.1:11434 or http://<your_server_IP>:11434.

Grant the `ollama` user ownership of the model repository:
```
sudo chown ollama:ollama /mnt/llm-repo
sudo chmod 755 /mnt/llm-repo
```
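
A quick health check after starting the service: the service should be active, the model repository should be owned by the `ollama` user, and the root endpoint should reply with the plain-text string "Ollama is running":

```
# Confirm the service is active and inspect recent log lines
systemctl status ollama --no-pager
sudo journalctl -u ollama -n 20 --no-pager

# Confirm ownership of the model repository
ls -ld /mnt/llm-repo

# The root endpoint replies with "Ollama is running"
curl http://127.0.0.1:11434
```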

[Back to top](#toc)<br>
<br>

<a id="test"></a>
## Testing

From your local client machine, test the accessibility of the port and the availability of the Ollama server:

```
nc -zv ol9-gpu 11434
curl http://ol9-gpu:11434
curl -I http://ol9-gpu:11434
```
<p><img src="./images/Screenshot 2025-08-14 at 09.08.30.png" title="Ollama remote test results" width="75%" style="float:right"/></p>

Log in to `ol9-gpu` and note the command line options that are available:

```
ollama
```

<p><img src="./images/ollama-syntax.png" title="Ollama syntax" width="75%" style="float:right"/></p>

Also note the environment variable options that are available (these can be set in `/usr/lib/systemd/system/ollama.service`):

```
ollama help serve
```

<p><img src="./images/ollama-env-var.png" title="Ollama environment variables" width="75%" style="float:right"/></p>

Download and test your first LLM (you will notice the population of `/mnt/llm-repo` with data by running `ls -lR /mnt/llm-repo`):

<p><img src="./images/Screenshot 2025-08-15 at 09.15.15.png" title="Ollama pull/test gemma3" width="75%" style="float:right"/></p>
<p><img src="./images/Screenshot 2025-08-15 at 09.22.42.png" title="Ollama pull/test gemma3" width="75%" style="float:right"/></p>

Ensure that the GPU is enabled:

```
nvidia-smi
sudo journalctl -u ollama
ollama show gemma3
```
Look for the Ollama process `/usr/bin/ollama`:
<p><img src="./images/Screenshot 2025-08-14 at 08.51.44.png" title="GPU present" width="75%" style="float:right"/></p>
Look for the `L40S` entry:
<p><img src="./images/Screenshot 2025-08-14 at 08.33.50.png" title="GPU detected in logs" width="75%" style="float:right"/></p>
Look for the 100% GPU entry under the PROCESSOR column:
<p><img src="./images/Screenshot 2025-08-14 at 09.04.51.png" title="Model info" width="75%" style="float:right"/></p>

Run some more tests from your client to test the APIs:

```
curl http://ol9-gpu:11434/api/tags

curl -X POST http://ol9-gpu:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Hello Gemma3!",
  "stream": false
}'

curl http://ol9-gpu:11434/api/ps
```

<p><img src="./images/Screenshot 2025-08-15 at 09.33.32.png" title="Ollama additional tests" width="75%" style="float:right"/></p>

1. `curl http://ol9-gpu:11434/api/tags` returns a list of installed LLMs
2. `curl http://ol9-gpu:11434/api/ps` returns a list of LLMs already loaded into memory
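
If `jq` is available on the client, the JSON responses are easier to read; for example, to list only the installed model names from the `models` array:

```
# Pretty-print the full response
curl -s http://ol9-gpu:11434/api/tags | jq .

# Extract only the model names
curl -s http://ol9-gpu:11434/api/tags | jq -r '.models[].name'
```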

>[!TIP]
>The length of time a model stays loaded in memory can be adjusted by changing the `OLLAMA_KEEP_ALIVE` environment variable (default = 5 minutes).
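
For example, to keep models resident for 30 minutes (an illustrative value), the variable can be added to the service definition in the same way as the other `Environment=` lines:

```
# Add to the [Service] section of /usr/lib/systemd/system/ollama.service:
#   Environment="OLLAMA_KEEP_ALIVE=30m"

# Reload systemd and restart Ollama for the change to take effect
sudo systemctl daemon-reload
sudo systemctl restart ollama
```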

[Back to top](#toc)<br>
<br>

<a id="results"></a>
## Comparative Results
Since benchmarking is out of scope for this article, a cost:benefit comparison was made between a GPU-enabled VM and a "normal" OCPU-only VM. The OCPU VM was sized so that its cost over a period of one month matched that of a 1-GPU VM.<br><br>
The following script was used:<br>

```
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "What is a cyanobacterial bloom?",
  "stream": false
}'
```
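The durations in the table below were taken from the JSON response of this call. The API reports them in nanoseconds, so a small `jq` filter (assuming `jq` is installed) can convert them to seconds:

```
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "What is a cyanobacterial bloom?",
  "stream": false
}' | jq '{
  total_duration_s: (.total_duration / 1e9),
  load_duration_s: (.load_duration / 1e9),
  prompt_eval_count,
  prompt_eval_duration_s: (.prompt_eval_duration / 1e9),
  eval_count,
  eval_duration_s: (.eval_duration / 1e9)
}'
```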
<br>
The comparison is summarised in the following table:
<br><br>

| Metric | OCPU<br>Duration (s) | GPU<br>Duration (s) | Difference<br>Factor |
|----|----|----|----|
| total_duration | 23.46 | 6.14 | 3.8x |
| load_duration | 0.07 | 0.06 | 1.2x |
| prompt_eval_count | 16 | 16 | |
| prompt_eval_duration | 0.03 | 0.02 | 1.8x |
| eval_count | 799 | 819 | |
| eval_duration | 23.35 | 6.06 | 3.8x |

Where:<br><br>
```total_duration```: time spent generating the response<br>
```load_duration```: time spent loading the model<br>
```prompt_eval_count```: number of tokens in the prompt<br>
```prompt_eval_duration```: time spent evaluating the prompt<br>
```eval_count```: number of tokens in the response<br>
```eval_duration```: time spent generating the response<br>

[Back to top](#toc)<br>
<br>

<a id="ref"></a>
## Useful References

* *[Ollama documentation](https://github.com/ollama/ollama/tree/main/docs)*
* *[Pre-trained Ollama models](https://ollama.com/library)*
* *[Msty GUI client](https://msty.app/)*
* *[OpenWebUI](https://github.com/open-webui/open-webui)*
* *[Ollama-chats](https://github.com/drazdra/ollama-chats)*
* *[Ollama Python library](https://github.com/ollama/ollama-python)*
* *[Getting started with Ollama for Python](https://github.com/RamiKrispin/ollama-poc)*
* *[Ollama and Oracle Database 23ai vector search](https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/generate-summary-using-ollama.html)*
* *[Ollama API Usage Examples](https://www.gpu-mart.com/blog/ollama-api-usage-examples)*
* *[Ollama API](https://github.com/ollama/ollama/blob/main/docs/api.md)*
* *[Ollama Structured Outputs](https://ollama.com/blog/structured-outputs)*
* *[Ollama RAG: Reading Assistant Using OLLama for Text Chatting](https://github.com/mtayyab2/RAG)*
* *[RAG with LLaMA Using Ollama: A Deep Dive into Retrieval-Augmented Generation](https://medium.com/@danushidk507/rag-with-llama-using-ollama-a-deep-dive-into-retrieval-augmented-generation-c58b9a1cfcd3)*
* *[Ollama: Embedding Models](https://ollama.com/blog/embedding-models)*
* *[Ollama LLM RAG](https://github.com/digithree/ollama-rag)*
* *[Ollama's new engine for multimodal models](https://ollama.com/blog/multimodal-models)*

[Back to top](#toc)<br>
<br>

## License
Copyright (c) 2025 Oracle and/or its affiliates.

Licensed under the Universal Permissive License (UPL), Version 1.0.

See [LICENSE](LICENSE) for more details.