
Commit c1f53b0

Self hosting guide (#32)
1 parent a1923c6 commit c1f53b0

7 files changed: +247 -1 lines changed

deployment/amd-gpu.jpg (182 KB)

deployment/amd-gpu.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
---
title: AMD GPU
nav_order: 11
parent: Deployment
review_date: 2026-01-28
guide_category:
- Data centre
guide_banner: amd-gpu.jpg
guide_category_order: 2
guide_description: Self-hosting TrustGraph with vLLM on AMD GPU accelerators
guide_difficulty: advanced
guide_time: 2 - 4 hr
guide_emoji: 🔴
guide_labels:
- AMD
- GPU
- vLLM
todo: true
todo_notes: Placeholder page - content to be added.
published: false
---

# Self-Hosting with vLLM and AMD GPU

{% capture requirements %}
<ul style="margin: 0; padding-left: 20px;">
<li>AMD GPU with ROCm support (e.g., RX 7900 XTX, MI series)</li>
<li>ROCm drivers installed</li>
<li>Docker or Podman with GPU passthrough configured</li>
<li>Python {{site.data.software.python-min-version}}+ for CLI tools</li>
<li>Basic command-line familiarity</li>
</ul>
{% endcapture %}

{% include guide/guide-intro-box.html
description=page.guide_description
difficulty=page.guide_difficulty
duration=page.guide_time
you_will_need=requirements
goal="Deploy TrustGraph with vLLM running on AMD GPU hardware for high-performance local inference."
%}

deployment/intel-ai.jpg (191 KB)

Lines changed: 2 additions & 1 deletion
@@ -5,7 +5,7 @@ parent: Deployment
 review_date: 2025-12-02
 guide_category:
 - Data centre
-guide_category_order: 1
+guide_category_order: 3
 guide_description: High-performance AI deployment with Intel Gaudi and GPU accelerators for large models
 guide_difficulty: advanced
 guide_time: 5 - 10 hr
@@ -18,6 +18,7 @@ guide_labels:
 todo: true
 todo_notes: This is a holding page. Work on Intel GPU integration is ongoing.
 Come discuss on the Discord if you're exploring using this as a baseline.
+published: false
 ---

 # Intel Gaudi Cloud Deployment

deployment/intel-gpu.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
---
title: Intel GPU
nav_order: 11
parent: Deployment
review_date: 2026-01-28
guide_category:
- Data centre
guide_banner: intel-ai.jpg
guide_category_order: 2
guide_description: Self-hosting TrustGraph with vLLM on Intel GPU accelerators
guide_difficulty: advanced
guide_time: 2 - 4 hr
guide_emoji: 🔴
guide_labels:
- GPU
- vLLM
- Intel Arc
todo: true
todo_notes: Placeholder page - content to be added.
---

# Self-Hosting with vLLM and Intel GPU

{% capture requirements %}
<ul style="margin: 0; padding-left: 20px;">
<li>Intel GPU with ??? support (e.g., Intel Arc B60)</li>
<li>Intel GPU drivers installed</li>
<li>Docker or Podman with GPU passthrough configured</li>
<li>Python {{site.data.software.python-min-version}}+ for CLI tools</li>
<li>Basic command-line familiarity</li>
</ul>
{% endcapture %}

{% include guide/guide-intro-box.html
description=page.guide_description
difficulty=page.guide_difficulty
duration=page.guide_time
you_will_need=requirements
goal="Deploy TrustGraph with vLLM running on Intel GPU hardware for high-performance local inference."
%}

deployment/self-hosted-gpu.jpg (67.3 KB)

deployment/self-hosting-intro.md

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
---
title: Introduction to Self-Hosting
nav_order: 10
parent: Deployment
review_date: 2026-01-28
guide_category:
- Data centre
guide_banner: self-hosted-gpu.jpg
guide_category_order: 1
guide_description: Overview of GPU options and self-hosting software for running TrustGraph locally
guide_difficulty: beginner
guide_time: 10 min
guide_emoji: 🏠
guide_labels:
- Self-Hosting
- Introduction
---

# Introduction to Self-Hosting

{% include guide/guide-intro-box.html
description=page.guide_description
difficulty=page.guide_difficulty
duration=page.guide_time
goal="Get an overview of GPU options for self-hosting and understand the landscape of accessible self-hosting software."
%}

## Running Your Own Models

There are three broad approaches to running LLMs:

**Cloud LLM APIs** (OpenAI, Anthropic, Google, etc.) - You send prompts to a provider's API and pay per token. Simple to use, but your data is processed on their servers and costs scale with usage.

**Rented GPU infrastructure** - You rent GPU time from providers like RunPod, Lambda Labs, or cloud GPU instances. You run the model yourself, controlling the software stack, but on hardware you don't own. Costs are typically hourly or monthly.

**Self-hosting on your own hardware** - You own the GPUs and run everything on-premises. The cost model is capital expenditure on equipment plus electricity, rather than ongoing rental or per-token fees.

This guide focuses on the last two approaches - running models yourself rather than using hosted APIs. The key benefits:

- **Data control** - your prompts and documents stay on infrastructure you control
- **Predictable costs** - no per-token charges; costs are fixed (owned hardware) or time-based (rented)
- **Flexibility** - run any model, configure it how you want

Rented GPU infrastructure gives you most of these benefits without the upfront hardware investment. True self-hosting on owned hardware makes sense when you have consistent, long-term workloads where the capital cost pays off.

Most users run quantized models to reduce memory requirements. Quantization (Q4, Q8, etc.) compresses model weights, allowing larger models to fit on consumer GPUs with modest VRAM. All the tools covered in this guide support quantized models.
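As a rough back-of-envelope illustration of why quantization matters, the sketch below estimates memory for the model weights only; KV cache and runtime overhead add more on top, so treat the numbers as approximate.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Illustrative only: real usage also includes KV cache and runtime overhead.
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = gigabytes (roughly)
    return params_billions * bytes_per_weight

print(weight_memory_gb(7, 16))   # ~14 GB: a 7B model at 16-bit precision
print(weight_memory_gb(7, 4.5))  # ~4 GB: the same model with a Q4-style quantization
```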
## GPU Options

Running LLMs efficiently requires GPU acceleration. Here are the main options:

### NVIDIA (CUDA)

The most widely supported option. Nearly all LLM software supports NVIDIA GPUs out of the box. Consumer cards (RTX 3090, 4090) work well for smaller models. Data centre cards (A100, H100) handle larger models and higher throughput.

**Pros:** Best software compatibility, mature ecosystem
**Cons:** Premium pricing, supply constraints on high-end cards

### AMD (ROCm)

AMD GPUs offer competitive performance at lower prices. ROCm (AMD's GPU compute platform) support has improved significantly. Cards like the RX 7900 XTX or Instinct MI series work with vLLM and other inference servers.

**Pros:** Better price/performance, improving software support
**Cons:** Narrower software compatibility than NVIDIA

### Intel Arc

Intel Arc GPUs are an accessible option for self-hosting. They cost less than equivalent NVIDIA cards, and their modest power requirements make them easier to host - no need for specialist cooling or high-current power supplies. Software support is maturing, with TGI and other inference servers now supporting Arc.

**Pros:** Lower cost, reasonable power draw, easy to host
**Cons:** Smaller ecosystem than NVIDIA, still maturing

Intel's Gaudi accelerators are a separate, specialist option for data centre deployments - purpose-built for AI but not widely available outside cloud services.

### Google TPUs

Tensor Processing Units are Google's custom AI accelerators. Available through Google Cloud or as Edge TPUs for embedded use. Not typically used for on-premises self-hosting.

**Pros:** Excellent performance for supported models
**Cons:** Cloud-only for most use cases
## Self-Hosting Software

This section covers the most accessible options for running LLMs locally. TrustGraph has direct integrations for Ollama, vLLM, llama.cpp, TGI, and LM Studio. TrustGraph also supports the OpenAI API, which is commonly exposed by other self-hosting tools - so if your preferred option isn't listed, there's a good chance it will still work.
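As a quick illustration of that common interface, the sketch below talks to a locally hosted, OpenAI-compatible endpoint using the `openai` Python client. The base URL, port, and model name here are assumptions - substitute whatever your own server actually exposes.

```python
# Minimal sketch: calling a locally hosted, OpenAI-compatible server.
# The base_url and model name are placeholders - for example, Ollama
# typically serves an OpenAI-compatible API at http://localhost:11434/v1
# and vLLM at http://localhost:8000/v1. Adjust to your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed local endpoint
    api_key="not-needed",                  # local servers usually ignore this
)

response = client.chat.completions.create(
    model="llama3.1",  # whatever model your server has loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```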
### Ollama

The easiest way to get started. Ollama is a lightweight tool that handles model downloads, GPU detection, and serving - all through a simple interface. It manages the complexity of model formats and GPU configuration for you. Available for Linux, macOS, and Windows. Supports NVIDIA, AMD (ROCm), and Apple Metal GPUs. Uses GGUF model format. MIT licence.

**Best for:** Getting started quickly, simple setups, learning

### llama.cpp / llamafile / llama-server

These are related projects built on the same foundation. **llama.cpp** is a C++ library that runs LLMs efficiently with minimal dependencies. **llamafile** packages a model and the runtime into a single executable - download one file and run it, nothing to install. **llama-server** is an HTTP server that exposes an OpenAI-compatible API.

The llama.cpp ecosystem is lightweight and portable. It works on CPU (slower but no GPU required) or GPU (fast). A good choice when you want something minimal with few dependencies. Has the broadest GPU support: NVIDIA, AMD (ROCm), Apple Metal, and Vulkan for other GPUs. Uses GGUF model format. MIT licence.

**Best for:** Lightweight deployments, CPU fallback, maximum portability
### vLLM

A high-performance inference server built for production workloads. vLLM implements optimisations like continuous batching (processing multiple requests efficiently) and PagedAttention (better memory management) to maximise throughput. If you need to serve many concurrent users or process high volumes, vLLM is designed for that.

NVIDIA support is in the main distribution. AMD/ROCm support was previously maintained in a fork but is being merged into the main distribution. Intel support is available via Intel-maintained forks. Uses safetensors/Hugging Face model format. Apache 2.0 licence.

**Best for:** Production deployments, high throughput, concurrent users
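To illustrate the concurrency point, here is a sketch that fires several requests at a local vLLM server at once, letting continuous batching interleave them. The endpoint and model name are assumptions; adjust them to match your deployment.

```python
# Sketch: several simultaneous requests against a local vLLM server's
# OpenAI-compatible endpoint, so continuous batching can interleave them.
# The base_url and model name below are assumptions, not fixed values.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

questions = [f"Summarise topic {i} in one sentence." for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```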
### Text Generation Inference (TGI)

Hugging Face's production inference server. Similar goals to vLLM - optimised for throughput and low latency. Tight integration with the Hugging Face model hub makes it easy to deploy models hosted there. NVIDIA support is in the main distribution; Intel GPU support via Intel's contributions. Uses safetensors/Hugging Face model format. Apache 2.0 licence.

**Best for:** Hugging Face ecosystem, production deployments, Intel hardware

### LM Studio

A desktop application with a graphical interface. You can browse and download models from a built-in catalogue, adjust parameters like temperature and context length, and run a local server - all without touching the command line. Supports NVIDIA, AMD, and Apple Metal GPUs. Uses GGUF model format. Free for personal use; commercial use requires a paid licence.

**Best for:** Users who prefer a GUI, experimentation, non-technical users

### A note on performance

Performance of self-hosted inference is heavily affected by GPU-specific optimisations. vLLM and TGI have the most mature support for these optimisations, which is why they're the preferred choice for production deployments.
