---
title: Introduction to Self-Hosting
nav_order: 10
parent: Deployment
review_date: 2026-01-28
guide_category:
  - Data centre
guide_banner: self-hosted-gpu.jpg
guide_category_order: 1
guide_description: Overview of GPU options and self-hosting software for
  running TrustGraph locally
guide_difficulty: beginner
guide_time: 10 min
guide_emoji: 🏠
guide_labels:
  - Self-Hosting
  - Introduction
---

# Introduction to Self-Hosting

{% include guide/guide-intro-box.html
  description=page.guide_description
  difficulty=page.guide_difficulty
  duration=page.guide_time
  goal="Get an overview of GPU options for self-hosting and understand the landscape of accessible self-hosting software."
%}

## Running Your Own Models

There are three broad approaches to running LLMs:

**Cloud LLM APIs** (OpenAI, Anthropic, Google, etc.) - You send prompts to a provider's API and pay per token. Simple to use, but your data is processed on their servers and costs scale with usage.

**Rented GPU infrastructure** - You rent GPU time from providers like RunPod, Lambda Labs, or cloud GPU instances. You run the model yourself, controlling the software stack, but on hardware you don't own. Costs are typically hourly or monthly.

**Self-hosting on your own hardware** - You own the GPUs and run everything on-premises. The cost model is capital expenditure on equipment plus electricity, rather than ongoing rental or per-token fees.

This guide focuses on the last two approaches - running models yourself rather than using hosted APIs. The key benefits:

- **Data control** - your prompts and documents stay on infrastructure you control
- **Predictable costs** - no per-token charges; costs are fixed (owned hardware) or time-based (rented)
- **Flexibility** - run any model, configure it how you want

Rented GPU infrastructure gives you most of these benefits without the upfront hardware investment. True self-hosting on owned hardware makes sense when you have consistent, long-term workloads where the capital cost pays off.

Most users run quantized models to reduce memory requirements. Quantization (Q4, Q8, etc.) compresses model weights, allowing larger models to fit on consumer GPUs with modest VRAM. All the tools covered in this guide support quantized models.
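
To make the memory savings concrete, here is a rough back-of-the-envelope calculation. It only counts the weights themselves - the KV cache and runtime overhead add more - so treat the numbers as indicative rather than exact.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# Rough figures for an 8-billion-parameter model:
print(f"FP16: {weights_gib(8, 16):.1f} GiB")  # ~14.9 GiB - needs a 16 GB+ card
print(f"Q8:   {weights_gib(8, 8):.1f} GiB")   # ~7.5 GiB
print(f"Q4:   {weights_gib(8, 4):.1f} GiB")   # ~3.7 GiB - fits on an 8 GB card
```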

## GPU Options

Running LLMs efficiently requires GPU acceleration. Here are the main options:

### NVIDIA (CUDA)

The most widely supported option. Nearly all LLM software supports NVIDIA GPUs out of the box. Consumer cards (RTX 3090, 4090) work well for smaller models. Data centre cards (A100, H100) handle larger models and higher throughput.

**Pros:** Best software compatibility, mature ecosystem
**Cons:** Premium pricing, supply constraints on high-end cards

### AMD (ROCm)

AMD GPUs offer competitive performance at lower prices. ROCm (AMD's GPU compute platform) support has improved significantly. Cards like the RX 7900 XTX or Instinct MI series work with vLLM and other inference servers.

**Pros:** Better price/performance, improving software support
**Cons:** Narrower software compatibility than NVIDIA

### Intel Arc

Intel Arc GPUs are an accessible option for self-hosting. Lower cost than
equivalent NVIDIA cards, and modest power requirements make them easier to
host - no need for specialist cooling or high-current power supplies. Software
support is maturing, with TGI and other inference servers now supporting Arc.

**Pros:** Lower cost, reasonable power draw, easy to host
**Cons:** Smaller ecosystem than NVIDIA, still maturing

Intel's Gaudi accelerators are a separate, specialist option for data centre deployments - purpose-built for AI but not widely available outside cloud services.

### Google TPUs

Tensor Processing Units are Google's custom AI accelerators. Available through Google Cloud or as Edge TPUs for embedded use. Not typically used for on-premises self-hosting.

**Pros:** Excellent performance for supported models
**Cons:** Cloud-only for most use cases

## Self-Hosting Software

This section covers the most accessible options for running LLMs
locally. TrustGraph has direct integrations for Ollama, vLLM, llama.cpp, TGI,
and LM Studio. TrustGraph also supports the OpenAI API, which is commonly
exposed by other self-hosting tools - so if your preferred option isn't
listed, there's a good chance it still works.
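
To illustrate why OpenAI API compatibility is useful, the sketch below talks to a local inference server through the standard `openai` Python client. The base URL, API key, and model name are placeholders - substitute whatever your server exposes.

```python
from openai import OpenAI

# Point the client at a local, OpenAI-compatible endpoint rather than
# api.openai.com. Most local servers ignore the API key entirely.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder - your server's address
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="local-model",  # placeholder - whatever model your server is serving
    messages=[{"role": "user", "content": "Say hello from a self-hosted model."}],
)
print(response.choices[0].message.content)
```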

### Ollama

The easiest way to get started. Ollama is a lightweight tool that handles
model downloads, GPU detection, and serving - all through a simple
interface. It manages the complexity of model formats and GPU configuration
for you. Available for Linux, macOS, and Windows. Supports NVIDIA, AMD (ROCm),
and Apple Metal GPUs. Uses GGUF model format. MIT licence.

**Best for:** Getting started quickly, simple setups, learning

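As a quick illustration, Ollama's HTTP API listens on port 11434 by default. The sketch below assumes you have already pulled a model; the model name is just an example.

```python
import requests

# Ask a locally running Ollama instance for a single, non-streamed completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # example name - use whichever model you've pulled
        "prompt": "Summarise what a knowledge graph is.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```
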
### llama.cpp / llamafile / llama-server

These are related projects built on the same foundation. **llama.cpp** is a
C++ library that runs LLMs efficiently with minimal
dependencies. **llamafile** packages a model and the runtime into a single
executable - download one file and run it, nothing to
install. **llama-server** is an HTTP server that exposes an OpenAI-compatible
API.

The llama.cpp ecosystem is lightweight and portable. It works on CPU (slower
but no GPU required) or GPU (fast). A good choice when you want something
minimal with few dependencies. Has the broadest GPU support: NVIDIA, AMD
(ROCm), Apple Metal, and Vulkan for other GPUs. Uses GGUF model format. MIT
licence.

**Best for:** Lightweight deployments, CPU fallback, maximum portability

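If you prefer to drive llama.cpp from Python rather than over HTTP, the community `llama-cpp-python` bindings (a separate project, mentioned here only as an illustration) wrap the same library. A minimal sketch, assuming you have a GGUF file on disk:

```python
from llama_cpp import Llama

# Load a quantized GGUF model; the path below is a placeholder.
llm = Llama(model_path="./models/example-q4_k_m.gguf", n_ctx=4096)

out = llm("Q: What is self-hosting? A:", max_tokens=128)
print(out["choices"][0]["text"])
```
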
### vLLM

A high-performance inference server built for production workloads. vLLM
implements optimisations like continuous batching (processing multiple
requests efficiently) and PagedAttention (better memory management) to
maximise throughput. If you need to serve many concurrent users or process
high volumes, vLLM is designed for that.

NVIDIA support is in the main distribution. AMD/ROCm support was tracked as a
fork but is being merged into the main distribution. Intel support is
available via Intel-maintained forks. Uses safetensors/Hugging Face model
format. Apache 2.0 licence.

**Best for:** Production deployments, high throughput, concurrent users

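For a feel of the API, here is a minimal sketch of vLLM's offline (in-process) interface, which batches prompts for throughput. The model name is an example only, and this assumes a supported GPU is available.

```python
from vllm import LLM, SamplingParams

# Load a model and generate completions for a small batch of prompts in one call.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model name
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain continuous batching in one sentence.",
    "Explain PagedAttention in one sentence.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```
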
### Text Generation Inference (TGI)

Hugging Face's production inference server. Similar goals to vLLM - optimised
for throughput and low latency. Tight integration with the Hugging Face model
hub makes it easy to deploy models hosted there. NVIDIA support in the main
distribution; Intel GPU support via Intel's contributions. Uses
safetensors/Hugging Face model format. Apache 2.0 licence.

**Best for:** Hugging Face ecosystem, production deployments, Intel hardware

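TGI exposes a simple HTTP generation endpoint. A rough sketch, assuming a TGI server is already running locally on port 8080 (adjust the port and parameters to match however you launched it):

```python
import requests

# Request a short completion from a locally running TGI server.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Text Generation Inference?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=120,
)
print(resp.json()["generated_text"])
```
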
### LM Studio

A desktop application with a graphical interface. You can browse and download
models from a built-in catalogue, adjust parameters like temperature and
context length, and run a local server - all without touching the command
line. Supports NVIDIA, AMD, and Apple Metal GPUs. Uses GGUF model format. Free
for personal use; commercial use requires a paid licence.

**Best for:** Users who prefer a GUI, experimentation, non-technical users

### A note on performance

Performance of self-hosted inference is heavily affected by GPU-specific
optimisations. vLLM and TGI have the most mature support for these, which is
why they're the preferred choices for production deployments.