Commit 1261696

Add LLMs on Kubernetes blog post
1 parent daba2c0 commit 1261696

5 files changed: +329 -1 lines changed

src/assets/blog/chip.jpg

406 KB
Lines changed: 310 additions & 0 deletions
@@ -0,0 +1,310 @@
---
title: Running LLMs on Kubernetes
author: Lena Fuhrimann
date: 2025-11-26
tags: ["cloud", "infrastructure", "storage", "scaling", "serverless"]
excerpt:
  "Learn how to run large language models on Kubernetes, from quick Ollama
  deployments to the high-throughput vLLM Production Stack."
image: ../../assets/blog/chip.jpg
---

Large language models (LLMs) power many modern apps. Chatbots, coding helpers,
and document tools all use them. The question isn't whether you need LLMs, but
how to run them well. Kubernetes helps you deploy and manage these heavy
workloads next to your other services.

Running LLMs on Kubernetes gives you a few benefits. You get a standard way to
deploy them. You can easily manage GPU resources. Furthermore, you can scale up
when demand grows. Most importantly, though, you can keep your data private by
hosting models yourself instead of calling external APIs.

Here, we'll look at two ways to deploy LLMs on Kubernetes. First, we'll cover
**Ollama** for simple setups. Then, we'll explore the **vLLM** Production Stack
for high-traffic scenarios.
## Ollama

[Ollama](https://ollama.ai/) is popular because it's easy to use. You download a
model, and it just works. The
[Ollama Helm Chart](https://github.com/otwld/ollama-helm) brings this same ease
to Kubernetes.
### What You Need

For CPU-only use, you need Kubernetes 1.16.0 or newer. For GPU support with
NVIDIA or AMD cards, you need Kubernetes 1.26.0 or newer.
### How to Install

Add the Helm repository and install:

```bash
helm repo add otwld https://helm.otwld.com/
helm repo update
helm install ollama otwld/ollama --namespace ollama --create-namespace
```
This sets up Ollama with good defaults. The service runs on port `11434`. You
can use the normal Ollama tools to talk to it.

To test your deployment, forward the port to your local machine and run a model:
```bash
kubectl port-forward -n ollama svc/ollama 11434:11434
```
Then, in another terminal, you can interact with Ollama:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```
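The generate endpoint returns a JSON object whose `response` field holds the
generated text. If you have `jq` installed, you can pull that field out
directly (shown here on a shortened sample payload, not real model output):

```shell
# Extract the generated text from an Ollama /api/generate response.
# The sample JSON below stands in for the API's actual output.
echo '{"model":"llama3.2:1b","response":"Rayleigh scattering.","done":true}' \
  | jq -r '.response'
```

Piping the real `curl` output through the same `jq` filter works identically.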
### Adding GPU Support

To use a GPU, create a file called `values.yaml`:

```yaml
ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1
```
80+
Then update the install:
81+
82+
```bash
83+
helm upgrade ollama otwld/ollama --namespace ollama --values values.yaml
84+
```
85+
86+
### Downloading Models Early

Ollama downloads models when you first ask for them. This can be slow. You can
tell it to download models when the pod starts:

```yaml
ollama:
  gpu:
    enabled: true
    type: "nvidia"
    number: 1
  models:
    pull:
      - llama3.2:1b
```
### Making Custom Models

You can also make custom versions of models with different settings:

```yaml
ollama:
  models:
    create:
      - name: llama3.2-1b-large-context
        template: |
          FROM llama3.2:1b
          PARAMETER num_ctx 32768
    run:
      - llama3.2-1b-large-context
```

This creates a version of Llama 3.2 that can handle longer text.
### Opening Access from Outside

To let people reach the API from outside the cluster, add an Ingress:

```yaml
ollama:
  models:
    pull:
      - llama3.2:1b

ingress:
  enabled: true
  hosts:
    - host: ollama.example.com
      paths:
        - path: /
          pathType: Prefix
```

Now you can reach the API at `ollama.example.com`.
### When to Use Ollama

Ollama is great when you want things simple. It works well for getting started
fast, running different models without much setup, or when you don't need to
handle lots of traffic. If you've used Ollama on your laptop, using it on
Kubernetes will feel familiar.
## vLLM

[vLLM](https://github.com/vllm-project/vllm) is built for speed. It uses
performance optimizations like
[Paged Attention](https://huggingface.co/docs/text-generation-inference/en/conceptual/paged_attention),
[Continuous Batching](https://huggingface.co/docs/transformers/main/en/continuous_batching),
and
[Prefix Caching](https://bentoml.com/llm/inference-optimization/prefix-caching)
to handle many requests at once. The
[vLLM Production Stack](https://github.com/vllm-project/production-stack) wraps
vLLM in a Kubernetes-friendly package with routing, monitoring, and caching.
### How It Works

The stack has three main parts. First, serving engines run the LLMs. Second, a
router sends requests to the right place. Third, monitoring tools (Prometheus
and Grafana) show you what's happening.

This setup lets you grow from one instance to many without changing your app
code. The router uses an API that works like OpenAI's, so you can swap it in
easily.
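Because the router mimics OpenAI's wire format, many client libraries and tools
can be pointed at it simply by overriding the standard environment variables
they already read. A sketch, assuming the router is reachable on
`localhost:30080` (for example via a port-forward) and using a dummy API key,
since many clients insist on a non-empty value:

```shell
# Point OpenAI-compatible clients at the vLLM router instead of api.openai.com.
export OPENAI_BASE_URL="http://localhost:30080/v1"
export OPENAI_API_KEY="dummy"  # placeholder; many clients require some value
```

Any tool that honors these variables then talks to your cluster without code
changes.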
### How to Install

Add the Helm repository and install with a config file:

```bash
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f values.yaml
```
A simple `values.yaml` looks like this:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
```
After it's ready, you'll see two pods. One is the router. One runs the model:

```
NAME                                          READY   STATUS    AGE
vllm-deployment-router-859d8fb668-2x2b7       1/1     Running   2m
vllm-llama3-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   2m
```
### Using the API

Forward the router to your machine:

```bash
kubectl port-forward svc/vllm-router-service 30080:80
```
Check which models are available:

```bash
curl http://localhost:30080/v1/models
```
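The models endpoint answers in the OpenAI list format; each entry's `id` is the
exact name you pass as `model` in requests. A `jq` one-liner over a sample
response (illustrative payload, not captured output):

```shell
# List just the model IDs from an OpenAI-style /v1/models response.
echo '{"object":"list","data":[{"id":"meta-llama/Llama-3.2-3B-Instruct","object":"model"}]}' \
  | jq -r '.data[].id'
```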
Send a chat message:

```bash
curl -X POST http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'
```
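The reply uses the OpenAI chat completions shape, so the answer text sits under
`choices[0].message.content`. If you have `jq`, extracting it looks like this
(again shown on an illustrative sample, not real model output):

```shell
# Extract the assistant's answer from an OpenAI-style chat completion.
echo '{"choices":[{"message":{"role":"assistant","content":"Rayleigh scattering."}}]}' \
  | jq -r '.choices[0].message.content'
```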
### Running More Than One Model

You can run different models at the same time. The router sends each request to
the right one:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
    - name: "mistral"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "mistralai/Mistral-7B-Instruct-v0.3"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "24Gi"
      requestGPU: 1
```
### Logging In to Hugging Face

Some models need a Hugging Face account. You can add your token like this:

```yaml
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      env:
        - name: HF_TOKEN
          value: "your-huggingface-token"
```
For real deployments, store the token in a Kubernetes Secret instead.
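A minimal sketch of the Secret approach, assuming the chart passes `env`
entries through to the container spec unchanged; the Secret name `hf-token`
and key `token` are illustrative, not part of the chart's documentation:

```yaml
# Create the Secret once, outside the values file:
#   kubectl create secret generic hf-token --from-literal=token=<your-token>
servingEngineSpec:
  modelSpec:
    - name: "llama3"
      repository: "vllm/vllm-openai"
      tag: "latest"
      modelURL: "meta-llama/Llama-3.2-3B-Instruct"
      replicaCount: 1
      requestCPU: 6
      requestMemory: "16Gi"
      requestGPU: 1
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
```

This keeps the token out of version control; `valueFrom.secretKeyRef` is the
standard Kubernetes way to inject a Secret value as an environment variable.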
### Watching How It Runs

The stack comes with a Grafana dashboard. It shows you how many instances are
healthy, how fast requests finish, how long users wait for the first response,
how many requests are running or waiting, and how much GPU memory the cache
uses. This helps you spot problems and plan for growth.
### When to Use vLLM

Use vLLM and its production stack when you need to handle lots of requests fast.
Its router is smart about reusing cached work, which saves time and money. The
OpenAI-style API makes it easy to plug into existing apps. The monitoring tools
help you run it well in production.
## Wrapping Up

Ollama and vLLM serve different needs.

Ollama with its Helm chart gets you running fast with little setup. It's good
for development, lighter workloads, and teams that want things simple.

The vLLM Production Stack gives you the tools for heavy traffic. The router,
multi-model support, and monitoring make it fit for production, where speed and
uptime matter.

Both use standard Kubernetes and Helm, so they'll feel familiar if you know
containers. Pick based on how much traffic you expect and how much complexity
you're willing to manage.
> “Ollama runs one, vLLM a batch — take your time, pick a match”
>
> Lena F.
Do you want to run your own LLMs on Kubernetes? Check out our
[Pragmatic AI Adoption](/en/services/pragmatic-ai-adoption?utm_source=bespinian_blog&utm_medium=blog&utm_campaign=llms-on-kubernetes)
service and get your free workshop.

src/content/blog/tough-applications.md

Lines changed: 4 additions & 0 deletions
@@ -196,3 +196,7 @@ tolerance) but recover right away (high resilience).
 Understanding these four parts helps you find where your apps are weak. Then you
 can focus on making them stronger. The cloud is not a stable place. Building
 tough apps is not optional.
+
+Do you want to build your own tough applications? Check out our
+[Cloud Native Empowerment](/en/services/cloud-native-empowerment?utm_source=bespinian_blog&utm_medium=blog&utm_campaign=tough-applications)
+service and get your free workshop.

src/layouts/ContentDetailPage.astro

Lines changed: 10 additions & 1 deletion
@@ -211,7 +211,10 @@ const {

   .content-body :global(blockquote) {
     border-left: 4px solid var(--color-primary);
-    padding-left: 1.5rem;
+    background: var(--color-blockquote-bg);
+    padding: 1.5rem;
+    padding-bottom: 0.5rem;
+    font-size: 1.25rem;
     margin: 1.5rem 0;
     color: var(--color-gray);
     font-style: italic;
@@ -250,6 +253,12 @@ const {
     background: #f9f9f9;
   }

+  :global(.content-body img) {
+    max-width: 100%;
+    height: auto;
+    margin: 2rem 0;
+  }
+
   @media (max-width: 48rem) {
     .content-detail {
       padding: var(--spacing-md) 0;

src/pages/[lang]/customers/[...slug].astro

Lines changed: 5 additions & 0 deletions
@@ -179,6 +179,11 @@ const canonicalUrl = new URL(Astro.url.pathname, Astro.site).toString();
     color: var(--color-gray);
   }

+  :global(.content-body img) {
+    max-width: 100%;
+    margin: 2rem 0;
+  }
+
   @media (max-width: 48rem) {
     .results-highlight {
       padding: 1.5rem;
