llama-server — Local LLM Inference on Kubernetes

Serves Qwen3.5-35B-A3B (Q8_0, 36.9GB) via llama.cpp's llama-server on Kubernetes with multi-GPU support. Provides an OpenAI-compatible API for use with Claude Code, Cursor, opencode, or any chat client.

Architecture

Claude Code / Cursor / Browser UI
            ↓
   http://localhost:30080/v1  (NodePort)
            ↓
┌──────────────────────────────────────────────┐
│  K8s Pod: llama-server                       │
│                                              │
│  Model: Qwen3.5-35B-A3B (Q8_0, 36.9GB)       │
│  GPU 0: RTX 5090  - 28.2 GB  (80% layers)    │
│  GPU 1: RTX 2070S -  6.5 GB  (20% layers)    │
│  API: OpenAI-compatible /v1/chat/completions │
│  Speed: ~75 tok/s generation                 │
└──────────────────────────────────────────────┘

Why This Stack

Qwen3.5-35B-A3B: Mixture-of-experts (MoE) model with 35B total parameters and about 3B active per token, which gives strong quality per FLOP at this size.
Q8_0 quantization: Near-lossless 8-bit quantization. The 36.9GB model fits across the two GPUs with room left for the KV cache.
llama-server: Lightweight single binary with native multi-GPU tensor splitting, an OpenAI-compatible API, and a built-in chat UI.
Multi-GPU: --tensor-split 32,8 distributes layers in proportion to VRAM across the RTX 5090 (32GB) and the RTX 2070 SUPER (8GB).
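
To make the split ratio concrete, here is a minimal sketch that normalizes a --tensor-split value into per-GPU fractions. The helper name `split_fractions` is illustrative only and is not part of llama.cpp; it simply mirrors the proportional arithmetic described above.

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Normalize a comma-separated --tensor-split string into fractions that sum to 1."""
    weights = [float(part) for part in tensor_split.split(",")]
    total = sum(weights)
    if total <= 0:
        raise ValueError("tensor split weights must sum to a positive value")
    return [w / total for w in weights]


if __name__ == "__main__":
    gpus = ("RTX 5090", "RTX 2070 SUPER")
    for gpu, share in zip(gpus, split_fractions("32,8")):
        print(f"{gpu}: {share:.0%} of layers")
```

With the value 32,8 used here, the fractions come out to 0.8 and 0.2, matching the 80% / 20% layer split shown in the architecture diagram.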

Files

| File | Purpose |
| --- | --- |
| `Dockerfile` | Builds llama.cpp with CUDA from source |
| `deployment.yaml` | K8s Deployment + NodePort Service |
| `.env` | GPU UUIDs (gitignored) |
| `apply.sh` | Deploys with envsubst for GPU pinning |

Setup

1. Build & Push Docker Image

cd workloads/llama-server
docker build -t localhost:5000/llama-server:latest .
docker push localhost:5000/llama-server:latest

2. Download Model

mkdir -p /home/akshay/llama-workspace/models
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
    'unsloth/Qwen3.5-35B-A3B-GGUF',
    local_dir='/home/akshay/llama-workspace/models/Qwen3.5-35B-A3B-GGUF',
    allow_patterns=['*Q8_0*']
)
"

3. Configure `.env`

# .env
GPU_UUID=<5090-uuid>,<2070s-uuid>

4. Deploy

./apply.sh deployment.yaml
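
The contents of apply.sh are not shown here, so the following is only a sketch of the substitution step it presumably performs: load KEY=VALUE pairs from .env and replace `${GPU_UUID}` placeholders in the manifest before it is applied. The function names and the manifest fragment (including the `NVIDIA_VISIBLE_DEVICES` variable) are assumptions for illustration, not taken from the repository.

```python
from string import Template


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def render_manifest(template: str, env: dict[str, str]) -> str:
    """Replace ${VAR} placeholders with values from env.

    Unlike envsubst, which replaces unset variables with an empty string,
    safe_substitute leaves unknown placeholders untouched.
    """
    return Template(template).safe_substitute(env)


if __name__ == "__main__":
    env = parse_env("# .env\nGPU_UUID=GPU-aaaa,GPU-bbbb\n")
    # Hypothetical manifest fragment; the real deployment.yaml may differ.
    fragment = 'env:\n  - name: NVIDIA_VISIBLE_DEVICES\n    value: "${GPU_UUID}"\n'
    print(render_manifest(fragment, env))
```

Keeping the UUIDs in a gitignored .env and substituting them at deploy time means the committed manifest never contains machine-specific GPU identifiers.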

5. Verify

# Check pod
kubectl logs -f deployment/llama-server

# Test API
curl http://localhost:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen3.5-35B-A3B-Q8_0.gguf","messages":[{"role":"user","content":"Hello!"}]}'
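
The same check can be scripted. Below is a minimal sketch using only the Python standard library; it assumes the NodePort URL and model name from this README, and that the response follows the usual OpenAI chat completion shape (`choices[0].message.content`). The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:30080/v1"  # NodePort from this README
MODEL = "Qwen3.5-35B-A3B-Q8_0.gguf"


def build_chat_request(prompt: str, base_url: str = BASE_URL, model: str = MODEL) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the pod running, `print(chat("Hello!"))` should print the model's reply. The request is built separately from the network call so the payload can be inspected without a live server.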

Usage

Browser UI

Open http://localhost:30080 in a browser; llama-server includes a built-in chat interface.

SSH Port Forward (Remote Access)

ssh -p 4224 -L 30080:localhost:30080 your_username@devserver.zosma.ai
# Then open http://localhost:30080 in your browser

Claude Code

Add the following to ~/.claude/settings.json to prevent KV cache invalidation between requests:

{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}

Then run:

ANTHROPIC_BASE_URL="http://localhost:30080" \
ANTHROPIC_API_KEY="sk-no-key-required" \
claude --model Qwen3.5-35B-A3B-Q8_0.gguf

Cursor / opencode / Any OpenAI Client

Set the base URL to http://localhost:30080/v1 and the API key to any non-empty string.
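
Many clients also ask for a model name. One way to find the id the server reports is to query the OpenAI-style model listing endpoint; the sketch below assumes llama-server answers at `/v1/models` with the standard `{"data": [{"id": ...}]}` shape. The helper names and the placeholder key are illustrative.

```python
import json
import urllib.request


def build_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the OpenAI-style model listing endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def model_ids(listing: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model listing."""
    return [entry["id"] for entry in listing.get("data", [])]


def list_models(
    base_url: str = "http://localhost:30080/v1",
    api_key: str = "sk-no-key-required",  # any non-empty string
) -> list[str]:
    """Query the server and return the model ids it reports."""
    request = build_models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return model_ids(json.load(resp))
```

Calling `list_models()` against the running pod returns the ids to enter in the client's model field.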

Key Parameters

| Parameter | Value | Notes |
| --- | --- | --- |
| `--ctx-size` | 16384 | Reduced from 32K to fit VRAM budget |
| `--cache-type-k` | q8_0 | Quantized KV cache to save VRAM |
| `--cache-type-v` | q8_0 | Same |
| `--tensor-split` | 32,8 | VRAM ratio: 5090 gets 80%, 2070S gets 20% |
| `-ngl` | 99 | Offload all layers to GPU |
| `--temp` | 0.6 | Default sampling temperature |
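
To see why the context size and cache type matter for VRAM, here is a rough back-of-the-envelope estimator. It assumes every layer is a standard attention layer storing one K and one V vector per token; hybrid architectures will differ. The layer and head counts in the example are hypothetical placeholders, not this model's actual configuration. The byte sizes reflect the f16 format (2 bytes per value) and the q8_0 block layout (32 int8 values plus one fp16 scale, 34 bytes per block).

```python
# Bytes per element for two llama.cpp cache types: f16 is 2 bytes;
# q8_0 packs 32 int8 values plus one fp16 scale into a 34-byte block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(ctx_size, n_layers, n_kv_heads, head_dim, type_k="f16", type_v="f16"):
    """Estimate KV cache size: one K and one V vector per layer, per token."""
    elements_per_token = n_layers * n_kv_heads * head_dim
    bytes_per_pair = BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    return ctx_size * elements_per_token * bytes_per_pair


if __name__ == "__main__":
    # HYPOTHETICAL shape, chosen only for illustration; not this model's real config.
    shape = dict(n_layers=40, n_kv_heads=4, head_dim=128)
    for ctx in (32768, 16384):
        for cache_type in ("f16", "q8_0"):
            size = kv_cache_bytes(ctx, type_k=cache_type, type_v=cache_type, **shape)
            print(f"ctx={ctx:>5} cache={cache_type:<4} ~{size / 2**30:.2f} GiB")
```

Under these assumptions, halving the context halves the cache, and switching both K and V from f16 to q8_0 cuts it by roughly 47%, independent of the model shape.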
