| Field | Value |
|---|---|
| tags | |
| category | project |
| type | implementation_guide |
| status | planning |
| priority | high |
| location | at_home |
| project_type | career |
| created | 2025-11-26 |
| updated | 2025-12-20 |
| repo_name | llm-cloud-inference |
| milestones | |
| success_metrics | |
| related_knowledge | |
Part of: Career Development → AI/ML Infrastructure Skills
Goal: Deploy a self-hosted Large Language Model with OpenAI-compatible API endpoints on Google Cloud Run, optimized for cost-efficiency and internet accessibility. Benchmark latency against commercial APIs.
This project builds hands-on experience with:
- LLM Deployment: Production-grade model serving with modern inference engines
- Cloud Infrastructure: Google Cloud Run with GPU support
- API Design: OpenAI-compatible endpoints
- Cost Optimization: Serverless architecture with scale-to-zero
- Performance Engineering: Latency benchmarking and optimization

Why this project matters:
- High-Demand Skills: LLM deployment is a critical skill for AI/ML engineers
- Portfolio Project: Demonstrates end-to-end infrastructure capabilities
- Practical Experience: Real-world cost optimization and performance tuning
- Interview Ready: Concrete experience to discuss in technical interviews
- Benchmarking Expertise: Understanding latency trade-offs is key for production systems
Rationale: Clear, descriptive, and follows conventions for cloud-deployed ML projects.
llm-cloud-inference/
├── README.md # Project overview and quick start
├── LICENSE # MIT or Apache 2.0
├── .gitignore
├── .github/
│ └── workflows/
│ ├── build.yml # CI: Build and push container
│ └── deploy.yml # CD: Deploy to Cloud Run
├── docker/
│ ├── Dockerfile # Main container definition
│ ├── Dockerfile.dev # Local development container
│ └── startup.sh # Container startup script
├── src/
│ ├── __init__.py
│ ├── config.py # Configuration management
│ ├── health.py # Health check endpoints
│ └── middleware.py # API key auth, rate limiting
├── scripts/
│ ├── upload_model.sh # Upload model to GCS
│ ├── deploy.sh # Deploy to Cloud Run
│ ├── benchmark_latency.py # Latency comparison script
│ └── download_and_quantize.py # Model preparation
├── benchmarks/
│ ├── results/ # Benchmark output files
│ └── report_template.md # Benchmark report template
├── terraform/ # Optional: IaC for GCP resources
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── docs/
│ ├── ARCHITECTURE.md # Detailed architecture decisions
│ ├── DEPLOYMENT.md # Deployment guide
│ └── BENCHMARKS.md # Benchmark methodology
├── requirements.txt # Python dependencies
├── pyproject.toml # Project metadata
└── Makefile # Common commands
# Initialize repository
git init llm-cloud-inference
cd llm-cloud-inference
git remote add origin git@github.com:YOUR_USERNAME/llm-cloud-inference.git
# Branch strategy
main # Production-ready code
develop # Integration branch
feature/*        # Feature branches

Based on Awesome-local-LLM research:
| Engine | Throughput | Memory Efficiency | OpenAI Compatible | Cloud Run Ready | Best For |
|---|---|---|---|---|---|
| vLLM | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ Native | ✅ Yes | Production serving |
| sglang | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Yes | ✅ Yes | Multi-modal, fast |
| llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ Via wrapper | ✅ Yes | CPU/Low memory |
| Nano-vLLM | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ Yes | ✅ Yes | Minimal footprint |
| Ollama | ⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ Yes | ✅ Yes | Local dev |
vLLM Advantages:
- PagedAttention for optimal memory usage
- Continuous batching for high throughput
- Native OpenAI-compatible API
- Active community (PyTorch Foundation project as of May 2025)
- Well-documented Cloud Run deployment
sglang Alternative:
- Faster for certain workloads
- Better multi-modal support
- Good for vision-language models
If prioritizing...
├── Maximum throughput → vLLM
├── Vision + Language → sglang
├── Minimal memory → llama.cpp (GGUF)
├── Simplest setup → Ollama
└── Cutting-edge features → sglang
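The decision tree above can be sketched as a simple lookup, useful when scripting environment setup (an illustrative mapping that mirrors the table, not any official API):

```python
# Map a deployment priority to a serving engine, mirroring the
# decision tree above (illustrative only).
ENGINE_BY_PRIORITY = {
    "max_throughput": "vLLM",
    "vision_language": "sglang",
    "minimal_memory": "llama.cpp",  # GGUF quantized
    "simplest_setup": "Ollama",
    "cutting_edge": "sglang",
}

def pick_engine(priority: str) -> str:
    """Return the suggested engine for a priority, defaulting to vLLM."""
    return ENGINE_BY_PRIORITY.get(priority, "vLLM")

print(pick_engine("minimal_memory"))  # llama.cpp
```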
Based on Awesome-local-LLM model collections:
| GPU | VRAM | Recommended Models | Quantization |
|---|---|---|---|
| T4 | 16GB | Phi-4-mini, Gemma-3-4B, Qwen3-4B | 4-bit |
| L4 | 24GB | Qwen3-8B, Gemma-3-12B, Llama-3.1-8B | 4-bit or 8-bit |
| A100 40GB | 40GB | Qwen3-32B, Llama-3.1-70B | 4-bit |
| A100 80GB | 80GB | Full precision for most models | FP16 |
General Purpose (Cost-Optimized):
Primary: Qwen3-8B-Instruct (Q4_K_M)
- Excellent reasoning
- Multilingual
- Fits on L4 with room for context
Alternative: Gemma-3-12B-Instruct
- Google's latest
- Strong coding capabilities
- Efficient architecture

Code-Focused:
Primary: Qwen3-Coder-8B (Q4_K_M)
- Purpose-built for code
- Strong at agentic tasks
Alternative: Devstral-Small
- Mistral's code model
- Good at software engineering

Lightweight/Fast:
Primary: Phi-4-mini-instruct
- Microsoft's efficient model
- Surprisingly capable
- Very fast inference

#!/usr/bin/env python3
# scripts/download_and_quantize.py
"""Download and optionally quantize models for deployment."""
import os
from huggingface_hub import snapshot_download
from pathlib import Path
# Configuration
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct" # Or your chosen model
LOCAL_DIR = Path("./models")
QUANTIZATION = "awq" # Options: none, awq, gptq, gguf
def download_model():
"""Download model from Hugging Face."""
print(f"Downloading {MODEL_ID}...")
snapshot_download(
repo_id=MODEL_ID,
local_dir=LOCAL_DIR / MODEL_ID.split("/")[-1],
local_dir_use_symlinks=False,
)
print("Download complete!")
def get_quantized_model():
"""Get pre-quantized model if available."""
# Many models have pre-quantized versions
quantized_id = f"{MODEL_ID}-AWQ" # or -GPTQ
print(f"Downloading pre-quantized {quantized_id}...")
snapshot_download(
repo_id=quantized_id,
local_dir=LOCAL_DIR / quantized_id.split("/")[-1],
local_dir_use_symlinks=False,
)
if __name__ == "__main__":
    download_model()

| Format | Size Reduction | Quality Loss | vLLM Support | Best For |
|---|---|---|---|---|
| FP16 | Baseline | None | ✅ | Maximum quality |
| INT8 | 50% | Minimal | ✅ | Balanced |
| AWQ 4-bit | 75% | Low | ✅ | Production |
| GPTQ 4-bit | 75% | Low | ✅ | Production |
| GGUF Q4_K_M | 75% | Low | ⚠️ Experimental | CPU/Edge |
Why AWQ:
- Activation-aware: preserves important weights
- Excellent quality retention at 4-bit
- Native vLLM support
- Fast inference
Pre-quantized Models (use these to skip quantization):
- Qwen/Qwen2.5-7B-Instruct-AWQ
- TheBloke/Llama-2-7B-Chat-AWQ
- Most popular models have AWQ versions on HuggingFace
# Approximate VRAM calculation
def estimate_vram_gb(params_billions: float, precision: str) -> float:
"""Estimate VRAM needed for a model."""
bytes_per_param = {
"fp32": 4,
"fp16": 2,
"int8": 1,
"int4": 0.5,
}
base_vram = params_billions * bytes_per_param[precision]
# Add ~20% overhead for KV cache, activations
return base_vram * 1.2
# Examples:
# 7B model at FP16: 7 * 2 * 1.2 = 16.8 GB
# 7B model at INT4: 7 * 0.5 * 1.2 = 4.2 GB

┌─────────────────────────────────────────────────────────────────┐
│ Your Local Machine │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Your Application / Benchmark Script │ │
│ │ - Python with openai SDK │ │
│ │ - Latency measurement instrumentation │ │
│ └──────────────────┬───────────────────────────────────────┘ │
└─────────────────────┼───────────────────────────────────────────┘
│
│ HTTPS Request (OpenAI-compatible)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Google Cloud Platform (GCP) │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Cloud Run Service │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Container Instance │ │ │
│ │ │ ┌────────────────────────────────────────────────┐ │ │ │
│ │ │ │ vLLM Server │ │ │ │
│ │ │ │ - OpenAI-compatible /v1/chat/completions │ │ │ │
│ │ │ │ - /v1/completions │ │ │ │
│ │ │ │ - /v1/models │ │ │ │
│ │ │ │ - /health (for Cloud Run probes) │ │ │ │
│ │ │ └────────────────────────────────────────────────┘ │ │ │
│ │ │ GPU: NVIDIA L4 (24GB VRAM) │ │ │
│ │ │ Model: Qwen2.5-7B-Instruct-AWQ (4-bit, ~4GB) │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ │ Config: min_instances=0, max_instances=3, timeout=3600s │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Downloads model (cold start) │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Google Cloud Storage (GCS) │ │
│ │ Bucket: llm-cloud-inference-models │ │
│ │ └── qwen2.5-7b-instruct-awq/ │ │
│ │ ├── config.json │ │
│ │ ├── model.safetensors │ │
│ │ └── tokenizer.json │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Observability Stack (Optional but Recommended) │ │
│ │ - Cloud Monitoring (metrics) │ │
│ │ - Cloud Logging (structured logs) │ │
│ │ - Langfuse (LLM-specific observability) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
# docker/Dockerfile
FROM vllm/vllm-openai:v0.6.4.post1
# Install GCS access tools
# Install curl (needed by the HEALTHCHECK below) and GCS access tools
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    google-cloud-cli \
    && rm -rf /var/lib/apt/lists/*
# Set environment variables
ENV MODEL_NAME="qwen2.5-7b-instruct-awq"
ENV GCS_BUCKET="llm-cloud-inference-models"
ENV PORT=8080
ENV VLLM_ATTENTION_BACKEND=FLASHINFER
# Create model directory
RUN mkdir -p /models
# Copy startup script
COPY docker/startup.sh /startup.sh
RUN chmod +x /startup.sh
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=120s --retries=3 \
CMD curl -f http://localhost:${PORT}/health || exit 1
EXPOSE ${PORT}
ENTRYPOINT ["/startup.sh"]

#!/bin/bash
# docker/startup.sh
set -e
echo "=== LLM Cloud Inference Startup ==="
echo "Model: ${MODEL_NAME}"
echo "GCS Bucket: ${GCS_BUCKET}"
echo "Port: ${PORT}"
# Download model from GCS if not already present
MODEL_PATH="/models/${MODEL_NAME}"
if [ ! -d "${MODEL_PATH}" ] || [ -z "$(ls -A ${MODEL_PATH} 2>/dev/null)" ]; then
echo "Downloading model from GCS..."
mkdir -p "${MODEL_PATH}"
gcloud storage cp -r "gs://${GCS_BUCKET}/${MODEL_NAME}/*" "${MODEL_PATH}/"
echo "Model download complete!"
else
echo "Model already cached locally."
fi
# Calculate optimal settings based on GPU
GPU_MEMORY=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1)
echo "GPU Memory: ${GPU_MEMORY}MB"
# Start vLLM server with optimized settings
echo "Starting vLLM server..."
exec python -m vllm.entrypoints.openai.api_server \
--model "${MODEL_PATH}" \
--host 0.0.0.0 \
--port ${PORT} \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--dtype auto \
--trust-remote-code \
--disable-log-requests \
    --served-model-name "${MODEL_NAME}"

# src/config.py
"""Configuration management for LLM Cloud Inference."""
import os
from dataclasses import dataclass
from typing import Optional
@dataclass
class Config:
"""Application configuration."""
# Model settings
model_name: str = os.getenv("MODEL_NAME", "qwen2.5-7b-instruct-awq")
gcs_bucket: str = os.getenv("GCS_BUCKET", "llm-cloud-inference-models")
model_path: str = os.getenv("MODEL_PATH", "/models")
# Server settings
port: int = int(os.getenv("PORT", "8080"))
host: str = os.getenv("HOST", "0.0.0.0")
# vLLM settings
gpu_memory_utilization: float = float(os.getenv("GPU_MEMORY_UTIL", "0.90"))
max_model_len: int = int(os.getenv("MAX_MODEL_LEN", "8192"))
# Security
api_key: Optional[str] = os.getenv("API_KEY")
require_auth: bool = os.getenv("REQUIRE_AUTH", "false").lower() == "true"
@property
def full_model_path(self) -> str:
return f"{self.model_path}/{self.model_name}"
config = Config()

#!/bin/bash
# scripts/deploy.sh
set -e
# Configuration
PROJECT_ID="${GCP_PROJECT_ID:?Set GCP_PROJECT_ID}"
REGION="${GCP_REGION:-us-central1}"
SERVICE_NAME="llm-api"
IMAGE_NAME="gcr.io/${PROJECT_ID}/llm-cloud-inference:latest"
echo "=== Deploying LLM Cloud Inference ==="
echo "Project: ${PROJECT_ID}"
echo "Region: ${REGION}"
echo "Image: ${IMAGE_NAME}"
# Build and push container
echo "Building container..."
docker build -t ${IMAGE_NAME} -f docker/Dockerfile .
echo "Pushing to GCR..."
docker push ${IMAGE_NAME}
# Deploy to Cloud Run
echo "Deploying to Cloud Run..."
gcloud run deploy ${SERVICE_NAME} \
--image ${IMAGE_NAME} \
--platform managed \
--region ${REGION} \
--gpu 1 \
--gpu-type nvidia-l4 \
--cpu 8 \
--memory 32Gi \
--timeout 3600 \
--min-instances 0 \
--max-instances 3 \
--concurrency 10 \
--port 8080 \
--set-env-vars "MODEL_NAME=qwen2.5-7b-instruct-awq,GCS_BUCKET=llm-cloud-inference-models" \
--service-account "llm-inference@${PROJECT_ID}.iam.gserviceaccount.com" \
--allow-unauthenticated
# Get service URL
SERVICE_URL=$(gcloud run services describe ${SERVICE_NAME} --region ${REGION} --format 'value(status.url)')
echo "=== Deployment Complete ==="
echo "Service URL: ${SERVICE_URL}"
echo "Test with: curl ${SERVICE_URL}/v1/models"

# Makefile
.PHONY: help setup download-model build push deploy test benchmark clean
PROJECT_ID ?= $(shell gcloud config get-value project)
REGION ?= us-central1
IMAGE_NAME = gcr.io/$(PROJECT_ID)/llm-cloud-inference
MODEL_NAME ?= qwen2.5-7b-instruct-awq
help:
@echo "LLM Cloud Inference - Available commands:"
@echo " make setup - Set up GCP project and enable APIs"
@echo " make download-model - Download and prepare model"
@echo " make upload-model - Upload model to GCS"
@echo " make build - Build Docker container"
@echo " make push - Push to Google Container Registry"
@echo " make deploy - Deploy to Cloud Run"
@echo " make test - Test deployed API"
@echo " make benchmark - Run latency benchmarks"
@echo " make logs - View Cloud Run logs"
@echo " make clean - Clean up local resources"
setup:
@echo "Setting up GCP project..."
gcloud services enable run.googleapis.com
gcloud services enable containerregistry.googleapis.com
gcloud services enable storage.googleapis.com
gcloud services enable cloudbuild.googleapis.com
@echo "Creating GCS bucket..."
gsutil mb -l $(REGION) gs://llm-cloud-inference-models || true
@echo "Creating service account..."
gcloud iam service-accounts create llm-inference --display-name "LLM Inference Service" || true
gcloud projects add-iam-policy-binding $(PROJECT_ID) \
--member="serviceAccount:llm-inference@$(PROJECT_ID).iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
download-model:
python scripts/download_and_quantize.py
upload-model:
@echo "Uploading model to GCS..."
gsutil -m cp -r ./models/$(MODEL_NAME)/* gs://llm-cloud-inference-models/$(MODEL_NAME)/
build:
docker build -t $(IMAGE_NAME):latest -f docker/Dockerfile .
push: build
docker push $(IMAGE_NAME):latest
deploy: push
./scripts/deploy.sh
test:
@SERVICE_URL=$$(gcloud run services describe llm-api --region $(REGION) --format 'value(status.url)'); \
echo "Testing $$SERVICE_URL..."; \
curl -s "$$SERVICE_URL/v1/models" | jq .
benchmark:
python scripts/benchmark_latency.py
logs:
gcloud run services logs read llm-api --region $(REGION) --limit 100
clean:
rm -rf ./models/*
	docker rmi $(IMAGE_NAME):latest || true

#!/usr/bin/env python3
# scripts/benchmark_latency.py
"""
Benchmark latency comparison: Self-hosted LLM vs Commercial APIs
Measures: TTFT, TPS, Total latency, Cost per token
"""
import asyncio
import json
import os
import statistics
import time
from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
from typing import Optional
from openai import AsyncOpenAI
# =============================================================================
# Configuration
# =============================================================================
@dataclass
class BenchmarkConfig:
"""Benchmark configuration."""
# Endpoints
self_hosted_url: str = os.getenv("SELF_HOSTED_URL", "")
openai_api_key: str = os.getenv("OPENAI_API_KEY", "")
anthropic_api_key: str = os.getenv("ANTHROPIC_API_KEY", "")
# Test parameters
num_runs: int = 10
warmup_runs: int = 2
max_tokens: int = 256
temperature: float = 0.7
# Test prompts (varying complexity)
prompts: list = field(default_factory=lambda: [
# Short prompt
"What is 2+2?",
# Medium prompt
"Explain the concept of machine learning in 3 sentences.",
# Long prompt
"Write a detailed comparison of Python and JavaScript for web development, covering syntax, performance, ecosystem, and use cases.",
])
@dataclass
class LatencyResult:
"""Single request latency measurements."""
provider: str
model: str
prompt_tokens: int
completion_tokens: int
ttft_ms: float # Time to first token
total_ms: float # Total request time
tps: float # Tokens per second
cost_usd: float # Estimated cost
error: Optional[str] = None
@dataclass
class BenchmarkReport:
"""Aggregated benchmark results."""
provider: str
model: str
num_requests: int
avg_ttft_ms: float
p50_ttft_ms: float
p95_ttft_ms: float
avg_total_ms: float
p50_total_ms: float
p95_total_ms: float
avg_tps: float
total_cost_usd: float
error_rate: float
# =============================================================================
# Pricing (as of Dec 2025)
# =============================================================================
PRICING = {
"gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
"gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
"claude-3-5-sonnet": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
"claude-3-5-haiku": {"input": 0.80 / 1_000_000, "output": 4.00 / 1_000_000},
"self-hosted": {"input": 0, "output": 0}, # Cost calculated separately
}
# =============================================================================
# Benchmark Functions
# =============================================================================
async def benchmark_openai(
client: AsyncOpenAI,
model: str,
prompt: str,
config: BenchmarkConfig,
) -> LatencyResult:
"""Benchmark OpenAI API."""
start_time = time.perf_counter()
ttft = None
tokens_received = 0
try:
stream = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=config.max_tokens,
temperature=config.temperature,
stream=True,
)
async for chunk in stream:
if ttft is None:
ttft = (time.perf_counter() - start_time) * 1000
if chunk.choices[0].delta.content:
tokens_received += 1
total_time = (time.perf_counter() - start_time) * 1000
# Estimate prompt tokens (rough: 4 chars per token)
prompt_tokens = len(prompt) // 4
pricing = PRICING.get(model, PRICING["gpt-4o-mini"])
cost = (prompt_tokens * pricing["input"]) + (tokens_received * pricing["output"])
return LatencyResult(
provider="OpenAI",
model=model,
prompt_tokens=prompt_tokens,
completion_tokens=tokens_received,
ttft_ms=ttft or total_time,
total_ms=total_time,
tps=tokens_received / (total_time / 1000) if total_time > 0 else 0,
cost_usd=cost,
)
except Exception as e:
return LatencyResult(
provider="OpenAI",
model=model,
prompt_tokens=0,
completion_tokens=0,
ttft_ms=0,
total_ms=0,
tps=0,
cost_usd=0,
error=str(e),
)
async def benchmark_self_hosted(
client: AsyncOpenAI,
model: str,
prompt: str,
config: BenchmarkConfig,
) -> LatencyResult:
"""Benchmark self-hosted vLLM API."""
start_time = time.perf_counter()
ttft = None
tokens_received = 0
try:
stream = await client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=config.max_tokens,
temperature=config.temperature,
stream=True,
)
async for chunk in stream:
if ttft is None:
ttft = (time.perf_counter() - start_time) * 1000
if chunk.choices[0].delta.content:
tokens_received += 1
total_time = (time.perf_counter() - start_time) * 1000
prompt_tokens = len(prompt) // 4
return LatencyResult(
provider="Self-Hosted",
model=model,
prompt_tokens=prompt_tokens,
completion_tokens=tokens_received,
ttft_ms=ttft or total_time,
total_ms=total_time,
tps=tokens_received / (total_time / 1000) if total_time > 0 else 0,
cost_usd=0, # Calculated from Cloud Run billing
)
except Exception as e:
return LatencyResult(
provider="Self-Hosted",
model=model,
prompt_tokens=0,
completion_tokens=0,
ttft_ms=0,
total_ms=0,
tps=0,
cost_usd=0,
error=str(e),
)
def calculate_percentile(values: list[float], percentile: float) -> float:
"""Calculate percentile of a list of values."""
if not values:
return 0
sorted_values = sorted(values)
index = int(len(sorted_values) * percentile / 100)
return sorted_values[min(index, len(sorted_values) - 1)]
def aggregate_results(results: list[LatencyResult]) -> BenchmarkReport:
"""Aggregate individual results into a report."""
successful = [r for r in results if r.error is None]
if not successful:
return BenchmarkReport(
provider=results[0].provider if results else "Unknown",
model=results[0].model if results else "Unknown",
num_requests=len(results),
avg_ttft_ms=0,
p50_ttft_ms=0,
p95_ttft_ms=0,
avg_total_ms=0,
p50_total_ms=0,
p95_total_ms=0,
avg_tps=0,
total_cost_usd=0,
error_rate=1.0,
)
ttft_values = [r.ttft_ms for r in successful]
total_values = [r.total_ms for r in successful]
tps_values = [r.tps for r in successful]
return BenchmarkReport(
provider=successful[0].provider,
model=successful[0].model,
num_requests=len(results),
avg_ttft_ms=statistics.mean(ttft_values),
p50_ttft_ms=calculate_percentile(ttft_values, 50),
p95_ttft_ms=calculate_percentile(ttft_values, 95),
avg_total_ms=statistics.mean(total_values),
p50_total_ms=calculate_percentile(total_values, 50),
p95_total_ms=calculate_percentile(total_values, 95),
avg_tps=statistics.mean(tps_values),
total_cost_usd=sum(r.cost_usd for r in successful),
error_rate=(len(results) - len(successful)) / len(results),
)
async def run_benchmark(config: BenchmarkConfig) -> dict:
"""Run full benchmark suite."""
results = {
"timestamp": datetime.now().isoformat(),
"config": {
"num_runs": config.num_runs,
"max_tokens": config.max_tokens,
"prompts": config.prompts,
},
"reports": [],
}
# Initialize clients
clients = {}
if config.openai_api_key:
clients["openai"] = AsyncOpenAI(api_key=config.openai_api_key)
if config.self_hosted_url:
clients["self_hosted"] = AsyncOpenAI(
base_url=f"{config.self_hosted_url}/v1",
api_key="dummy", # vLLM doesn't require auth by default
)
# Run benchmarks
for prompt in config.prompts:
print(f"\n{'='*60}")
print(f"Prompt: {prompt[:50]}...")
print(f"{'='*60}")
# OpenAI benchmarks
if "openai" in clients:
for model in ["gpt-4o-mini", "gpt-4o"]:
print(f"\nBenchmarking OpenAI {model}...")
model_results = []
# Warmup
for _ in range(config.warmup_runs):
await benchmark_openai(clients["openai"], model, prompt, config)
# Actual runs
for i in range(config.num_runs):
result = await benchmark_openai(clients["openai"], model, prompt, config)
model_results.append(result)
print(f" Run {i+1}: TTFT={result.ttft_ms:.0f}ms, Total={result.total_ms:.0f}ms, TPS={result.tps:.1f}")
report = aggregate_results(model_results)
results["reports"].append({
"prompt_preview": prompt[:50],
**report.__dict__,
})
# Self-hosted benchmark
if "self_hosted" in clients:
print(f"\nBenchmarking Self-Hosted...")
model_results = []
model_name = os.getenv("MODEL_NAME", "qwen2.5-7b-instruct-awq")
# Warmup
for _ in range(config.warmup_runs):
await benchmark_self_hosted(clients["self_hosted"], model_name, prompt, config)
# Actual runs
for i in range(config.num_runs):
result = await benchmark_self_hosted(clients["self_hosted"], model_name, prompt, config)
model_results.append(result)
print(f" Run {i+1}: TTFT={result.ttft_ms:.0f}ms, Total={result.total_ms:.0f}ms, TPS={result.tps:.1f}")
report = aggregate_results(model_results)
results["reports"].append({
"prompt_preview": prompt[:50],
**report.__dict__,
})
return results
def print_summary(results: dict):
"""Print benchmark summary."""
print("\n" + "="*80)
print("BENCHMARK SUMMARY")
print("="*80)
# Group by provider
by_provider = {}
for report in results["reports"]:
provider = report["provider"]
if provider not in by_provider:
by_provider[provider] = []
by_provider[provider].append(report)
for provider, reports in by_provider.items():
print(f"\n{provider}:")
print("-" * 40)
avg_ttft = statistics.mean(r["avg_ttft_ms"] for r in reports)
avg_total = statistics.mean(r["avg_total_ms"] for r in reports)
avg_tps = statistics.mean(r["avg_tps"] for r in reports)
total_cost = sum(r["total_cost_usd"] for r in reports)
print(f" Avg TTFT: {avg_ttft:>8.1f} ms")
print(f" Avg Total: {avg_total:>8.1f} ms")
print(f" Avg TPS: {avg_tps:>8.1f} tokens/sec")
print(f" Total Cost: ${total_cost:>7.4f}")
def save_results(results: dict, output_dir: str = "benchmarks/results"):
"""Save benchmark results to JSON."""
Path(output_dir).mkdir(parents=True, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = Path(output_dir) / f"benchmark_{timestamp}.json"
with open(output_file, "w") as f:
json.dump(results, f, indent=2)
print(f"\nResults saved to: {output_file}")
async def main():
"""Main entry point."""
config = BenchmarkConfig()
if not config.self_hosted_url and not config.openai_api_key:
print("Error: Set SELF_HOSTED_URL or OPENAI_API_KEY environment variable")
return
print("Starting LLM Latency Benchmark...")
print(f"Configuration: {config.num_runs} runs, {config.max_tokens} max tokens")
results = await run_benchmark(config)
print_summary(results)
save_results(results)
if __name__ == "__main__":
    asyncio.run(main())

Based on typical observations:
| Provider | Model | TTFT (p50) | Total (p50) | TPS | Cost/1M output tokens |
|---|---|---|---|---|---|
| OpenAI | gpt-4o-mini | 200-400ms | 1-3s | 50-100 | $0.60 |
| OpenAI | gpt-4o | 300-600ms | 2-5s | 40-80 | $10.00 |
| Self-hosted (warm) | Qwen-7B | 100-300ms | 1-2s | 30-60 | ~$5.00* |
| Self-hosted (cold) | Qwen-7B | 60-180s | 60-180s | - | ~$5.00* |
*Self-hosted cost = Cloud Run compute time, roughly $1-2/hour while active (≈$5/1M tokens on an L4 at 50 TPS)
- Cold start is the main trade-off: self-hosted scales to zero, but pays a 1-3 minute cold start
- Warm latency is competitive: once running, self-hosted can match or beat commercial API latency
- Cost efficiency: at sustained utilization, an L4 (~$5/1M tokens) undercuts premium models like gpt-4o (~$10/1M output), though budget models like gpt-4o-mini remain cheaper per token
- Control: self-hosted offers full control over model, context, and data
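To reason about the cold-start trade-off concretely, average perceived latency under scale-to-zero can be modeled from the fraction of requests that land on a cold instance (the numbers below are hypothetical; substitute your own measurements):

```python
def avg_latency_ms(warm_ms: float, cold_start_ms: float, cold_fraction: float) -> float:
    """Expected per-request latency when a fraction of requests hit a cold instance."""
    return cold_fraction * (cold_start_ms + warm_ms) + (1 - cold_fraction) * warm_ms

# With 90s cold starts, 1.5s warm requests, and 1% of requests arriving cold:
print(avg_latency_ms(1_500, 90_000, 0.01))  # 2400.0
```

Even a 1% cold-hit rate roughly doubles average latency here, which is why keeping one warm instance (min_instances=1) is worth pricing out for latency-sensitive workloads.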
Based on Awesome-local-LLM observability tools:
# Enable Cloud Monitoring
gcloud services enable monitoring.googleapis.com
# Create custom dashboard
gcloud monitoring dashboards create --config-from-file=monitoring/dashboard.json

Key metrics to track:
- Request latency (p50, p95, p99)
- Request count by status code
- GPU utilization
- Memory utilization
- Cold start frequency
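The latency percentiles listed above can be computed from raw request timings with the standard library alone (a sketch; the benchmark script ships its own percentile helper):

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from raw request latencies in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

samples = [float(x) for x in range(1, 101)]  # 1..100 ms
print(latency_percentiles(samples))  # p50 is 50.5 for this sample
```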
From Awesome-local-LLM: "langfuse - open-source LLM engineering platform"
# Integration with vLLM
from langfuse import Langfuse
from langfuse.openai import openai
langfuse = Langfuse(
public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
)
# Automatic tracing of OpenAI-compatible calls
client = openai.OpenAI(
base_url="https://your-cloud-run-url/v1",
api_key="dummy",
)
# All calls are now traced in Langfuse
response = client.chat.completions.create(...)

Benefits:
- Token usage tracking
- Latency histograms
- Cost tracking
- Prompt versioning
- Evaluation pipelines
From Awesome-local-LLM: "openllmetry - observability for LLM applications based on OpenTelemetry"
from opentelemetry import trace
from traceloop.sdk import Traceloop
Traceloop.init(app_name="llm-cloud-inference")

| GPU | vCPU | Memory | Price/hour | Monthly (24/7) |
|---|---|---|---|---|
| T4 | 4 | 16GB | ~$0.40 | ~$288 |
| L4 | 8 | 32GB | ~$0.90 | ~$648 |
| A100 40GB | 12 | 85GB | ~$2.50 | ~$1,800 |
- min_instances=0: Pay nothing when idle
- Use quantized models: Fit larger models on smaller GPUs
- Right-size GPU: T4 for <7B, L4 for 7B-13B, A100 for larger
- Set aggressive timeout: Prevent runaway costs
- Implement caching: Reduce redundant inference
OpenAI gpt-4o-mini: $0.15/1M input + $0.60/1M output ≈ $0.75/1M tokens blended
Self-hosted L4 GPU: $0.90/hour
At 50 tokens/second ≈ 180K tokens/hour
Cost per 1M tokens ≈ $0.90 / 0.18 = $5.00
Comparison: at sustained utilization, self-hosted on an L4 undercuts gpt-4o (~$10/1M output) but not gpt-4o-mini; the self-hosted value case is larger models, data control, and predictable per-hour spend rather than beating budget APIs on raw per-token price.
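The per-token arithmetic above generalizes to any GPU price and throughput; a small helper makes it easy to compare configurations:

```python
def self_hosted_cost_per_1m(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Cost per 1M generated tokens for a GPU running at full utilization."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# L4 at $0.90/hour sustaining 50 TPS:
print(round(self_hosted_cost_per_1m(0.90, 50), 2))  # 5.0
```

Note this assumes full utilization; at low request volume the effective per-token cost rises, since you still pay for GPU-seconds spent idle between requests within a billed instance.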
- Research architecture options
- Document architecture decisions
- Create GitHub repository llm-cloud-inference
- Set up GCP project and enable APIs
- Configure billing alerts ($50 limit)
- Download Qwen2.5-7B-Instruct-AWQ
- Create GCS bucket
- Upload model to GCS
- Verify model accessibility
- Create Dockerfile
- Create startup script
- Test container locally with Docker + GPU
- Push to Google Container Registry
- Deploy to Cloud Run with GPU
- Configure environment variables
- Test API endpoints
- Measure cold start time
- Set up benchmark script
- Run latency comparison vs OpenAI
- Generate benchmark report
- Document findings
- Optimize based on benchmarks
- Set up observability
- Complete documentation
- Prepare portfolio presentation
# src/middleware.py
import os

from fastapi import Request, HTTPException

async def verify_api_key(request: Request):
    """Verify the API key from the Authorization header."""
    api_key = request.headers.get("Authorization", "").replace("Bearer ", "")
    expected_key = os.getenv("API_KEY")
    if expected_key and api_key != expected_key:
        raise HTTPException(status_code=401, detail="Invalid API key")

# Deploy with authentication required
gcloud run deploy llm-api \
--no-allow-unauthenticated \
--service-account llm-inference@${PROJECT_ID}.iam.gserviceaccount.com
# Generate identity token for access
gcloud auth print-identity-token

# Store API keys in Secret Manager
gcloud secrets create llm-api-key --replication-policy="automatic"
echo -n "your-secret-key" | gcloud secrets versions add llm-api-key --data-file=-
# Reference in Cloud Run
gcloud run deploy llm-api \
    --set-secrets="API_KEY=llm-api-key:latest"

| Category | Skills |
|---|---|
| Cloud Infrastructure | GCP, Cloud Run, GCS, Container Registry |
| ML Engineering | LLM deployment, vLLM, model quantization |
| DevOps | Docker, CI/CD, IaC concepts |
| Performance Engineering | Latency benchmarking, optimization |
| API Design | OpenAI-compatible REST APIs |
| Cost Optimization | Serverless architecture, resource right-sizing |
- "Deployed production LLM infrastructure, significantly reducing inference cost versus premium commercial APIs"
  - Quantified cost comparison with benchmark data
- "Implemented OpenAI-compatible API enabling zero-code migration for existing applications"
  - Drop-in replacement architecture
- "Optimized cold start latency from 5 minutes to under 3 minutes through model quantization"
  - AWQ 4-bit quantization, model selection
- "Built comprehensive latency benchmarking suite comparing self-hosted vs commercial APIs"
  - TTFT, TPS, total latency metrics
- "Designed serverless GPU architecture with scale-to-zero for optimal cost efficiency"
  - Cloud Run min_instances=0 pattern
# LLM Cloud Inference
Deploy your own OpenAI-compatible LLM API on Google Cloud Run with GPU support.
## Features
- 🚀 OpenAI-compatible API endpoints
- 💰 Scale-to-zero for cost optimization
- 📊 Comprehensive latency benchmarking
- 🔒 Optional authentication
- 📈 Built-in observability
## Quick Start
... (deployment commands)
## Benchmark Results
| Provider | TTFT (p50) | Cost/1M tokens |
|----------|------------|----------------|
| OpenAI gpt-4o | 400ms | ~$10.00 (output) |
| Self-hosted Qwen-7B | 200ms | ~$5.00 |
## Architecture
... (diagram)

- vLLM - High-throughput inference engine
- sglang - Fast serving framework
- llama.cpp - CPU/edge inference
- Langfuse - LLM observability
- OpenLLMetry - OpenTelemetry for LLMs
- r/LocalLLaMA - Local LLM discussions
- r/selfhosted - Self-hosting community
Document Version: 2.0 Last Updated: December 2025 Source: Enhanced with Awesome-local-LLM research