
# Gemma 3 Inference API with vLLM

A FastAPI wrapper around `google/gemma-3-4b-it` using vLLM for high-performance, low-latency inference.

## Setup

```bash
pip install -r requirements.txt
```

A GPU-enabled machine is required, since vLLM runs inference on the GPU.
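The repository's `requirements.txt` is not reproduced in this excerpt; a minimal set of dependencies consistent with the setup above would plausibly look like the following (exact pins are an assumption, not taken from the repo):

```text
# Hypothetical requirements.txt — versions are illustrative
fastapi
uvicorn
vllm
```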

## Run the server

Start the FastAPI server:

```bash
uvicorn app.server:app --host 0.0.0.0 --port 8000
```
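The `app/server.py` module itself is not included in this excerpt. A minimal sketch of what such a wrapper could look like is below, assuming a single `input` request field and an `output`/`latency` response shape; the `create_app` factory and all names are illustrative, and the real module may be structured differently:

```python
import time


def create_app(model_name: str = "google/gemma-3-4b-it"):
    """Build a FastAPI app wrapping a vLLM engine (illustrative sketch)."""
    # Imports are deferred so this module can be inspected without
    # fastapi/vllm installed; in a real server they'd sit at the top.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from vllm import LLM, SamplingParams

    app = FastAPI()
    llm = LLM(model=model_name)          # loads the model onto the GPU
    params = SamplingParams(max_tokens=256)

    class GenerateRequest(BaseModel):
        input: str

    @app.post("/generate")
    def generate(req: GenerateRequest):
        start = time.perf_counter()
        outputs = llm.generate([req.input], params)
        text = outputs[0].outputs[0].text
        return {"output": text, "latency": round(time.perf_counter() - start, 2)}

    return app
```

With this layout, `app/server.py` would expose `app = create_app()` at module level so `uvicorn app.server:app` can find it.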

## Example Request

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"input": "Explain how gravity works."}'
```

## Response

```json
{
  "output": "Gravity is a force that attracts objects with mass toward each other...",
  "latency": 1.94
}
```
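The same request can be made from Python with only the standard library. This is a sketch that assumes the request/response shapes shown above; the helper names (`build_payload`, `parse_response`, `generate`) are illustrative:

```python
import json
import urllib.request


def build_payload(prompt: str) -> bytes:
    """Encode the JSON body the /generate endpoint expects."""
    return json.dumps({"input": prompt}).encode("utf-8")


def parse_response(body: str) -> tuple[str, float]:
    """Extract the generated text and server-reported latency (seconds)."""
    data = json.loads(body)
    return data["output"], data["latency"]


def generate(prompt: str, url: str = "http://localhost:8000/generate") -> tuple[str, float]:
    """Call a running server; requires the uvicorn process started above."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Note that `latency` here is the server-side generation time, not end-to-end round-trip time, which also includes network overhead.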