
# Gemma 3 Inference API with vLLM

A FastAPI wrapper around `google/gemma-3-4b-it` using vLLM for high-performance, low-latency inference.

## Setup

```bash
pip install -r requirements.txt
```

A GPU-enabled machine is required, since vLLM runs inference on the GPU.
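The repository's `requirements.txt` is not reproduced in this excerpt; a minimal set of dependencies consistent with the setup above would plausibly look like the following (exact pins are an assumption, not taken from the repo):

```text
# Hypothetical requirements.txt — versions are illustrative
fastapi
uvicorn
vllm
```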

## Run the server

Start the FastAPI server:

```bash
uvicorn app.server:app --host 0.0.0.0 --port 8000
```
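The `app/server.py` module itself is not included in this excerpt. A minimal sketch of what such a wrapper could look like is below, assuming a single `input` request field and an `output`/`latency` response shape; the `create_app` factory and all names are illustrative, and the real module may be structured differently:

```python
import time


def create_app(model_name: str = "google/gemma-3-4b-it"):
    """Build a FastAPI app wrapping a vLLM engine (illustrative sketch)."""
    # Imports are deferred so this module can be inspected without
    # fastapi/vllm installed; in a real server they'd sit at the top.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from vllm import LLM, SamplingParams

    app = FastAPI()
    llm = LLM(model=model_name)          # loads the model onto the GPU
    params = SamplingParams(max_tokens=256)

    class GenerateRequest(BaseModel):
        input: str

    @app.post("/generate")
    def generate(req: GenerateRequest):
        start = time.perf_counter()
        outputs = llm.generate([req.input], params)
        text = outputs[0].outputs[0].text
        return {"output": text, "latency": round(time.perf_counter() - start, 2)}

    return app
```

With this layout, `app/server.py` would expose `app = create_app()` at module level so `uvicorn app.server:app` can find it.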

## Example Request

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"input": "Explain how gravity works."}'
```

## Response

```json
{
  "output": "Gravity is a force that attracts objects with mass toward each other...",
  "latency": 1.94
}
```
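The same request can be made from Python with only the standard library. This is a sketch that assumes the request/response shapes shown above; the helper names (`build_payload`, `parse_response`, `generate`) are illustrative:

```python
import json
import urllib.request


def build_payload(prompt: str) -> bytes:
    """Encode the JSON body the /generate endpoint expects."""
    return json.dumps({"input": prompt}).encode("utf-8")


def parse_response(body: str) -> tuple[str, float]:
    """Extract the generated text and server-reported latency (seconds)."""
    data = json.loads(body)
    return data["output"], data["latency"]


def generate(prompt: str, url: str = "http://localhost:8000/generate") -> tuple[str, float]:
    """Call a running server; requires the uvicorn process started above."""
    req = urllib.request.Request(
        url,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Note that `latency` here is the server-side generation time, not end-to-end round-trip time, which also includes network overhead.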