A FastAPI wrapper around google/gemma-3-4b-it, using vLLM for high-throughput, low-latency inference.
Install dependencies:

```bash
pip install -r requirements.txt
```

Works on a GPU-enabled machine with vLLM installed.
Start the FastAPI server:

```bash
uvicorn app.server:app --host 0.0.0.0 --port 8000
```

Send a generation request:

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"input": "Explain how gravity works."}'
```

Example response:

```json
{
  "output": "Gravity is a force that attracts objects with mass toward each other...",
  "latency": 1.94
}
```
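The `latency` field reports server-side generation time in seconds. A minimal sketch of how a `/generate` handler could produce this response shape — the `run_model` stub below is hypothetical and stands in for the actual vLLM call, since the real `app/server.py` is not shown here:

```python
import json
import time


def run_model(prompt: str) -> str:
    # Hypothetical stand-in for the real vLLM generation call
    # (e.g. llm.generate(...) against gemma-3-4b-it).
    return "Gravity is a force that attracts objects with mass toward each other..."


def handle_generate(payload: dict) -> dict:
    # Time the generation so the response can report latency in seconds,
    # matching the /generate response shape shown above.
    start = time.perf_counter()
    output = run_model(payload["input"])
    latency = round(time.perf_counter() - start, 2)
    return {"output": output, "latency": latency}


result = handle_generate({"input": "Explain how gravity works."})
print(json.dumps(result))
```

In the real service the timer would wrap only the model call, so queueing and JSON serialization overhead are excluded from the reported latency.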