urvishdesai/ai-inference
Gemma 3 Inference API with vLLM

A FastAPI wrapper around google/gemma-3-4b-it that uses vLLM for high-throughput, low-latency inference.

Setup

pip install -r requirements.txt

Requires a GPU-enabled machine; vLLM is installed by the command above and needs a CUDA-capable GPU to load the model.
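The repository's requirements.txt is not reproduced here; for a setup like this it would typically list at least the following packages (unpinned here because the exact versions used are an assumption):

```text
vllm
fastapi
uvicorn
```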

Run the server

Start the FastAPI server:

uvicorn app.server:app --host 0.0.0.0 --port 8000
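The repo's app/server.py is not shown above; a minimal sketch of what such a server might look like is below. It assumes vLLM and FastAPI are installed and a GPU is available; the sampling parameters are illustrative defaults, not values taken from the repo, and the request/response field names mirror the examples that follow.

```python
# Hypothetical sketch of app/server.py (not the repo's actual implementation).
import time

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()

# Loads model weights onto the GPU once, at process startup.
llm = LLM(model="google/gemma-3-4b-it")


class GenerateRequest(BaseModel):
    input: str


@app.post("/generate")
def generate(req: GenerateRequest):
    start = time.perf_counter()
    # max_tokens/temperature are assumed values for illustration.
    params = SamplingParams(max_tokens=256, temperature=0.7)
    outputs = llm.generate([req.input], params)
    text = outputs[0].outputs[0].text
    return {"output": text, "latency": round(time.perf_counter() - start, 2)}
```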

Example Request

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"input": "Explain how gravity works."}'

Response

{
  "output": "Gravity is a force that attracts objects with mass toward each other...",
  "latency": 1.94
}
