This project is a lightweight serving engine for Qwen LLMs, built with FastAPI and PyTorch. It supports continuous batching and paged attention for efficient, scalable inference.

## Features
- FastAPI for HTTP serving
- Streaming and non-streaming responses
- Continuous batching
- Paged attention
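Continuous batching means that finished sequences leave the running batch immediately and queued requests join mid-flight, instead of the server waiting for an entire batch to drain. The following is a toy sketch of that scheduling idea only; the names and structure are illustrative and not taken from this project's code:

```python
from collections import deque

def continuous_batching(requests, max_batch_size):
    """Toy scheduler illustrating continuous batching: finished sequences
    free their batch slot immediately, and waiting requests are admitted
    as soon as a slot opens, rather than between whole batches."""
    waiting = deque(requests)  # (request_id, tokens_to_generate)
    active = {}                # request_id -> tokens still to generate
    steps = []                 # which requests ran at each decode step
    while waiting or active:
        # Admit waiting requests into any free batch slots.
        while waiting and len(active) < max_batch_size:
            rid, n = waiting.popleft()
            active[rid] = n
        steps.append(sorted(active))
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed for the next request
    return steps

# With batch size 2, request "c" joins as soon as "a" finishes:
print(continuous_batching([("a", 1), ("b", 3), ("c", 2)], max_batch_size=2))
# → [['a', 'b'], ['b', 'c'], ['b', 'c']]
```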
## Requirements

- Python 3.8+

## Installation

1. Clone the repository:

   ```shell
   git clone <repository-url>
   cd qwen-serving-engine
   ```

2. Install the dependencies:

   ```shell
   pip install -r requirements.txt
   ```
## Usage

Start the server using Uvicorn:

```shell
uvicorn main:app --host 0.0.0.0 --port 8000 --log-level info
```

## API Endpoints

### POST /generate

Generate text based on a given prompt.
- Request: `GenerationRequest`
- Response: `GenerationResponse`
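The authoritative schemas live in the project's code; as a rough sketch, `GenerationRequest` presumably carries the fields shown in the curl example below. The sketch uses stdlib dataclasses purely for illustration, and the `stream` and `finish_reason` fields (and all defaults) are assumptions, not this project's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    prompt: str
    max_tokens: int = 50      # default is illustrative, not the project's
    temperature: float = 0.7  # default is illustrative, not the project's
    stream: bool = False      # assumed flag for streaming vs. non-streaming

@dataclass
class GenerationResponse:
    text: str
    finish_reason: Optional[str] = None  # assumed field
```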
### Chat completions

An OpenAI-compatible endpoint for chat completions.
- Request: `ChatCompletionRequest`
- Response: `ChatCompletionResponse`
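Since the endpoint is OpenAI-compatible, a request body would presumably follow the OpenAI chat-completions schema. A sketch of such a body, where the model identifier and the exact set of optional fields are assumptions rather than this project's documented schema:

```python
import json

# Sketch of an OpenAI-style chat-completions request body.
# "qwen" as the model identifier is an assumption for illustration.
chat_request = {
    "model": "qwen",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    "max_tokens": 50,
    "temperature": 0.7,
}
body = json.dumps(chat_request)
```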
## Example

Here's an example using curl to make a request to the /generate endpoint:

```shell
curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "max_tokens": 50, "temperature": 0.7}'
```

## Testing

Run the tests using pytest to ensure everything works correctly:

```shell
pytest tests/
```
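Equivalently, the curl request above can be issued from Python using only the standard library. The endpoint path and request fields are taken from the curl example; the helper names below are illustrative:

```python
import json
import urllib.request

def build_generate_payload(prompt: str, max_tokens: int = 50,
                           temperature: float = 0.7) -> dict:
    """Request body matching the curl example above."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def generate(prompt: str, base_url: str = "http://localhost:8000", **kwargs) -> dict:
    """POST to /generate on a running server and return the decoded JSON response."""
    data = json.dumps(build_generate_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```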