A high-performance, OpenAI-compatible API server for the nano Qwen3 serving engine, supporting multiple backends (CUDA, MPS, CPU) for efficient local LLM inference.
- 🚀 OpenAI-Compatible API: Drop-in replacement for the OpenAI API
- ⚡ Real-time Streaming: Server-sent events for live token generation
- 🔧 Multi-Backend Support: CUDA (NVIDIA), MPS (Apple Silicon), CPU
- 🎯 High Performance: Efficient memory management and request batching
- 🧠 Multiple Models: Support for various Qwen3 model sizes
- 📊 Health Monitoring: Built-in health checks and performance statistics
- 🔄 Async Support: Full async/await support for high concurrency
- 🛡️ Production Ready: Error handling, logging, and monitoring
- Python 3.8+
- CUDA: NVIDIA GPU with CUDA support
- MPS: Apple Silicon Mac (M1/M2/M3) with macOS 12.3+
- CPU: Any system with sufficient RAM
- 8GB+ RAM (16GB+ recommended)
# Clone the repository
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving
# Install dependencies
pip install -r requirements.txt

# Start with auto-detection (recommended)
python tools/start_service.py
# Start with specific device
python tools/start_service.py --device cuda
python tools/start_service.py --device mps
python tools/start_service.py --device cpu
# Start on custom port
python tools/start_service.py --port 8001
# Start with different model
python tools/start_service.py --model Qwen/Qwen3-1.5B --device auto

# Health check
curl -X GET http://127.0.0.1:8000/health
# List available models
curl -X GET http://127.0.0.1:8000/v1/models
# Basic chat completion
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"max_tokens": 50
}'

POST /v1/chat/completions
Generate chat completions with conversation context.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-0.6b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"}
],
"max_tokens": 100,
"temperature": 0.7,
"stream": false
}'

Enable real-time token generation:
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": true,
"max_tokens": 100
}'

POST /v1/completions
curl -X POST http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-0.6b",
"prompt": "The future of artificial intelligence is",
"max_tokens": 50,
"temperature": 0.7
}'

- GET /v1/models - List available models
- GET /health - Health check
- GET /stats - Performance statistics
import requests
# Chat completion
response = requests.post(
"http://127.0.0.1:8000/v1/chat/completions",
json={
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 50
}
)
result = response.json()
print(result["choices"][0]["message"]["content"])

import requests
import json
# Streaming chat completion
response = requests.post(
"http://127.0.0.1:8000/v1/chat/completions",
json={
"model": "qwen3-0.6b",
"messages": [
{"role": "user", "content": "Tell me a story"}
],
"stream": True,
"max_tokens": 100
},
headers={"Accept": "text/event-stream"},
stream=True
)
for line in response.iter_lines():
    if line:
        line = line.decode('utf-8')
        if line.startswith('data: '):
            data_str = line[6:]
            if data_str == '[DONE]':
                break
            try:
                chunk = json.loads(data_str)
                if 'choices' in chunk and chunk['choices']:
                    delta = chunk['choices'][0].get('delta', {})
                    if 'content' in delta:
                        print(delta['content'], end='', flush=True)
            except json.JSONDecodeError:
                continue

import openai
# Configure to use the local server (this style requires openai<1.0;
# newer clients use OpenAI(base_url=...) instead)
openai.api_base = "http://127.0.0.1:8000/v1"
openai.api_key = "dummy-key"  # Not validated by the local server
# Use like OpenAI API
response = openai.ChatCompletion.create(
model="qwen3-0.6b",
messages=[
{"role": "user", "content": "Hello!"}
],
max_tokens=50
)
print(response.choices[0].message.content)

| Option | Default | Description |
|---|---|---|
| `--host` | 127.0.0.1 | Host to bind to |
| `--port` | 8000 | Port to bind to |
| `--model` | Qwen/Qwen3-0.6B | Model name or path |
| `--device` | mps | Device to use (cuda, mps, cpu) |
| `--dtype` | float16 | Data type |
| `--max-queue-size` | 1000 | Maximum request queue size |
| `--num-blocks` | 1024 | Number of KV cache memory blocks |
| `--block-size` | 16 | Tokens per memory block |
| `--max-seq-length` | 4096 | Maximum sequence length |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | required | Model name |
| `messages` | array | required | Chat messages |
| `max_tokens` | integer | 100 | Maximum tokens to generate |
| `temperature` | float | 1.0 | Sampling temperature (0-2) |
| `top_p` | float | 1.0 | Top-p sampling (0-1) |
| `stream` | boolean | false | Enable streaming |
| `stop` | string/array | null | Stop sequences |
| `presence_penalty` | float | 0.0 | Presence penalty (-2 to 2) |
| `frequency_penalty` | float | 0.0 | Frequency penalty (-2 to 2) |
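As a quick illustration, these parameters combine into a single JSON request body. The values below are arbitrary examples, not recommended settings:

```python
import json

# Illustrative request body for /v1/chat/completions combining the
# parameters in the table above (example values only).
payload = {
    "model": "qwen3-0.6b",
    "messages": [{"role": "user", "content": "Name three planets"}],
    "max_tokens": 64,
    "temperature": 0.7,       # below 1.0 makes sampling more deterministic
    "top_p": 0.9,             # nucleus sampling cutoff
    "stop": ["\n\n"],         # stop generation at the first blank line
    "presence_penalty": 0.5,
    "frequency_penalty": 0.5,
}
print(json.dumps(payload, indent=2))
```

Send it with `requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)` as in the Python client examples above.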
| Model | Tokens/sec | Memory Usage | Latency |
|---|---|---|---|
| Qwen3-0.6B | ~25 | ~2GB | ~50ms |
| Qwen3-1.5B | ~15 | ~4GB | ~80ms |
| Qwen3-3B | ~8 | ~8GB | ~120ms |
The service uses efficient block-based memory management:
- Dynamic Allocation: Memory blocks allocated on-demand
- Garbage Collection: Automatic cleanup of unused blocks
- Cache Optimization: KV cache management for better performance
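To make the idea concrete, here is a minimal sketch of block-based allocation. This is a hypothetical simplification for illustration, not the project's actual `BlockManager` implementation:

```python
# Sketch of block-based KV cache bookkeeping: blocks are allocated
# on demand per sequence and garbage-collected when the sequence ends.
class BlockManager:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # all blocks start free
        self.seq_blocks = {}                        # seq_id -> [block ids]

    def allocate(self, seq_id: int, num_tokens: int) -> list:
        """Allocate enough blocks to hold num_tokens KV entries."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted")
        blocks = [self.free_blocks.pop() for _ in range(needed)]
        self.seq_blocks.setdefault(seq_id, []).extend(blocks)
        return blocks

    def free(self, seq_id: int) -> None:
        """Garbage-collect all blocks held by a finished sequence."""
        self.free_blocks.extend(self.seq_blocks.pop(seq_id, []))

mgr = BlockManager(num_blocks=1024, block_size=16)
mgr.allocate(seq_id=1, num_tokens=50)  # 50 tokens -> 4 blocks of 16
print(len(mgr.free_blocks))            # 1020
mgr.free(seq_id=1)
print(len(mgr.free_blocks))            # 1024
```

With the defaults above (`--num-blocks 1024`, `--block-size 16`), the cache holds up to 16384 tokens of KV state across all active sequences.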
┌──────────────────────────────────────────────────────────────────┐
│                            API Layer                             │
├──────────────────────────────────────────────────────────────────┤
│  FastAPI Server   │  OpenAI Service   │  AsyncLLM                │
│  (HTTP/WebSocket) │  (Request/Resp)   │  (Async Interface)       │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                            Core Layer                            │
├──────────────────────────────────────────────────────────────────┤
│  LLM (High-level)  │  LLMEngine (Orchestrator)  │ AsyncLLMEngine │
│  (User Interface)  │  (Request Management)      │ (Async Engine) │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│                          Execution Layer                         │
├──────────────────────────────────────────────────────────────────┤
│  ModelRunner  │  DeviceManager   │  Scheduler  │  BlockManager   │
│  (Inference)  │  (CUDA/MPS/CPU)  │  (Queuing)  │  (Memory)       │
└──────────────────────────────────────────────────────────────────┘
- FastAPI Server: HTTP/WebSocket endpoints (/v1/chat/completions, /v1/completions)
- OpenAI Service: OpenAI-compatible request/response handling
- AsyncLLM: High-level async interface for concurrent processing
- LLM: High-level synchronous interface for text generation
- LLMEngine: Core orchestrator managing requests, batching, and inference
- AsyncLLMEngine: Async wrapper for concurrent request handling
- ModelRunner: Model execution and inference on multiple backends
- DeviceManager: Multi-backend support (CUDA, MPS, CPU) with auto-detection
- Scheduler: Request queuing, prioritization, and batch scheduling
- BlockManager: Efficient KV cache memory management
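The Scheduler's queuing-and-batching role can be sketched as a simple FIFO with backpressure. This is an illustrative simplification, not the project's actual scheduler:

```python
from collections import deque

# Sketch of request queuing and batch scheduling: requests are queued
# up to a limit, then drained in fixed-size batches per inference step.
class Scheduler:
    def __init__(self, max_batch_size: int = 8, max_queue_size: int = 1000):
        self.queue = deque()
        self.max_batch_size = max_batch_size
        self.max_queue_size = max_queue_size

    def submit(self, request_id: str) -> bool:
        """Queue a request; reject (backpressure) when the queue is full."""
        if len(self.queue) >= self.max_queue_size:
            return False
        self.queue.append(request_id)
        return True

    def next_batch(self) -> list:
        """Pop up to max_batch_size requests for one inference step."""
        batch = []
        while self.queue and len(batch) < self.max_batch_size:
            batch.append(self.queue.popleft())
        return batch
```

The `--max-queue-size` option above corresponds to the rejection threshold; batching amortizes per-step overhead across concurrent requests.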
See the examples/ directory for comprehensive usage examples:
- examples/openai_client_examples.py - Complete client examples
- examples/basic_usage.py - Basic usage patterns
- examples/streaming_example.py - Streaming examples
📖 Full Documentation: https://nano-qwen3-serving.readthedocs.io/

📚 Alternative: GitHub Pages
- Quick Start: Get up and running in minutes
- Installation: Detailed installation guide
- API Reference: Complete API documentation
- Examples: Usage examples
- Contributing: How to contribute
- Troubleshooting: Common issues and solutions
# Install in development mode
pip install -e .
# Run with auto-reload
python tools/start_service.py --reload

# Run with multiple workers
python tools/start_service.py --workers 4 --host 0.0.0.0
# Behind reverse proxy (nginx)
location / {
proxy_pass http://127.0.0.1:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}

FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8000
CMD ["python", "tools/start_service.py", "--host", "0.0.0.0"]
- Service won't start
  - Check if the port is already in use
  - Verify the model path is correct
  - Ensure sufficient memory
- Slow responses
  - Check the device (MPS recommended for Apple Silicon)
  - Monitor memory usage
  - Adjust batch size and queue settings
- Streaming issues
  - Ensure the Accept: text/event-stream header is set
  - Check for network timeouts
  - Verify the client handles streaming properly
# Enable debug logging
python tools/start_service.py --log-level debug
# Check service status
curl -X GET http://127.0.0.1:8000/health
# Monitor performance
curl -X GET http://127.0.0.1:8000/stats

We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Clone and setup
git clone https://github.com/hsliuustc/nano-qwen3-serving.git
cd nano-qwen3-serving
# Install development dependencies
pip install -r requirements.txt
pip install -e .
# Run tests
python -m pytest tests/
# Run linting
python -m flake8 nano_qwen3_serving/

This project is licensed under the MIT License - see the LICENSE file for details.
- Qwen Team for the excellent Qwen3 models
- OpenAI for the API specification
- Apple for MPS acceleration
- FastAPI for the web framework
- PyTorch for the deep learning framework
- 📧 Email: your-email@example.com
- 💬 Discussions: GitHub Discussions
- 🐛 Issues: GitHub Issues

Made with ❤️ for the AI community