🚀 Unified Llama OpenAI-Compatible API Gateway

A comprehensive set of service configurations to deploy a high-performance, private AI infrastructure. This project leverages llama.cpp, whisper.cpp, and various specialized models to provide a unified OpenAI-compatible API interface through a KrakenD API Gateway.

🌟 Features

Full OpenAI Compatibility: Seamlessly use your favorite AI tools and clients.
Unified Gateway: Single entry point for completions, tools, embeddings, and audio services via KrakenD.
GPU Optimized: Configurations tuned for CUDA-accelerated inference.
Robust Deployment: Ready-to-use systemd service files for automated startup and recovery.
Diverse Model Support:
- 🧠 LLM: Qwen, DeepSeek, Llama.
- 🔍 Embeddings: BGE-M3, Nomic-Embed.
- 🔄 Reranking: BGE-Reranker.
- 🎙️ Audio: Whisper (Turbo, Large-v3) & XTTS/Silero TTS.

🏗️ Architecture

The infrastructure is split into individual microservices unified by a KrakenD gateway:

Service	Endpoint (Internal)	Purpose	Backend Engine
`aismart.service`	`:6150`	High-quality smart completion	`llama-server` (Qwen3.5-35B)
`aifast.service`	`:6155`	Performance-optimized completion	`llama-server` (Fast models)
`aicoder.service`	`:5000`	Specialized coding completions	`llama-server` (DeepSeek-Coder)
`aiembed.service`	`:5500`	Text embedding generation	`llama-server` (BGE-M3)
`airerank.service`	`:5550`	Search result reranking	`llama-server` (BGE-Reranker)
`whisper.service`	`:5005`	STT (Speech-to-Text)	`whisper-server` (Whisper Large-v3)
`xtts.service`	`:5050` / `:10200`	TTS (Text-to-Speech)	Silero / XTTS API Server

🛠️ Combined Endpoints (API Gateway)

The KrakenD Gateway (running on port 9000) exposes the following unified endpoints:

💬 Chat & Completions

POST /v1/chat/completions - Unified chat interface.
POST /v1/completions - Legacy completion support.
POST /v1/tools - Tool-use and function calling support.

🔍 Search & Retrieval

POST /v1/embeddings - Generate vector representations of text.
POST /v1/rerank - Rank documents based on query relevance.

🔊 Audio Services

POST /v1/audio/transcriptions - Convert audio to text.
POST /v1/audio/translations - Translate audio in real-time.
POST /v1/audio/speech - Convert text to natural-sounding audio.

🚀 Installation & Setup

1. Requirements

Linux OS (Ubuntu/Debian recommended)
NVIDIA GPU with CUDA drivers
llama.cpp and whisper.cpp compiled and available in /ai/
KrakenD installed

2. Service Deployment

Copy the .service files to your systemd directory:

cp *.service /etc/systemd/system/
systemctl daemon-reload

Enable and start the services you need:

systemctl enable --now aismart aiembed whisper xtts

3. API Gateway Configuration

Deploy the KrakenD configuration:

krakend run -c krakend.json

The gateway includes API Key authentication by default (configurable in krakend.json).

🔐 Authentication

Security is handled at the gateway level. You can manage access via the auth/api-keys section in krakend.json.

Default Key: a132b20c-96be-467f-a15a-ed08aed67888

📜 License

This project configuration is provided for convenience. Ensure you comply with the licenses of the individual models (Qwen, DeepSeek, Whisper, BGE) and engines (llama.cpp, whisper.cpp) used.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🚀 Unified Llama OpenAI-Compatible API Gateway

🌟 Features

🏗️ Architecture

🛠️ Combined Endpoints (API Gateway)

💬 Chat & Completions

🔍 Search & Retrieval

🔊 Audio Services

🚀 Installation & Setup

1. Requirements

2. Service Deployment

3. API Gateway Configuration

🔐 Authentication

📜 License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

🚀 Unified Llama OpenAI-Compatible API Gateway

🌟 Features

🏗️ Architecture

🛠️ Combined Endpoints (API Gateway)

💬 Chat & Completions

🔍 Search & Retrieval

🔊 Audio Services

🚀 Installation & Setup

1. Requirements

2. Service Deployment

3. API Gateway Configuration

🔐 Authentication

📜 License