Skip to content

Latest commit

 

History

History
93 lines (68 loc) · 3.52 KB

File metadata and controls

93 lines (68 loc) · 3.52 KB

🚀 Unified Llama OpenAI-Compatible API Gateway

A comprehensive set of service configurations to deploy a high-performance, private AI infrastructure. This project leverages llama.cpp, whisper.cpp, and various specialized models to provide a unified OpenAI-compatible API interface through a KrakenD API Gateway.

🌟 Features

  • Full OpenAI Compatibility: Seamlessly use your favorite AI tools and clients.
  • Unified Gateway: Single entry point for completions, tools, embeddings, and audio services via KrakenD.
  • GPU Optimized: Configurations tuned for CUDA-accelerated inference.
  • Robust Deployment: Ready-to-use systemd service files for automated startup and recovery.
  • Diverse Model Support:
    • 🧠 LLM: Qwen, DeepSeek, Llama.
    • 🔍 Embeddings: BGE-M3, Nomic-Embed.
    • 🔄 Reranking: BGE-Reranker.
    • 🎙️ Audio: Whisper (Turbo, Large-v3) & XTTS/Silero TTS.

🏗️ Architecture

The infrastructure is split into individual microservices unified by a KrakenD gateway:

Service Endpoint (Internal) Purpose Backend Engine
aismart.service :6150 High-quality smart completion llama-server (Qwen3.5-35B)
aifast.service :6155 Performance-optimized completion llama-server (Fast models)
aicoder.service :5000 Specialized coding completions llama-server (DeepSeek-Coder)
aiembed.service :5500 Text embedding generation llama-server (BGE-M3)
airerank.service :5550 Search result reranking llama-server (BGE-Reranker)
whisper.service :5005 STT (Speech-to-Text) whisper-server (Whisper Large-v3)
xtts.service :5050 / :10200 TTS (Text-to-Speech) Silero / XTTS API Server

🛠️ Combined Endpoints (API Gateway)

The KrakenD Gateway (running on port 9000) exposes the following unified endpoints:

💬 Chat & Completions

  • POST /v1/chat/completions - Unified chat interface.
  • POST /v1/completions - Legacy completion support.
  • POST /v1/tools - Tool-use and function calling support.

🔍 Search & Retrieval

  • POST /v1/embeddings - Generate vector representations of text.
  • POST /v1/rerank - Rank documents based on query relevance.

🔊 Audio Services

  • POST /v1/audio/transcriptions - Convert audio to text.
  • POST /v1/audio/translations - Translate audio in real-time.
  • POST /v1/audio/speech - Convert text to natural-sounding audio.

🚀 Installation & Setup

1. Requirements

  • Linux OS (Ubuntu/Debian recommended)
  • NVIDIA GPU with CUDA drivers
  • llama.cpp and whisper.cpp compiled and available in /ai/
  • KrakenD installed

2. Service Deployment

Copy the .service files to your systemd directory:

cp *.service /etc/systemd/system/
systemctl daemon-reload

Enable and start the services you need:

systemctl enable --now aismart aiembed whisper xtts

3. API Gateway Configuration

Deploy the KrakenD configuration:

krakend run -c krakend.json

The gateway includes API Key authentication by default (configurable in krakend.json).


🔐 Authentication

Security is handled at the gateway level. You can manage access via the auth/api-keys section in krakend.json.

Default Key: a132b20c-96be-467f-a15a-ed08aed67888


📜 License

This project configuration is provided for convenience. Ensure you comply with the licenses of the individual models (Qwen, DeepSeek, Whisper, BGE) and engines (llama.cpp, whisper.cpp) used.