
llm‑router‑api

llm-router-api is a lightweight Python library that provides a flexible, extensible proxy for Large Language Model (LLM) back-ends. It abstracts away the details of multiple model providers (OpenAI-compatible, Ollama, vLLM, LM Studio, etc.) and offers a unified REST interface with built-in load balancing, health checking, and monitoring.

Repository: https://github.com/radlab-dev-group/llm-router


Features

  • Unified API – One REST surface (/api/...) that proxies calls to any supported LLM back‑end.
  • Provider Selection – Choose a provider per request using pluggable strategies (balanced, weighted, adaptive, first‑available).
  • Prompt Management – System prompts are stored as files and can be dynamically injected with placeholder substitution.
  • Streaming Support – Transparent streaming for both OpenAI‑compatible and Ollama endpoints.
  • Health Checks – Built‑in ping endpoint and Redis‑based provider health monitoring.
  • Prometheus Metrics – Optional instrumentation for request counts, latencies, and error rates.
  • Auto‑Discovery – Endpoints are automatically discovered and instantiated at startup.
  • Extensible – Add new providers, strategies, or custom endpoints with minimal boilerplate.

Installation

The project uses Python 3.10.6 and a virtualenv‑based workflow.

```bash
# Clone the repository
git clone https://github.com/radlab-dev-group/llm-router.git
cd llm-router

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install the package (including optional extras)
pip install -e ".[metrics]"   # quotes keep the extras spec intact in zsh; installs Prometheus support
```

All required third-party libraries are listed in requirements.txt (Flask, requests, redis, rdl-ml-utils, among others).


Configuration

Configuration is driven primarily by environment variables and a JSON model‑config file.

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_ROUTER_PROMPTS_DIR` | Directory containing predefined system prompts. | `resources/prompts` |
| `LLM_ROUTER_MODELS_CONFIG` | Path to the models configuration JSON file. | `resources/configs/models-config.json` |
| `LLM_ROUTER_DEFAULT_EP_LANGUAGE` | Default language for endpoint prompts. | `pl` |
| `LLM_ROUTER_TIMEOUT` | Timeout (seconds) for llm-router API calls. | `0` |
| `LLM_ROUTER_EXTERNAL_TIMEOUT` | Timeout (seconds) for external model API calls. | `300` |
| `LLM_ROUTER_LOG_FILENAME` | Name of the log file. | `llm-router.log` |
| `LLM_ROUTER_LOG_LEVEL` | Logging level (e.g., `INFO`, `DEBUG`). | `INFO` |
| `LLM_ROUTER_EP_PREFIX` | Prefix for all API endpoints. | `/api` |
| `LLM_ROUTER_MINIMUM` | Run the service in proxy-only mode (boolean). | `False` |
| `LLM_ROUTER_IN_DEBUG` | Run the server in debug mode (boolean). | `False` |
| `LLM_ROUTER_BALANCE_STRATEGY` | Strategy used to balance routing between LLM providers. Allowed values are `balanced`, `weighted`, `dynamic_weighted` (beta), `first_available`, and `first_available_optim`, as defined in `constants_base.py`. | `balanced` |
| `LLM_ROUTER_REDIS_HOST` | Redis host for load balancing when a multi-provider model is available. | `<empty string>` |
| `LLM_ROUTER_REDIS_PORT` | Redis port for load balancing when a multi-provider model is available. | `6379` |
| `LLM_ROUTER_REDIS_PASSWORD` | Password for the Redis connection. | `<not set>` |
| `LLM_ROUTER_REDIS_DB` | Redis database number. | `0` |
| `LLM_ROUTER_SERVER_TYPE` | Server implementation to use (`flask`, `gunicorn`, `waitress`). | `flask` |
| `LLM_ROUTER_SERVER_PORT` | Port on which the server listens. | `8080` |
| `LLM_ROUTER_SERVER_HOST` | Host address for the server. | `localhost` |
| `LLM_ROUTER_SERVER_WORKERS_COUNT` | Number of workers (used when the selected server type supports multiple workers). | `2` |
| `LLM_ROUTER_SERVER_THREADS_COUNT` | Number of worker threads (used when the selected server type supports multithreading). | `8` |
| `LLM_ROUTER_SERVER_WORKER_CLASS` | Worker class to use, if the selected server type accepts one. | `None` |
| `LLM_ROUTER_USE_PROMETHEUS` | Enable Prometheus metrics collection. When set to `True`, the router registers a `/metrics` endpoint exposing Prometheus-compatible metrics for monitoring. | `False` |
| `LLM_ROUTER_FORCE_MASKING` | Force masking of each endpoint's payload. Every key and value is masked before being sent to the model provider. | `False` |
| `LLM_ROUTER_MASKING_WITH_AUDIT` | When enabled, each masking operation is recorded in an audit log, providing a tamper-evident record of what data was masked and when, for compliance and traceability. | `False` |
| `LLM_ROUTER_MASKING_STRATEGY_PIPELINE` | Ordered list of masking strategies applied to the request payload. For example, `['fast_masker', 'my_new_masker_strategy']` runs the fast masker first, then `my_new_masker_strategy`. This allows flexible, composable masking flows. | `['fast_masker']` |
| `LLM_ROUTER_FORCE_GUARDRAIL_REQUEST` | Force guardrail evaluation on every request. | `False` |
| `LLM_ROUTER_GUARDRAIL_WITH_AUDIT_REQUEST` | Audit all guardrail decisions (request). | `False` |
| `GUARDRAIL_STRATEGY_PIPELINE_REQUEST` | Ordered list of guardrail strategies (request). | `-` |
| `LLM_ROUTER_FORCE_GUARDRAIL_RESPONSE` | Force guardrail evaluation on every response before the user receives the result. | `False` |
| `LLM_ROUTER_GUARDRAIL_WITH_AUDIT_RESPONSE` | Audit all guardrail decisions (response). | `False` |
| `LLM_ROUTER_GUARDRAIL_STRATEGY_PIPELINE_RESPONSE` | Ordered list of guardrail strategies (response). | `-` |
| `LLM_ROUTER_GUARDRAIL_NASK_GUARD_HOST` | Host and port where the naskguard service is running. NOTE: read the plugin license before using the proposed model! | `-` |
| `LLM_ROUTER_GUARDRAIL_SOJKA_GUARD_HOST` | Host and port where the sojkaguard service is running. | `-` |
| `LLM_ROUTER_SERVICES_MONITOR_INTERVAL_SECONDS` | Interval for checking service availability. Values lower than 1 disable the monitor (it is not started). | `5` seconds |
| `LLM_ROUTER_KEEPALIVE_MODEL_MONITOR_INTERVAL_SECONDS` | Keep-alive model monitor interval. | `1` second |
| `LLM_ROUTER_PROVIDER_MONITOR_INTERVAL_SECONDS` | Model provider health-check interval (in seconds). | `5` seconds |
| `LLM_ROUTER_UTILS_PLUGINS_PIPELINE` | Utils (plugins) pipeline. | `[]` |
| `LLM_ROUTER_LANGCHAIN_RAG_COLLECTION` | Name of the FAISS collection used by the LangChain RAG plugin. If unset, the collection is `None` and the plugin raises an error when instantiated. | `None` |
| `LLM_ROUTER_LANGCHAIN_RAG_EMBEDDER` | Hugging Face model identifier or local path of the sentence-embedding model. If unset, it is `None` and causes an error on plugin creation. | `None` |
| `LLM_ROUTER_LANGCHAIN_RAG_DEVICE` | Torch device on which the embedding model runs (`cpu`, `cuda:0`, etc.). | `cpu` |
| `LLM_ROUTER_LANGCHAIN_RAG_CHUNK_SIZE` | Number of tokens per chunk when splitting texts. | `400` tokens |
| `LLM_ROUTER_LANGCHAIN_RAG_CHUNK_OVERLAP` | Number of overlapping tokens between consecutive chunks. | `100` tokens |
| `LLM_ROUTER_LLANGCHAIN_RAG_PERSIST_DIR` | Directory under which the FAISS index is stored (if unset, the index is not persisted). | `None` |

When either required variable (LLM_ROUTER_LANGCHAIN_RAG_COLLECTION or LLM_ROUTER_LANGCHAIN_RAG_EMBEDDER) is missing, the RAG functionality is effectively disabled, and any attempt to use the LangchainRAGPlugin raises an exception. The *optional* variables (LLM_ROUTER_LANGCHAIN_RAG_DEVICE, LLM_ROUTER_LANGCHAIN_RAG_CHUNK_SIZE, LLM_ROUTER_LANGCHAIN_RAG_CHUNK_OVERLAP) fall back to the defaults shown above.
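
Purely as an illustration of that fallback behaviour (this is not the library's actual configuration loader):

```python
import os

# Illustrative sketch only -- not the library's actual config loader.
# Required RAG settings: missing values leave the plugin unusable.
rag_collection = os.getenv("LLM_ROUTER_LANGCHAIN_RAG_COLLECTION")  # None if unset
rag_embedder = os.getenv("LLM_ROUTER_LANGCHAIN_RAG_EMBEDDER")      # None if unset

# Optional RAG settings fall back to the documented defaults.
rag_device = os.getenv("LLM_ROUTER_LANGCHAIN_RAG_DEVICE", "cpu")
chunk_size = int(os.getenv("LLM_ROUTER_LANGCHAIN_RAG_CHUNK_SIZE", "400"))
chunk_overlap = int(os.getenv("LLM_ROUTER_LANGCHAIN_RAG_CHUNK_OVERLAP", "100"))

if rag_collection is None or rag_embedder is None:
    raise RuntimeError("LangchainRAGPlugin cannot be instantiated without "
                       "a collection name and an embedder model.")
```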

Model Configuration

models-config.json follows the schema:

```json
{
  "active_models": {
    "openai_models": [
      "gpt-4",
      "gpt-3.5-turbo"
    ],
    "ollama_models": [
      "llama2"
    ]
  },
  "openai_models": {
    "gpt-4": {
      "providers": [
        {
          "id": "openai-gpt4-1",
          "api_host": "https://api.openai.com/v1",
          "api_token": "sk-...",
          "api_type": "openai",
          "input_size": 8192,
          "model_path": ""
        }
      ]
    }
  },
  ...
}
```

Only the fields the router actually reads are needed: `id`, `api_host`, `api_type`, and `input_size` are required, while `api_token` and `model_path` are optional.
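
As an illustration of that minimal schema, here is a hedged sketch of a loader that checks the required provider fields. The router ships its own loader, so the function below is purely illustrative:

```python
import json

def load_models_config(path: str = "resources/configs/models-config.json") -> dict:
    """Illustrative loader: verifies that every active model has providers
    carrying the fields the router needs."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)

    required = {"id", "api_host", "api_type", "input_size"}
    for family, model_names in config["active_models"].items():
        for name in model_names:
            providers = config.get(family, {}).get(name, {}).get("providers", [])
            for provider in providers:
                missing = required - provider.keys()
                if missing:
                    raise ValueError(f"{name}: provider missing {sorted(missing)}")
    return config
```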

Configuration Details – see the full schema and a ready‑made example in MODELS_CONFIG.md.


Running the Server

The entry point is llm_router_api.rest_api. Choose a server backend via the LLM_ROUTER_SERVER_TYPE variable or command‑line flags.

```bash
# Using the built-in Flask development server (default)
python -m llm_router_api.rest_api

# Production-grade with Gunicorn (streaming supported)
python -m llm_router_api.rest_api --gunicorn

# Windows-friendly Waitress server
python -m llm_router_api.rest_api --waitress
```

The server starts on the host/port defined by LLM_ROUTER_SERVER_HOST and LLM_ROUTER_SERVER_PORT (default localhost:8080).

Note: to run the service in proxy-only mode, launch it with LLM_ROUTER_MINIMUM=1 (or any other truthy value).


REST API Overview

All routes are prefixed by LLM_ROUTER_EP_PREFIX (default /api). The list of endpoints (categorized into built-in, provider-dependent, and extended endpoints) and a description of the streaming mechanisms can be found in the endpoints overview documentation.
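
As a hedged example of what a client call might look like (the /ping and /chat routes and the payload shape below are assumptions for illustration; the authoritative list is in the endpoints overview):

```python
import requests

BASE_URL = "http://localhost:8080/api"  # LLM_ROUTER_EP_PREFIX defaults to /api

# Liveness probe via the built-in ping endpoint (path assumed for illustration).
print(requests.get(f"{BASE_URL}/ping", timeout=5).status_code)

# A chat-style request; the exact route and payload schema are described
# in the endpoints overview.
response = requests.post(
    f"{BASE_URL}/chat",
    json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=300,
)
print(response.json())
```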


Load‑Balancing Strategies

The router selects a provider for a given model request using the ProviderChooser. The strategy can be chosen via the LLM_ROUTER_BALANCE_STRATEGY variable.

The current list of available strategies, the interface description, and an example extension can be found in the load-balancing strategies documentation.
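
To illustrate the idea behind, for example, the weighted strategy, here is a self-contained sketch. It is not the actual ProviderChooser implementation, and the `weight` field is an assumption:

```python
import random

def choose_weighted(providers: list[dict]) -> dict:
    """Pick a provider with probability proportional to its weight.
    Sketch only; the real strategies live behind the ProviderChooser
    interface described in the linked documentation."""
    weights = [p.get("weight", 1.0) for p in providers]
    return random.choices(providers, weights=weights, k=1)[0]

providers = [
    {"id": "openai-gpt4-1", "weight": 3.0},
    {"id": "openai-gpt4-2", "weight": 1.0},
]
print(choose_weighted(providers)["id"])  # "openai-gpt4-1" roughly 75% of the time
```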


Keep‑Alive Mechanism

The keep‑alive subsystem periodically pings model endpoints to keep them warm, reducing latency for the first request after idle periods. Configuration is driven by the keep_alive field in the provider definition (see KEEPALIVE.md). Strategies that select providers can register usage with the KeepAliveMonitor, which handles scheduling and background execution.

For details on how to enable and configure keep‑alive, refer to the dedicated documentation: Keep‑Alive Overview
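
As a rough sketch of the warming idea only (this is not KeepAliveMonitor's actual API; the /health path and the timer-based scheduling below are assumptions):

```python
import threading
import requests

def keep_warm(api_host: str, interval_s: float = 60.0) -> threading.Timer:
    """Illustrative keep-alive loop: periodically ping a provider so the
    model stays loaded. Real scheduling is handled by KeepAliveMonitor."""
    def _ping():
        try:
            requests.get(f"{api_host}/health", timeout=5)  # hypothetical path
        except requests.RequestException:
            pass  # a failed ping is fine; the health monitor tracks outages
        keep_warm(api_host, interval_s)  # reschedule the next ping

    timer = threading.Timer(interval_s, _ping)
    timer.daemon = True
    timer.start()
    return timer
```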


Extending the Router

Adding a New Provider Type

  1. Implement ApiTypesI
Create a class (e.g., MyProviderType) that implements the abstract methods chat_ep, chat_method, completions_ep, and completions_method (see the sketch after this list).
  2. Register in Dispatcher
    Add the class to ApiTypesDispatcher._REGISTRY with a lowercase key.
  3. Update Constants (optional)
    If you need a new balance strategy, extend BalanceStrategies in constants_base.py.
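
A minimal sketch of steps 1 and 2, assuming the abstract methods return the back-end's route paths and HTTP verbs; the import path is a guess and must be adjusted to the actual module layout:

```python
# Sketch only: import path and method signatures are assumptions based on
# the names mentioned above.
from llm_router_api.api_types import ApiTypesI, ApiTypesDispatcher

class MyProviderType(ApiTypesI):
    """Maps generic chat/completions calls onto a custom back-end."""

    def chat_ep(self) -> str:
        return "/v1/chat/completions"  # route exposed by the back-end

    def chat_method(self) -> str:
        return "POST"

    def completions_ep(self) -> str:
        return "/v1/completions"

    def completions_method(self) -> str:
        return "POST"

# Step 2: register under a lowercase key so api_type "myprovider" resolves here.
ApiTypesDispatcher._REGISTRY["myprovider"] = MyProviderType
```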

Adding a New Endpoint

  1. Choose a base class:
    • EndpointWithHttpRequestI for full proxy behaviour (default).
    • PassthroughI if you only need to forward the request unchanged.
    • Directly subclass EndpointI for non‑proxy use cases.
  2. Define REQUIRED_ARGS, OPTIONAL_ARGS, and optionally SYSTEM_PROMPT_NAME.
Implement prepare_payload(self, params) – convert incoming parameters to the payload expected by the downstream model (see the sketch after this list).
  4. (Optional) Set self._prepare_response_function to post‑process the model response.
  5. The endpoint will be auto‑discovered by EndpointAutoLoader at startup.
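
A minimal sketch under the same caveat (the import path, hook signature, and payload shape are assumptions based on the names above):

```python
# Sketch only: base-class import path and hook signatures are assumptions.
from llm_router_api.endpoints import EndpointWithHttpRequestI

class SummarizeEndpoint(EndpointWithHttpRequestI):
    REQUIRED_ARGS = ["text"]
    OPTIONAL_ARGS = ["max_words"]
    SYSTEM_PROMPT_NAME = "summarize"  # resolved from LLM_ROUTER_PROMPTS_DIR

    def prepare_payload(self, params):
        # Convert incoming parameters into the downstream chat payload.
        return {
            "messages": [{"role": "user", "content": params["text"]}],
            "max_tokens": int(params.get("max_words", 200)) * 2,
        }
```

Because of auto-discovery, no manual registration is needed; placing the class where EndpointAutoLoader scans is enough.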

Prompt Files

Prompt files live under the directory configured by LLM_ROUTER_PROMPTS_DIR.
File naming convention: <category>/system/<lang>/<prompt-id>.
Placeholders such as ##PLACEHOLDER## can be replaced via self._map_prompt in the endpoint implementation.
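
For illustration only, placeholder substitution boils down to a string replacement; the category and prompt id below are hypothetical, and in real endpoints this is handled by self._map_prompt:

```python
from pathlib import Path

# Hypothetical prompt location mirroring <category>/system/<lang>/<prompt-id>.
prompts_dir = Path("resources/prompts")
prompt = (prompts_dir / "chat" / "system" / "pl" / "summarize").read_text(encoding="utf-8")

# Replace a ##PLACEHOLDER## marker with a request-specific value.
prompt = prompt.replace("##PLACEHOLDER##", "three sentences")
```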


Monitoring & Metrics

When LLM_ROUTER_USE_PROMETHEUS=1 (or true) the router automatically:

  • Exposes a /metrics endpoint (Prometheus format).
  • Tracks request counts, latency histograms, in‑progress gauges, and error counters.

You can scrape this endpoint with a Prometheus server or query it manually.
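
For a quick manual check outside Prometheus, the endpoint can be fetched like any text resource (host, port, and the unprefixed /metrics path follow the defaults described above):

```python
import requests

metrics = requests.get("http://localhost:8080/metrics", timeout=5).text
# Skip the "# HELP"/"# TYPE" comment lines and print the raw series
# (request counters, latency histograms, error counters, ...).
for line in metrics.splitlines():
    if not line.startswith("#"):
        print(line)
```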


License

llm-router-api is released under the Apache License 2.0. See the LICENSE file in the repository for full terms.