llm‑router‑api is a lightweight Python library that provides a flexible, extensible proxy for Large Language Model ( LLM) back‑ends. It abstracts the details of multiple model providers (OpenAI‑compatible, Ollama, vLLM, LM Studio, etc.) and offers a unified REST interface with built‑in load‑balancing, health‑checking, and monitoring.
Repository: https://github.com/radlab-dev-group/llm-router
- Unified API – One REST surface (
/api/...) that proxies calls to any supported LLM back‑end. - Provider Selection – Choose a provider per request using pluggable strategies (balanced, weighted, adaptive, first‑available).
- Prompt Management – System prompts are stored as files and can be dynamically injected with placeholder substitution.
- Streaming Support – Transparent streaming for both OpenAI‑compatible and Ollama endpoints.
- Health Checks – Built‑in ping endpoint and Redis‑based provider health monitoring.
- Prometheus Metrics – Optional instrumentation for request counts, latencies, and error rates.
- Auto‑Discovery – Endpoints are automatically discovered and instantiated at startup.
- Extensible – Add new providers, strategies, or custom endpoints with minimal boilerplate.
The project uses Python 3.10.6 and a virtualenv‑based workflow.
# Clone the repository
git clone https://github.com/radlab-dev-group/llm-router.git
cd llm-router
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate
# Install the package (including optional extras)
pip install -e .[metrics] # installs Prometheus supportAll required third‑party libraries are listed in requirements.txt (e.g., Flask, requests, redis, rdl‑ml‑utils, etc.).
Configuration is driven primarily by environment variables and a JSON model‑config file.
| Variable | Description | Default |
|---|---|---|
LLM_ROUTER_PROMPTS_DIR |
Directory containing predefined system prompts. | resources/prompts |
LLM_ROUTER_MODELS_CONFIG |
Path to the models configuration JSON file. | resources/configs/models-config.json |
LLM_ROUTER_DEFAULT_EP_LANGUAGE |
Default language for endpoint prompts. | pl |
LLM_ROUTER_TIMEOUT |
Timeout (seconds) for llm-router API calls. | 0 |
LLM_ROUTER_EXTERNAL_TIMEOUT |
Timeout (seconds) for external model API calls. | 300 |
LLM_ROUTER_LOG_FILENAME |
Name of the log file. | llm-router.log |
LLM_ROUTER_LOG_LEVEL |
Logging level (e.g., INFO, DEBUG). | INFO |
LLM_ROUTER_EP_PREFIX |
Prefix for all API endpoints. | /api |
LLM_ROUTER_MINIMUM |
Run service in proxy‑only mode (boolean). | False |
LLM_ROUTER_IN_DEBUG |
Run server in debug mode (boolean). | False |
LLM_ROUTER_BALANCE_STRATEGY |
Strategy used to balance routing between LLM providers. Allowed values are balanced, weighted, dynamic_weighted (beta), first_available and first_available_optim as defined in constants_base.py. |
balanced |
LLM_ROUTER_REDIS_HOST |
Redis host for load‑balancing when a multi‑provider model is available. | <empty string> |
LLM_ROUTER_REDIS_PORT |
Redis port for load‑balancing when a multi‑provider model is available. | 6379 |
LLM_ROUTER_REDIS_PASSWORD |
Password for Redis connection. | <not set> |
LLM_ROUTER_REDIS_DB |
Redis database number. | 0 |
LLM_ROUTER_SERVER_TYPE |
Server implementation to use (flask, gunicorn, waitress). |
flask |
LLM_ROUTER_SERVER_PORT |
Port on which the server listens. | 8080 |
LLM_ROUTER_SERVER_HOST |
Host address for the server. | localhost |
LLM_ROUTER_SERVER_WORKERS_COUNT |
Number of workers (used in case when the selected server type supports multiworkers) | 2 |
LLM_ROUTER_SERVER_THREADS_COUNT |
Number of workers threads (used in case when the selected server type supports multithreading) | 8 |
LLM_ROUTER_SERVER_WORKER_CLASS |
If server accepts workers type, its able to set worker class by this environment. | None |
LLM_ROUTER_USE_PROMETHEUS |
Enable Prometheus metrics collection.** When set to True, the router registers a /metrics endpoint exposing Prometheus‑compatible metrics for monitoring. |
False |
LLM_ROUTER_FORCE_MASKING |
Enable force-masking payload of each endpoint. Each key and value is masked before sending to model provider. | False |
LLM_ROUTER_MASKING_WITH_AUDIT |
When enabled, each masking operation is recorded in an audit log. This helps with compliance and traceability by providing a tamper‑evident record of what data was masked and when. | False |
LLM_ROUTER_MASKING_STRATEGY_PIPELINE |
Defines the ordered list of masking strategies that will be applied to the request payload. For example, ['fast_masker', 'my_new_masker_strategy'] runs the fast masker first, then the my_new_masker_strategy masker. This allows flexible, composable masking flows. |
['fast_masker'] |
LLM_ROUTER_FORCE_GUARDRAIL_REQUEST |
Force guardrail evaluation on every request. | False |
LLM_ROUTER_GUARDRAIL_WITH_AUDIT_REQUEST |
Audits all guardrail decisions. | False |
GUARDRAIL_STRATEGY_PIPELINE_REQUEST |
Ordered list of guardrail strategies. | - |
LLM_ROUTER_FORCE_GUARDRAIL_RESPONSE |
Force guardrail evaluation on every response before user receive the result. | False |
LLM_ROUTER_GUARDRAIL_WITH_AUDIT_RESPONSE |
Audits all guardrail decisions (response). | False |
LLM_ROUTER_GUARDRAIL_STRATEGY_PIPELINE_RESPONSE |
Ordered list of guardrail strategies (response). | - |
LLM_ROUTER_GUARDRAIL_NASK_GUARD_HOST |
Host and port where the naskguard service running, NOTE! Read the plugin license before use the proposed model! |
- |
LLM_ROUTER_GUARDRAIL_SOJKA_GUARD_HOST |
Host and port where the sojkaguard service running. |
- |
LLM_ROUTER_SERVICES_MONITOR_INTERVAL_SECONDS |
Time interval to check services availability. Values lower than 1 will be treated as not-use monitor and monitor will not be started. |
5 seconds |
LLM_ROUTER_KEEPALIVE_MODEL_MONITOR_INTERVAL_SECONDS |
Keep alive model monitor time interval. | 1 second |
LLM_ROUTER_PROVIDER_MONITOR_INTERVAL_SECONDS |
Models provider health check interval (in seconds). | 5 seconds |
LLM_ROUTER_UTILS_PLUGINS_PIPELINE |
Utils (plugins) pipeline | [] |
LLM_ROUTER_LANGCHAIN_RAG_COLLECTION |
Name of the FAISS collection used by the LangChain RAG plugin. If unset, the collection will be None and the plugin will raise an error when instantiated. |
None |
LLM_ROUTER_LANGCHAIN_RAG_EMBEDDER |
Hugging Face model identifier or local path for the sentence‑embedding model. If unset, it will be None and cause an error on plugin creation. |
None |
LLM_ROUTER_LANGCHAIN_RAG_DEVICE |
Torch device on which the embedding model runs (cpu, cuda:0, etc.). |
"cpu" |
LLM_ROUTER_LANGCHAIN_RAG_CHUNK_SIZE |
Number of tokens per chunk when splitting texts. | 400 tokens |
LLM_ROUTER_LANGCHAIN_RAG_CHUNK_OVERLAP |
Number of overlapping tokens between consecutive chunks. | 100 tokens |
LLM_ROUTER_LLANGCHAIN_RAG_PERSIST_DIR |
Store the FAISS index under the given directory (if set, if not set then index will not be stored). | None |
When any required variable (LANGCHAIN_RAG_COLLECTION or LANGCHAIN_RAG_EMBEDDER) is missing, the RAG
functionality is effectively disabled and attempts to use the LangchainRAGPlugin will raise an exception. The *
optional* variables (LANGCHAIN_RAG_DEVICE, LANGCHAIN_RAG_CHUNK_SIZE, LANGCHAIN_RAG_CHUNK_OVERLAP) fall back to
the defaults shown above.
models-config.json follows the schema:
{
"active_models": {
"openai_models": [
"gpt-4",
"gpt-3.5-turbo"
],
"ollama_models": [
"llama2"
]
},
"openai_models": {
"gpt-4": {
"providers": [
{
"id": "openai-gpt4-1",
"api_host": "https://api.openai.com/v1",
"api_token": "sk-...",
"api_type": "openai",
"input_size": 8192,
"model_path": ""
}
]
}
},
...
}Only the fields required by the router are needed: id, api_host, api_token (optional), api_type, input_size,
and optionally model_path.
Configuration Details – see the full schema and a ready‑made example in MODELS_CONFIG.md.
The entry point is llm_router_api.rest_api. Choose a server backend via the LLM_ROUTER_SERVER_TYPE variable or
command‑line flags.
# Using the built‑in Flask development server (default)
python -m llm_router_api.rest_api
# Production‑grade with Gunicorn (streaming supported)
python -m llm_router_api.rest_api --gunicorn
# Windows‑friendly Waitress server
python -m llm_router_api.rest_api --waitressThe server starts on the host/port defined by LLM_ROUTER_SERVER_HOST and LLM_ROUTER_SERVER_PORT (default
0.0.0.0:8080).
Note: The service must be launched with LLM_ROUTER_MINIMUM=1 (or any truthy value) because it operates in
“proxy‑only” mode.
All routes are prefixed by LLM_ROUTER_EP_PREFIX (default /api).
The list of endpoints—categorized into built‑in, provider‑dependent, and extended endpoints—and
a description of the streaming mechanisms can be found at the link:
load endpoints overview
The router selects a provider for a given model request using the ProviderChooser. The strategy can be chosen via
the LLM_ROUTER_BALANCE_STRATEGY variable.
The current list of available strategies, the interface description, and an example extension can be found at the link load balancing strategies
The keep‑alive subsystem periodically pings model endpoints to keep them warm, reducing latency for the first request
after idle periods. Configuration is driven by the keep_alive field in the provider definition
(see KEEPALIVE.md). Strategies that select providers can register usage with the KeepAliveMonitor,
which handles scheduling and background execution.
For details on how to enable and configure keep‑alive, refer to the dedicated documentation: Keep‑Alive Overview
- Implement
ApiTypesI
Create a class (e.g.,MyProviderType) that implements the abstract methodschat_ep,chat_method,completions_ep, andcompletions_method. - Register in Dispatcher
Add the class toApiTypesDispatcher._REGISTRYwith a lowercase key. - Update Constants (optional)
If you need a new balance strategy, extendBalanceStrategiesinconstants_base.py.
- Choose a base class:
EndpointWithHttpRequestIfor full proxy behaviour (default).PassthroughIif you only need to forward the request unchanged.- Directly subclass
EndpointIfor non‑proxy use cases.
- Define
REQUIRED_ARGS,OPTIONAL_ARGS, and optionallySYSTEM_PROMPT_NAME. - Implement
prepare_payload(self, params)– convert incoming parameters to the payload expected by the downstream model. - (Optional) Set
self._prepare_response_functionto post‑process the model response. - The endpoint will be auto‑discovered by
EndpointAutoLoaderat startup.
Prompt files live under the directory configured by LLM_ROUTER_PROMPTS_DIR.
File naming convention: <category>/system/<lang>/<prompt-id>.
Placeholders such as ##PLACEHOLDER## can be replaced via self._map_prompt in the endpoint implementation.
When LLM_ROUTER_USE_PROMETHEUS=1 (or true) the router automatically:
- Exposes a
/metricsendpoint (Prometheus format). - Tracks request counts, latency histograms, in‑progress gauges, and error counters.
You can scrape this endpoint with a Prometheus server or query it manually.
llm-router-api is released under the Apache 2.0. See the LICENSE file in the repository for full terms.