Olla is a high-performance proxy and load balancer for LLM infrastructure, written in Go. It intelligently routes requests across local and remote inference nodes (Ollama, LM Studio, LiteLLM, vLLM, SGLang, Llamacpp, Lemonade, Anthropic, and OpenAI-compatible endpoints).
The project provides two proxy engines: Sherpa (simple, maintainable) and Olla (high-performance with advanced features).
Full documentation available at: https://thushan.github.io/olla/
## Commands
```bash
make ready           # Run before commit (test-short + test-race + fmt + lint + align)
make ready-tools     # Check code with tools only (fmt + lint + align)
make test            # Run all tests
make test-race       # Run tests with race detection
make test-stress     # Run comprehensive stress tests
make bench           # Run all benchmarks
make bench-balancer  # Run balancer benchmarks
make build           # Build optimised binary with version info
make build-local     # Build binary to ./build/ (fast, for testing)
```
Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, with a [wide variety](https://thushan.github.io/olla/integrations/overview/) of natively supported endpoints, and it is extensible enough to support others. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
Olla works alongside API gateways like [LiteLLM](https://github.com/BerriAI/litellm) or orchestration platforms like [GPUStack](https://github.com/gpustack/gpustack), focusing on making your **existing** LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: **Sherpa** for simplicity and maintainability or **Olla** for maximum performance with advanced features like circuit breakers and connection pooling.
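A hedged sketch of how the engine choice might appear in configuration — the exact key name is an assumption for illustration, so check the documentation for the real setting:

```yaml
# Illustrative sketch only: the exact key may differ in the real schema.
proxy:
  engine: sherpa   # simple and maintainable; "olla" for the high-performance
                   # engine with circuit breakers and connection pooling
```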