Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes - including [Ollama](https://github.com/ollama/ollama), [LM Studio](https://lmstudio.ai/) and OpenAI-compatible endpoints like [vLLM](https://github.com/vllm-project/vllm). Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
Unlike API gateways like [LiteLLM](https://github.com/BerriAI/litellm) or orchestration platforms like [GPUStack](https://github.com/gpustack/gpustack), Olla focuses on making your **existing** LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: **Sherpa** for simplicity and maintainability or **Olla** for maximum performance with advanced features like circuit breakers and connection pooling.
A single CLI application and config file is all you need to get going with Olla!
In the above example, we configure [Jetbrains Junie](https://www.jetbrains.com/junie/) to use Olla for its Ollama and LM Studio endpoints for local AI inference (see [how to configure Jetbrains Junie](https://thushan.github.io/olla/usage/#development-tools-junie)).
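To give a feel for the "single config file" approach, a minimal configuration declaring two local endpoints might look roughly like this. This is an illustrative sketch only: the key names below are assumptions, not Olla's documented schema — consult the [Olla documentation](https://thushan.github.io/olla/) for the actual configuration reference.

```yaml
# Hypothetical sketch - key names are illustrative assumptions.
proxy:
  engine: sherpa            # or "olla" for the high-performance engine
endpoints:
  - name: local-ollama
    type: ollama
    url: http://localhost:11434   # Ollama's default port
  - name: lm-studio
    type: lm-studio
    url: http://localhost:1234    # LM Studio's default port
```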
## Key Features
- **🎯 LLM-Optimised**: Streaming-first design with optimised timeouts for long inference
- **⚙️ High Performance**: Designed to be very [lightweight & efficient](https://thushan.github.io/olla/configuration/practices/performance/), running in less than 50 MB of RAM
| Tool | Purpose | Works with Olla |
|------|---------|-----------------|
| **[LocalAI](https://github.com/mudler/LocalAI)** | OpenAI-compatible local API | ✅ Load balance multiple instances |
| **[Ollama](https://github.com/ollama/ollama)** | Local model serving | ✅ Primary use case |

See our [detailed comparisons](https://thushan.github.io/olla/compare/overview/) and [integration patterns](https://thushan.github.io/olla/compare/integration-patterns/) for more.
### Supported Backends
Olla natively supports the following backend providers. Learn more about [Olla Integrations](https://thushan.github.io/olla/integrations/overview/).
You can learn more about [OpenWebUI Ollama with Olla](https://thushan.github.io/olla/integrations/frontend/openwebui/).
### Common Architectures
- **Home Lab**: Olla → Multiple Ollama instances across your machines
- **Hybrid Cloud**: Olla → Local endpoints + LiteLLM → Cloud APIs
- **[API Reference](https://thushan.github.io/olla/api-reference/overview/)** - Full API documentation
- **[Concepts](https://thushan.github.io/olla/concepts/overview/)** - Core concepts and architecture
- **[Comparisons](https://thushan.github.io/olla/compare/overview/)** - Compare with LiteLLM, GPUStack, LocalAI
- **[Integrations](https://thushan.github.io/olla/integrations/overview/)** - Frontend and backend integrations
- **[Development](https://thushan.github.io/olla/development/overview/)** - Contributing and development guide
**Q: Why use Olla instead of nginx or HAProxy?** \
A: Olla understands LLM-specific patterns like model routing, streaming responses, and health semantics. It also provides built-in model discovery and LLM-optimised timeouts.
**Q: How does Olla compare to LiteLLM?** \
A: [LiteLLM](https://github.com/BerriAI/litellm) is an API translation layer for cloud providers (OpenAI, Anthropic, etc.), while Olla is an infrastructure proxy for self-hosted endpoints. They work great together - use LiteLLM for cloud APIs and Olla for local infrastructure reliability. See our [detailed comparison](https://thushan.github.io/olla/compare/litellm/).
**Q: Can Olla manage GPU clusters like GPUStack?** \
A: No, Olla doesn't deploy or orchestrate models. For GPU cluster management, use [GPUStack](https://github.com/gpustack/gpustack). Olla can then provide routing and failover for your GPUStack-managed endpoints. See our [comparison guide](https://thushan.github.io/olla/compare/gpustack/).
**Q: Can I use Olla with other LLM providers?** \
A: Yes! Any OpenAI-compatible API works. Configure them as `type: "openai-compatible"` endpoints (such as LiteLLM, [LocalAI](https://github.com/mudler/LocalAI), Together AI, etc.). See [integration patterns](https://thushan.github.io/olla/compare/integration-patterns/).
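As a rough illustration, a vLLM server (which speaks the OpenAI protocol) could be declared as such an endpoint along these lines. Only the `type: "openai-compatible"` value is confirmed by this FAQ; the other key names and the port are illustrative assumptions — see the [Olla documentation](https://thushan.github.io/olla/) for the real schema.

```yaml
# Hypothetical sketch - only "openai-compatible" is confirmed by the FAQ above.
endpoints:
  - name: vllm-server
    type: "openai-compatible"
    url: http://localhost:8000   # vLLM's default serving port
```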
**Q: Does Olla support authentication?** \
A: Olla focuses on load balancing and lets your reverse proxy handle authentication. This follows the Unix philosophy of doing one thing well.