
Commit df959f0

Merge pull request #78 from thushan/prepare/v0.0.20
prepare: v0.0.20
2 parents f424551 + 39f4e02 commit df959f0

File tree

5 files changed: +161 −82 lines

CLAUDE.md

Lines changed: 129 additions & 63 deletions
@@ -1,16 +1,27 @@
 # CLAUDE.md
 
 ## Overview
-Olla is a high-performance proxy and load balancer for LLM infrastructure, written in Go. It intelligently routes requests across local and remote inference nodes (Ollama, LM Studio, LiteLLM, vLLM, OpenAI-compatible endpoints).
+Olla is a high-performance proxy and load balancer for LLM infrastructure, written in Go. It intelligently routes requests across local and remote inference nodes (Ollama, LM Studio, LiteLLM, vLLM, SGLang, Llamacpp, Lemonade, Anthropic, and OpenAI-compatible endpoints).
 
 The project provides two proxy engines: Sherpa (simple, maintainable) and Olla (high-performance with advanced features).
 
+Full documentation available at: https://thushan.github.io/olla/
+
 ## Commands
 ```bash
-make ready # Run before commit (test + lint + fmt)
-make dev # Development mode (auto-reload)
-make test # Run all tests
-make bench # Run benchmarks
+make ready # Run before commit (test-short + test-race + fmt + lint + align)
+make ready-tools # Check code with tools only (fmt + lint + align)
+make test # Run all tests
+make test-race # Run tests with race detection
+make test-stress # Run comprehensive stress tests
+make bench # Run all benchmarks
+make bench-balancer # Run balancer benchmarks
+make build # Build optimised binary with version info
+make build-local # Build binary to ./build/ (fast, for testing)
+make run # Run with version info
+make run-debug # Run with debug logging
+make ci # Run full CI pipeline locally
+make help # Show all available targets
 ```
 
 ## Project Structure
@@ -21,57 +32,71 @@ olla/
 ├── config/
 │   ├── profiles/ # Provider-specific profiles
 │   │   ├── ollama.yaml # Ollama configuration
+│   │   ├── llamacpp.yaml # llama.cpp configuration
 │   │   ├── lmstudio.yaml # LM Studio configuration
+│   │   ├── lemonade.yaml # Lemonade SDK configuration
 │   │   ├── litellm.yaml # LiteLLM gateway configuration
-│   │   ├── openai.yaml # OpenAI-compatible configuration
-│   │   └── vllm.yaml # vLLM configuration
-│   └── models.yaml # Model configurations
+│   │   ├── vllm.yaml # vLLM configuration
+│   │   ├── sglang.yaml # SGLang configuration
+│   │   ├── anthropic.yaml # Anthropic Claude API configuration
+│   │   └── openai.yaml # OpenAI-compatible generic profile
+│   ├── models.yaml # Model configurations
+│   └── config.local.yaml # Local configuration overrides (user, not committed to git)
 ├── internal/
 │   ├── core/ # Domain layer (business logic)
 │   │   ├── domain/ # Core entities
-│   │   │   ├── endpoint.go # Endpoint management
-│   │   │   ├── model.go # Model registry
-│   │   │   ├── unified_model.go # Unified model format
-│   │   │   └── routing.go # Request routing logic
 │   │   ├── ports/ # Interface definitions
 │   │   └── constants/ # Application constants
 │   ├── adapter/ # Infrastructure layer
 │   │   ├── balancer/ # Load balancing strategies
-│   │   │   ├── priority.go # Priority-based selection
-│   │   │   ├── round_robin.go # Round-robin selection
-│   │   │   └── least_connections.go # Least connections selection
+│   │   ├── converter/ # Model format converters
+│   │   ├── discovery/ # Service discovery
+│   │   ├── factory/ # Factory patterns
+│   │   ├── filter/ # Request/response filtering
+│   │   ├── health/ # Health checking & circuit breakers
+│   │   ├── inspector/ # Request inspection
+│   │   ├── metrics/ # Metrics collection
 │   │   ├── proxy/ # Proxy implementations
 │   │   │   ├── sherpa/ # Simple, maintainable proxy
 │   │   │   ├── olla/ # High-performance proxy
 │   │   │   └── core/ # Shared proxy components
-│   │   ├── health/ # Health checking
-│   │   │   ├── checker.go # Health check coordinator
-│   │   │   └── circuit_breaker.go # Circuit breaker implementation
-│   │   ├── discovery/ # Service discovery
-│   │   │   └── service.go # Model discovery service
 │   │   ├── registry/ # Model & profile registries
-│   │   │   ├── profile/ # Provider profiles
-│   │   │   └── unified_memory_registry.go # Unified model registry
-│   │   ├── unifier/ # Model unification
-│   │   ├── converter/ # Model format converters
-│   │   ├── inspector/ # Request inspection
-│   │   ├── security/ # Security features
-│   │   │   ├── request_rate_limit.go # Rate limiting
-│   │   │   └── request_size_limit.go # Size limiting
-│   │   └── stats/ # Statistics collection
-│   │       ├── collector.go # Main stats collector
-│   │       └── model_collector.go # Model-specific stats
-│   └── app/ # Application layer
-│       ├── app.go # Service manager
-│       └── handlers/ # HTTP handlers
-│           ├── handler_proxy.go # Main proxy handler
-│           ├── handler_status.go # Status endpoints
-│           └── handler_health.go # Health endpoints
+│   │   ├── security/ # Security features (rate/size limits)
+│   │   ├── stats/ # Statistics collection
+│   │   ├── translator/ # API translation layer (OpenAI ↔ Provider)
+│   │   └── unifier/ # Model unification
+│   ├── app/ # Application layer
+│   │   ├── handlers/ # HTTP handlers
+│   │   │   ├── server.go # HTTP server setup
+│   │   │   ├── server_routes.go # Route registration
+│   │   │   ├── handler_proxy.go # Main proxy handler
+│   │   │   ├── handler_provider_*.go # Provider-specific handlers
+│   │   │   ├── handler_translation.go # Translation handler
+│   │   │   ├── handler_status*.go # Status endpoints
+│   │   │   ├── handler_health.go # Health endpoints
+│   │   │   └── handler_version.go # Version information
+│   │   ├── middleware/ # HTTP middleware
+│   │   └── services/ # Application services
+│   ├── config/ # Configuration management
+│   ├── env/ # Environment handling
+│   ├── integration/ # Integration tests
+│   ├── logger/ # Logging framework
+│   ├── router/ # Routing logic
+│   ├── util/ # Utilities
+│   └── version/ # Version management
 ├── pkg/ # Reusable packages
+│   ├── container/ # Dependency injection
+│   ├── eventbus/ # Event bus (pub/sub)
+│   ├── format/ # Formatting utilities
+│   ├── nerdstats/ # Process statistics
 │   ├── pool/ # Object pooling
-│   └── nerdstats/ # Process statistics
+│   └── profiler/ # Profiling support
 └── test/
     └── scripts/ # Test scripts
+        ├── anthropic/ # Anthropic API tests
+        ├── cases/ # Test cases
+        ├── inspector/ # Inspector tests
+        ├── load/ # Load testing
        ├── logic/ # Logic & routing tests
        ├── security/ # Security tests
        └── streaming/ # Streaming tests
@@ -80,49 +105,90 @@ olla/
 ## Key Files
 - `main.go` - Application entry point
 - `config.yaml` - Main configuration
+- `internal/app/handlers/server_routes.go` - Route registration & API setup
 - `internal/app/handlers/handler_proxy.go` - Request routing logic
-- `internal/adapter/proxy/sherpa/service.go` - Sherpa proxy
-- `internal/adapter/proxy/olla/service.go` - Olla proxy
+- `internal/adapter/proxy/sherpa/service.go` - Sherpa proxy implementation
+- `internal/adapter/proxy/olla/service.go` - Olla proxy implementation
+- `internal/adapter/translator/` - API translation layer (OpenAI ↔ Provider formats)
+- `internal/version/version.go` - Version information embedded at build time
 - `/test/scripts/logic/test-model-routing.sh` - Test routing & headers
 
+## API Endpoints
+
+### Internal Endpoints
+- `/internal/health` - Health check endpoint
+- `/internal/status` - Endpoint status
+- `/internal/status/endpoints` - Endpoints status details
+- `/internal/status/models` - Models status details
+- `/internal/stats/models` - Model statistics
+- `/internal/process` - Process statistics
+- `/version` - Version information
+
+### Unified Model Endpoints
+- `/olla/models` - Unified models listing with filtering
+- `/olla/models/{id}` - Get unified model by ID or alias
+
+### Proxy Endpoints
+- `/olla/proxy/` - Olla API proxy endpoint (POST)
+- `/olla/proxy/v1/models` - OpenAI-compatible models listing (GET)
+
+### Translator Endpoints
+Dynamically registered based on configured translators (e.g., Anthropic Messages API)
+
 ## Response Headers
 - `X-Olla-Endpoint`: Backend name
 - `X-Olla-Model`: Model used
-- `X-Olla-Backend-Type`: ollama/openai/lmstudio/vllm/litellm
+- `X-Olla-Backend-Type`: ollama/openai/lmstudio/vllm/litellm/sglang/llamacpp/lemonade/anthropic
 - `X-Olla-Request-ID`: Request ID
 - `X-Olla-Response-Time`: Total processing time
 
 ## Testing
-- Unit tests: Components in isolation
-- Integration: Full request flow
-- Benchmarks: Performance comparison
-- Always run `make ready` before commit
 
 ### Testing Strategy
-
-1. **Unit Tests**: Test individual components in isolation
-2. **Integration Tests**: Test full request flow through the proxy
-3. **Benchmark Tests**:
-   - Performance of critical paths
-   - Proxy engine comparisons
-   - Connection pooling efficiency
-   - Circuit breaker behavior
-4. **Security Tests**: Validate rate limiting and size restrictions (see `/test/scripts/security/`)
-5. **Shared Proxy Tests**: Common test suite for both proxy engines ensuring compatibility
+1. **Unit Tests**: Components in isolation
+2. **Integration Tests**: Full request flow through proxy engines
+3. **Benchmark Tests**: Performance comparison (balancers, proxy engines, repositories)
+4. **Security Tests**: Rate limiting and size restrictions (see `/test/scripts/security/`)
+5. **Stress Tests**: Comprehensive testing under load
+6. **Script Tests**: End-to-end scenarios in `/test/scripts/`
 
 ### Testing Commands
+```bash
+# Core test commands
+make test # Run all tests
+make test-race # Run with race detection
+make test-stress # Run stress tests
+make test-cover-html # Generate coverage HTML report
 
-```
-# Run proxy engine tests
+# Benchmark commands
+make bench # Run all benchmarks
+make bench-balancer # Run balancer benchmarks
+make bench-repo # Run repository benchmarks
+
+# Specific test patterns
 go test -v ./internal/adapter/proxy -run TestAllProxies
 go test -v ./internal/adapter/proxy -run TestSherpa
 go test -v ./internal/adapter/proxy -run TestOlla
 ```
 
-## Notes
+Always run `make ready` before committing changes.
+
+## Architecture Notes
+
+### Hexagonal Architecture
+- **Domain Layer** (`internal/core`): Business logic, entities, and interfaces
+- **Infrastructure Layer** (`internal/adapter`): Implementations (proxies, balancers, registries)
+- **Application Layer** (`internal/app`): HTTP handlers, middleware, and services
+
+### Key Components
+- **Translator Layer**: Enables API format translation (e.g., OpenAI ↔ Anthropic)
+- **Proxy Engines**: Choose Sherpa (simple) or Olla (high-performance)
+- **Load Balancing**: Priority-based recommended for production
+- **Version Management**: Build-time version injection via `internal/version`
+
+### Development Guidelines
 - Go 1.24+
-- Endpoints: `/internal/health`, `/internal/status`
-- Proxy prefix: `/olla/`
-- Priority balancer recommended for production
-- Australian English for comments and documentation, comment on why rather than what.
-- Use `make ready` before committing changes to ensure code quality
+- Australian English for comments and documentation
+- Comment on **why** rather than **what**
+- Always run `make ready` before committing
+- Use `make help` to see all available commands

config/config.yaml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ proxy:
   # NOTE: From v0.1.0+ we'll switch to Olla as the default proxy engine
   # Sherpa will continue to be maintained and supported for the
   # foreseeable future
-  engine: "sherpa" # Available: sherpa, olla
+  engine: "olla" # Available: sherpa, olla
   # Profile controls proxy engine (http) transport buffer behaviour
   # - "auto": Dynamically selects based on request size, type and other factors (default)
   # - "streaming": No buffering, tokens stream immediately, low latency & low memory usage
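Since this commit flips the shipped default engine to `olla`, a user who wants to stay on Sherpa could override it locally. A hypothetical sketch, assuming the `config.local.yaml` overrides file listed in CLAUDE.md is merged over `config.yaml` (an assumption, not confirmed by this diff):

```yaml
# config.local.yaml — hypothetical local override (not committed to git).
# Assumes local overrides merge over config.yaml.
proxy:
  engine: "sherpa"   # Available: sherpa, olla
```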

docs/content/index.md

Lines changed: 6 additions & 10 deletions
@@ -12,24 +12,20 @@ keywords: llm proxy, ollama proxy, lm studio proxy, vllm proxy, sglang, lemonade
 <a href="https://github.com/thushan/olla/actions/workflows/ci.yml"><img src="https://github.com/thushan/olla/actions/workflows/ci.yml/badge.svg?branch=main" alt="CI"></a>
 <a href="https://goreportcard.com/report/github.com/thushan/olla"><img src="https://goreportcard.com/badge/github.com/thushan/olla" alt="Go Report Card"></a>
 <a href="https://github.com/thushan/olla/releases/latest"><img src="https://img.shields.io/github/release/thushan/olla" alt="Latest Release"></a> <br />
-<a href="https://ollama.com"><img src="https://img.shields.io/badge/Ollama-native-lightgreen.svg" alt="Ollama: Native Support"></a>
-<a href="https://lmstudio.ai/"><img src="https://img.shields.io/badge/LM Studio-native-lightgreen.svg" alt="LM Studio: Native Support"></a>
 <a href="https://github.com/ggerganov/llama.cpp"><img src="https://img.shields.io/badge/llama.cpp-native-lightgreen.svg" alt="llama.cpp: Native Support"></a>
 <a href="https://github.com/vllm-project/vllm"><img src="https://img.shields.io/badge/vLLM-native-lightgreen.svg" alt="vLLM: Native Support"></a>
 <a href="https://github.com/sgl-project/sglang"><img src="https://img.shields.io/badge/SGLang-native-lightgreen.svg" alt="SGLang: Native Support"></a>
 <a href="https://github.com/BerriAI/litellm"><img src="https://img.shields.io/badge/LiteLLM-native-lightgreen.svg" alt="LiteLLM: Native Support"></a>
-<a href="https://github.com/lemonade-sdk/lemonade"><img src="https://img.shields.io/badge/Lemonade-native-lightgreen.svg" alt="Lemonade AI: Native Support"></a>
-<a href="https://github.com/InternLM/lmdeploy"><img src="https://img.shields.io/badge/LM Deploy-openai-lightblue.svg" alt="Lemonade AI: OpenAI Compatible"></a>
+<a href="https://github.com/InternLM/lmdeploy"><img src="https://img.shields.io/badge/LM Deploy-openai-lightblue.svg" alt="LM Deploy: OpenAI Compatible"></a> <br/>
+<a href="https://ollama.com"><img src="https://img.shields.io/badge/Ollama-native-lightgreen.svg" alt="Ollama: Native Support"></a>
+<a href="https://lmstudio.ai/"><img src="https://img.shields.io/badge/LM Studio-native-lightgreen.svg" alt="LM Studio: Native Support"></a>
+<a href="https://github.com/lemonade-sdk/lemonade"><img src="https://img.shields.io/badge/LemonadeSDK-native-lightgreen.svg" alt="LemonadeSDK: Native Support"></a>
 </P>
 </div>
 
-Olla is a high-performance, low-overhead, low-latency proxy, model unifier and load balancer for managing LLM infrastructure.
-
-It intelligently routes LLM requests across local and remote inference nodes - including [LlamaCpp](https://github.com/ggerganov/llama.cpp) backends like [Ollama](https://github.com/ollama/ollama), [LM Studio](https://lmstudio.ai/) or [SGLang](https://github.com/sgl-project/sglang) (with RadixAttention), [vLLM](https://github.com/vllm-project/vllm), [Lemonade SDK](https://lemonade-server.ai) (AMD Ryzen AI), [LiteLLM](https://github.com/BerriAI/litellm) and other OpenAI-compatible endpoints.
-
-Olla provides model discovery and unified model catalogues across all providers, enabling seamless routing to available models on compatible endpoints.
+Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes with a [wide variety](https://thushan.github.io/olla/integrations/overview/) of natively supported endpoints and extensible enough to support others. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
 
-With native [LiteLLM support](integrations/backend/litellm.md), Olla bridges local and cloud infrastructure - use local models when available, automatically failover to cloud APIs when needed. Unlike orchestration platforms like [GPUStack](compare/gpustack.md), Olla focuses on making your existing LLM infrastructure reliable through intelligent routing and failover.
+Olla works alongside API gateways like [LiteLLM](https://github.com/BerriAI/litellm) or orchestration platforms like [GPUStack](https://github.com/gpustack/gpustack), focusing on making your **existing** LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: **Sherpa** for simplicity and maintainability or **Olla** for maximum performance with advanced features like circuit breakers and connection pooling.
 
 ## Key Features
 

docs/content/integrations/api-translation/anthropic.md

Lines changed: 3 additions & 3 deletions
@@ -40,9 +40,9 @@ Olla's Anthropic API Translation enables Claude-compatible clients (Claude Code,
 <th>Supported Clients</th>
 <td>
 <ul>
-<li><a href="../frontend/claude-code.md">Claude Code</a></li>
-<li><a href="../frontend/opencode.md">OpenCode</a></li>
-<li><a href="../frontend/crush-cli.md">Crush CLI</a></li>
+<li><a href="../../frontend/claude-code">Claude Code</a></li>
+<li><a href="../../frontend/opencode">OpenCode</a></li>
+<li><a href="../../frontend/crush-cli">Crush CLI</a></li>
 <li>Any Anthropic API client</li>
 </ul>
 </td>
