Olla is a high-performance proxy and load balancer for LLM infrastructure, written in Go. It intelligently routes requests across local and remote inference nodes (Ollama, LM Studio, LiteLLM, vLLM, SGLang, Llamacpp, Lemonade, Anthropic, and OpenAI-compatible endpoints).
The project provides two proxy engines: Sherpa (simple, maintainable) and Olla (high-performance with advanced features).
Full documentation available at: https://thushan.github.io/olla/
make ready # Run before commit (test-short + test-race + fmt + vet + lint + align)
make ready-tools # Check code with tools only (fmt + vet + lint + align)
make test # Run all tests
make test-race # Run tests with race detection
make test-stress # Run comprehensive stress tests
make bench # Run all benchmarks
make bench-balancer # Run balancer benchmarks
make build # Build optimised binary with version info
make build-local # Build binary to ./build/ (fast, for testing)
make run # Run with version info
make run-debug # Run with debug logging
make ci # Run full CI pipeline locally
make help # Show all available targetsolla/
├── main.go # Entry point, initialises services
├── config.yaml # Default configuration
├── config/
│ ├── profiles/ # Provider-specific profiles
│ │ ├── ollama.yaml # Ollama configuration
│ │ ├── llamacpp.yaml # llama.cpp configuration
│ │ ├── lmstudio.yaml # LM Studio configuration
│ │ ├── lemonade.yaml # Lemonade SDK configuration
│ │ ├── litellm.yaml # LiteLLM gateway configuration
│ │ ├── vllm.yaml # vLLM configuration
│ │ ├── sglang.yaml # SGLang configuration
│ │ ├── openai-compatible.yaml # OpenAI-compatible generic profile
│ │ └── openai.yaml # OpenAI profile
│ ├── models.yaml # Model configurations
│ └── config.local.yaml # Local configuration overrides (user, not committed to git)
├── internal/
│ ├── core/ # Domain layer (business logic)
│ │ ├── domain/ # Core entities
│ │ ├── ports/ # Interface definitions
│ │ └── constants/ # Application constants
│ ├── adapter/ # Infrastructure layer
│ │ ├── balancer/ # Load balancing strategies
│ │ ├── converter/ # Model format converters
│ │ ├── discovery/ # Service discovery
│ │ ├── factory/ # Factory patterns
│ │ ├── filter/ # Request/response filtering
│ │ ├── health/ # Health checking & circuit breakers
│ │ ├── inspector/ # Request inspection
│ │ ├── metrics/ # Metrics collection
│ │ ├── proxy/ # Proxy implementations
│ │ │ ├── sherpa/ # Simple, maintainable proxy
│ │ │ ├── olla/ # High-performance proxy
│ │ │ └── core/ # Shared proxy components
│ │ ├── registry/ # Model & profile registries
│ │ ├── security/ # Security features (rate/size limits)
│ │ ├── stats/ # Statistics collection
│ │ ├── translator/ # API translation layer (OpenAI ↔ Provider)
│ │ └── unifier/ # Model unification
│ ├── app/ # Application layer
│ │ ├── handlers/ # HTTP handlers
│ │ │ ├── server.go # HTTP server setup
│ │ │ ├── server_routes.go # Route registration
│ │ │ ├── handler_proxy.go # Main proxy handler
│ │ │ ├── handler_provider_*.go # Provider-specific handlers
│ │ │ ├── handler_translation.go # Translation handler
│ │ │ ├── handler_status*.go # Status endpoints
│ │ │ ├── handler_health.go # Health endpoints
│ │ │ └── handler_version.go # Version information
│ │ ├── middleware/ # HTTP middleware
│ │ └── services/ # Application services
│ ├── config/ # Configuration management
│ ├── env/ # Environment handling
│ ├── integration/ # Integration tests
│ ├── logger/ # Logging framework
│ ├── router/ # Routing logic
│ ├── util/ # Utilities
│ └── version/ # Version management
├── pkg/ # Reusable packages
│ ├── container/ # Dependency injection
│ ├── eventbus/ # Event bus (pub/sub)
│ ├── format/ # Formatting utilities
│ ├── nerdstats/ # Process statistics
│ ├── pool/ # Object pooling
│ └── profiler/ # Profiling support
└── test/
└── scripts/ # Test scripts
├── auth/ # Authentication tests
├── cases/ # Test cases
├── load/ # Load testing
├── logic/ # Logic & routing tests
├── platform/ # Platform-specific tests
├── security/ # Security tests
└── streaming/ # Streaming tests
main.go- Application entry pointconfig.yaml- Main configurationinternal/app/handlers/server_routes.go- Route registration & API setupinternal/app/handlers/handler_proxy.go- Request routing logicinternal/app/handlers/handler_translation.go- Translation handler with passthrough logicinternal/adapter/proxy/sherpa/service.go- Sherpa proxy implementationinternal/adapter/proxy/olla/service.go- Olla proxy implementationinternal/adapter/translator/- API translation layer (OpenAI ↔ Provider formats)internal/adapter/translator/types.go- PassthroughCapable interface and translator typesinternal/adapter/translator/anthropic/- Anthropic translator implementationinternal/adapter/stats/translator_collector.go- Translator metrics collectorinternal/core/constants/translator.go- TranslatorMode and FallbackReason constantsinternal/core/ports/stats.go- StatsCollector interface with translator trackinginternal/core/domain/profile_config.go- AnthropicSupportConfig for backend profilesconfig/profiles/*.yaml- Backend profiles withanthropic_supportsectionsinternal/version/version.go- Version information embedded at build time/test/scripts/logic/test-model-routing.sh- Test routing & headers
/internal/health- Health check endpoint/internal/status- Endpoint status/internal/status/endpoints- Endpoints status details/internal/status/models- Models status details/internal/stats/models- Model statistics/internal/stats/translators- Translator statistics/internal/process- Process statistics/version- Version information
/olla/models- Unified models listing with filtering/olla/models/{id}- Get unified model by ID or alias
/olla/proxy/- Olla API proxy endpoint (POST)/olla/proxy/v1/models- OpenAI-compatible models listing (GET)
Dynamically registered based on configured translators (e.g., Anthropic Messages API)
/olla/anthropic/v1/messages- Anthropic Messages API (POST) - supports passthrough and translation modes/olla/anthropic/v1/models- List models in Anthropic format (GET)/olla/anthropic/v1/messages/count_tokens- Token count estimation (POST)
X-Olla-Endpoint: Backend nameX-Olla-Model: Model usedX-Olla-Backend-Type: ollama/openai/openai-compatible/lm-studio/vllm/sglang/llamacpp/lemonadeX-Olla-Request-ID: Request IDX-Olla-Response-Time: Total processing timeX-Olla-Mode: Translator mode used (passthroughor absent for translation) - set on Anthropic translator requestsX-Olla-Routing-Strategy: Routing strategy used (when model routing is active)X-Olla-Routing-Decision: Routing decision made (routed/fallback/rejected)X-Olla-Routing-Reason: Human-readable reason for routing decision
- Unit Tests: Components in isolation
- Integration Tests: Full request flow through proxy engines
- Benchmark Tests: Performance comparison (balancers, proxy engines, repositories)
- Security Tests: Rate limiting and size restrictions (see
/test/scripts/security/) - Stress Tests: Comprehensive testing under load
- Script Tests: End-to-end scenarios in
/test/scripts/
# Core test commands
make test # Run all tests
make test-race # Run with race detection
make test-stress # Run stress tests
make test-cover-html # Generate coverage HTML report
# Benchmark commands
make bench # Run all benchmarks
make bench-balancer # Run balancer benchmarks
make bench-repo # Run repository benchmarks
# Specific test patterns
go test -v ./internal/adapter/proxy -run TestAllProxies
go test -v ./internal/adapter/proxy -run TestSherpa
go test -v ./internal/adapter/proxy -run TestOllaAlways run make ready before committing changes.
- Domain Layer (
internal/core): Business logic, entities, and interfaces - Infrastructure Layer (
internal/adapter): Implementations (proxies, balancers, registries) - Application Layer (
internal/app): HTTP handlers, middleware, and services
- Translator Layer: Enables API format translation (e.g., OpenAI ↔ Anthropic) with passthrough optimisation for backends with native support
- Passthrough Mode: When a backend natively supports the Anthropic Messages API (vLLM, llama.cpp, LM Studio, Ollama), requests bypass translation entirely
- Translator Metrics: Thread-safe per-translator statistics tracking passthrough/translation rates, fallback reasons, latency, and streaming breakdown (
internal/adapter/stats/translator_collector.go) - Proxy Engines: Choose Sherpa (simple) or Olla (high-performance)
- Load Balancing: Priority-based recommended for production
- Version Management: Build-time version injection via
internal/version
- Go 1.24+
- Australian English for comments and documentation
- Comment on why rather than what
- Always run
make readybefore committing - Use
make helpto see all available commands
CRITICAL: Always delegate tasks to the appropriate subagent. Do NOT perform work directly in the main context.
- Code Review → Use the appropriate language subagent (Eg. Go Architect) or reviewer subagent
- Code changes → Use the appropriate language subagent (Eg. Go Architect) or implementer subagent
- Research/exploration → Use the explore subagent
- Testing → Use the test subagent
Only use the main context for orchestration and task decomposition.