Olla is a high-performance proxy and load balancer for LLM infrastructure, written in Go. It intelligently routes requests across local and remote inference nodes (Ollama, LM Studio, LiteLLM, vLLM, SGLang, Llamacpp, Lemonade, Anthropic, and OpenAI-compatible endpoints).
The project provides two proxy engines: Sherpa (simple, maintainable) and Olla (high-performance with advanced features).
Full documentation available at: https://thushan.github.io/olla/
## Commands
```bash
make ready           # Run before commit (test-short + test-race + fmt + lint + align)
make ready-tools     # Check code with tools only (fmt + lint + align)
make test            # Run all tests
make test-race       # Run tests with race detection
make test-stress     # Run comprehensive stress tests
make bench           # Run all benchmarks
make bench-balancer  # Run balancer benchmarks
make build           # Build optimised binary with version info
make build-local     # Build binary to ./build/ (fast, for testing)
```
Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, with a [wide variety](https://thushan.github.io/olla/integrations/overview/) of natively supported endpoints, and it is extensible enough to support others. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.
Olla works alongside API gateways like [LiteLLM](https://github.com/BerriAI/litellm) or orchestration platforms like [GPUStack](https://github.com/gpustack/gpustack), focusing on making your **existing** LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: **Sherpa** for simplicity and maintainability or **Olla** for maximum performance with advanced features like circuit breakers and connection pooling.
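A hedged sketch of how the engine choice might appear in configuration — the exact key name is an assumption for illustration, so check the documentation for the real setting:

```yaml
# Illustrative sketch only: the exact key may differ in the real schema.
proxy:
  engine: sherpa   # simple and maintainable; "olla" for the high-performance
                   # engine with circuit breakers and connection pooling
```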