Internals of embedding, rate limiting, Qdrant search, and conflict detection. For the decision model and conflict logic, see decisions.md. For configuration, see configuration.md. For operational procedures, see runbook.md.
Akashi generates vector embeddings for every decision trace to enable semantic search ("find decisions similar to X"). The embedding provider is selected at startup and used throughout the server's lifetime.
```
AKASHI_EMBEDDING_PROVIDER=auto (default)
│
├─ Try Ollama (GET /api/tags, 2s timeout)
│   ├─ Reachable → OllamaProvider
│   └─ Unreachable ↓
│
├─ Check OPENAI_API_KEY
│   ├─ Set → OpenAIProvider
│   └─ Empty ↓
│
└─ NoopProvider (zero vectors, semantic search disabled)
```
Set AKASHI_EMBEDDING_PROVIDER to ollama, openai, or noop to skip auto-detection.
| Provider | Model | Dimensions | Context Window | Data Residency |
|---|---|---|---|---|
| OllamaProvider | mxbai-embed-large | 1024 | 512 tokens | On-premises |
| OpenAIProvider | text-embedding-3-small | 1024 | 8191 tokens | OpenAI servers |
| NoopProvider | N/A | configurable | N/A | N/A |
Both providers truncate input at a word boundary before sending, using a shared `truncateText` function. This prevents silent failures where the API rejects oversized input and the decision is stored with `embedding = NULL`.
| Provider | Max chars | Approx tokens | Model limit |
|---|---|---|---|
| Ollama | 2,000 | ~500 | 512 tokens |
| OpenAI | 30,000 | ~7,500 | 8,191 tokens |
Ollama also has a server-side safety net: the /api/embed endpoint truncates at the token level if the character-based estimate overshoots.
Decisions are always stored in full — only the embedding input is truncated.
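A minimal sketch of the word-boundary truncation described above. The real `truncateText` may differ in signature and edge-case handling; this version shows the core idea: cut at the character limit, then back up to the last space so no word is split.

```go
package main

import (
	"fmt"
	"strings"
)

// truncateText trims s to at most maxChars, backing up to the last word
// boundary so the final word is never split mid-way.
func truncateText(s string, maxChars int) string {
	if len(s) <= maxChars {
		return s // already within the limit; no change
	}
	cut := s[:maxChars]
	if i := strings.LastIndexByte(cut, ' '); i > 0 {
		cut = cut[:i] // drop the trailing partial word
	}
	return cut
}

func main() {
	fmt.Println(truncateText("store the decision with full reasoning", 20))
}
```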
EmbedBatch first tries Ollama's native batch API (/api/embed with an array input). If that fails (e.g., older Ollama versions), it falls back to concurrent single-text requests with a semaphore (max 4 concurrent).
OpenAI's EmbedBatch truncates all texts, then sends them in a single API call (native batch).
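The semaphore-bounded fallback path can be sketched like this. `embedOne` is a stand-in for a single-text embedding call; the concurrency limit of 4 comes from the text above, and the channel-as-semaphore pattern is one common Go idiom for it.

```go
package main

import (
	"fmt"
	"sync"
)

// embedConcurrently embeds each text with its own request, allowing at
// most 4 requests in flight at once. Results keep their input order.
func embedConcurrently(texts []string, embedOne func(string) []float32) [][]float32 {
	out := make([][]float32, len(texts))
	sem := make(chan struct{}, 4) // semaphore: max 4 concurrent requests
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			out[i] = embedOne(t)
		}(i, t)
	}
	wg.Wait()
	return out
}

func main() {
	vecs := embedConcurrently([]string{"a", "bb"}, func(s string) []float32 {
		return []float32{float32(len(s))}
	})
	fmt.Println(len(vecs))
}
```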
On startup, the server queries for decisions with embedding IS NULL and processes them in batches. This handles decisions that were created when the embedding provider was unavailable (e.g., Ollama was down) or before the provider was configured.
The backfill runs once per startup and logs progress:
```
{"level":"INFO","msg":"backfill: embedded decisions","count":5,"batch":1}
{"level":"INFO","msg":"embedding backfill complete","count":5}
```
Two embeddings are computed per decision (Option B for conflict detection):
| Embedding | Input | Stored As |
|---|---|---|
| Full | `{decision_type}: {outcome} {reasoning}` | decisions.embedding |
| Outcome-only | `{outcome}` | decisions.outcome_embedding |
The full embedding powers semantic search and conflict topic similarity. The outcome-only embedding powers conflict outcome divergence. Alternatives and evidence are not included. See decisions.md.
On startup, after the main embedding backfill, the server also backfills outcome_embedding for decisions that have embedding but not outcome_embedding. This handles decisions created before migration 027.
If embedding fails for a decision (provider error, timeout, etc.), the decision is stored with embedding = NULL. It remains queryable via SQL filters and full-text search, but is invisible to semantic (vector) search. The backfill job on next restart will attempt to embed it again.
```go
type Limiter interface {
	Allow(ctx context.Context, key string) (bool, error)
	Close() error
}
```

Implementations must be safe for concurrent use. Errors are treated as fail-open: a broken limiter does not block traffic.
In-memory token bucket with per-key independent buckets.
- Refill: Tokens are added at `rate` per second (default 100).
- Burst: Bucket capacity (default 200). A new key starts with a full bucket.
- Eviction: A background goroutine evicts keys not accessed in 10 minutes (runs every minute).
The middleware constructs keys as `org:<uuid>:agent:<id>`, giving each agent within each org an independent rate limit. Platform admins bypass rate limiting entirely. Unauthenticated paths (health, auth token) pass through.
When rate limited, the server returns 429 Too Many Requests with a JSON error body:
```json
{"error": "rate limit exceeded"}
```

Enterprise deployments replace MemoryLimiter with a Redis-backed implementation for cross-instance coordination. The Limiter interface is the contract — the middleware is unaware of the backing store.
Decisions are stored in PostgreSQL (source of truth) and indexed in Qdrant (derived search index). The outbox pattern ensures eventual consistency without distributed transactions.
```
POST /v1/trace
│
├─ 1. Decision written to PostgreSQL (with embedding)
├─ 2. Row inserted into search_outbox (same transaction)
│
└─ (async) OutboxWorker polls search_outbox
   │
   ├─ SELECT ... FOR UPDATE SKIP LOCKED (batch, max 100)
   ├─ Lock entries for 60s
   ├─ Fetch full decision data from PostgreSQL
   ├─ Upsert points to Qdrant (or delete)
   │
   ├─ Success → DELETE from search_outbox
   └─ Failure → INCREMENT attempts, exponential backoff
                (2^attempts seconds, capped at 5 min)
```
Created automatically on startup if missing. Configuration:
| Property | Value |
|---|---|
| Collection name | akashi_decisions (configurable) |
| Vector size | 1024 (matches embedding dimensions) |
| Distance metric | Cosine similarity |
| HNSW M | 16 |
| HNSW ef_construct | 128 |
Payload indexes (for filtered search): org_id, agent_id, decision_type (keyword); confidence, completeness_score, valid_from_unix (float).
Tenant isolation: every query includes org_id as a required filter.
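In Qdrant's REST filter syntax, the required tenant clause is a `must` match on the `org_id` payload field. A sketch of how that clause might be built (the surrounding query construction is omitted; only the filter shape is shown):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// orgFilter returns the mandatory tenant-isolation clause: every search
// must match the caller's org_id in the point payload.
func orgFilter(orgID string) map[string]any {
	return map[string]any{
		"must": []any{
			map[string]any{
				"key":   "org_id",
				"match": map[string]any{"value": orgID},
			},
		},
	}
}

func main() {
	b, _ := json.Marshal(orgFilter("org-123"))
	fmt.Println(string(b))
}
```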
| Behavior | Value |
|---|---|
| Max attempts | 10 |
| Backoff | Exponential: 2^attempts seconds, capped at 5 minutes |
| Dead-letter cleanup | Entries with attempts >= 10 and older than 7 days are deleted hourly |
| Lock duration | 60 seconds per batch (prevents double-processing) |
Dead-lettered entries remain in PostgreSQL and are queryable via SQL. They are not indexed in Qdrant until manually reset:
```sql
UPDATE search_outbox
SET attempts = 0, locked_until = NULL, last_error = NULL
WHERE attempts >= 10;
```

Raw Qdrant similarity scores are adjusted before returning results:
```
outcome_weight = 0.40 * assessment_score          (explicit feedback; 0 if no assessments)
               + 0.25 * log1p(citations)/log(6)   (logarithmic, saturates at 5 citations)
               + 0.15 * stability_score           (0 if superseded within 48h)
               + 0.10 * agreement_score           (min(AgreementCount/3, 1))
               + 0.10 * conflict_win_rate         (0 if no conflict history)

relevance = similarity × (0.5 + 0.5 × outcome_weight) × recency_decay

recency_decay = 1 / (1 + age_days/90)
```
- Assessment (primary, 40%): Explicit correctness feedback from `akashi_assess`. Contributes 0 when no assessments exist.
- Citations (25%): Logarithmic — first citation worth more than later ones.
- Stability (15%): Decisions superseded within 48h of creation score 0.
- Agreement / conflict win rate: Minor boosts based on consensus signals.
- Recency decay: Decisions lose relevance with a 90-day half-life.
- Over-fetch: Qdrant returns `limit * 3` results; re-scoring and truncation happen in Go.
On shutdown, the outbox worker:
- Cancels the poll loop.
- Runs one final `processBatch` with the caller's drain context (respects deadline).
- Signals completion via the `done` channel.
If the drain context expires before the final batch completes, the log emits "search outbox: drain timed out". Remaining entries stay in the outbox and sync on next startup.
When QDRANT_URL is empty, the outbox worker is not started and POST /v1/search falls back to PostgreSQL full-text search (tsvector with GIN index) plus ILIKE matching. Semantic similarity is unavailable; results are ranked by text relevance only.
For the full conflict detection pipeline (candidate retrieval, significance scoring, LLM validation, claim-level analysis, resolution, analytics, and observability), see conflicts.md.