RFC: Thompson Sampling for observation quality optimization #1571

@alessandropcostabr

Proposal

Use Thompson Sampling (multi-armed bandit) to dynamically learn which model produces the best observations for each observation type, building on the tier routing and feedback collection infrastructure.

Context

With tier routing (#1569), observations are processed by different models based on queue complexity. But the optimal model per observation type is unknown a priori — it depends on usage patterns that vary per user and project.

The observation_feedback table (#1569) already collects usage signals. This RFC proposes using those signals as reward/penalty for a bandit algorithm.

Design

Arms: {observation_type}:{model} (e.g., discovery:haiku, completion:sonnet, summary:opus)

Reward (alpha += 1):

  • Observation accessed via get_observations within 7 days
  • Observation injected via semantic search AND Claude referenced it in response

Penalty (beta += 1):

  • Observation never accessed in 30 days
  • Observation injected but Claude ignored it

Selection: Thompson Sampling — for each arm, draw a sample from Beta(alpha, beta) and route to the arm with the highest draw.
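The selection step is small enough to sketch directly. A minimal Python version, assuming arms are held as an in-memory dict of (alpha, beta) pairs (the `select_arm` and `arms_for` helpers are illustrative, not part of the proposal):

```python
import random

def select_arm(arms: dict[str, tuple[float, float]]) -> str:
    """Thompson Sampling: draw one sample per arm from Beta(alpha, beta),
    pick the arm whose sample is highest."""
    samples = {arm: random.betavariate(a, b) for arm, (a, b) in arms.items()}
    return max(samples, key=samples.get)

def arms_for(obs_type: str, all_arms: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Only arms for the same observation type compete against each other."""
    prefix = obs_type + ":"
    return {k: v for k, v in all_arms.items() if k.startswith(prefix)}
```

Because selection is a random draw rather than a fixed argmax, under-explored arms (wide Beta distributions) still win occasionally, which is what gives the bandit its exploration behavior for free.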

Schema addition:

CREATE TABLE bandit_arms (
  arm_id TEXT PRIMARY KEY,         -- e.g., 'discovery:haiku'
  alpha REAL DEFAULT 1.0,          -- success count + prior
  beta REAL DEFAULT 1.0,           -- failure count + prior
  updated_at_epoch INTEGER
);
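The alpha/beta updates against this table reduce to a lazy upsert plus a one-column bump. A sketch using Python's stdlib sqlite3 (the `record_outcome` helper and in-memory connection are hypothetical; the real store would live wherever the plugin keeps its DB):

```python
import sqlite3

def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE bandit_arms (
             arm_id TEXT PRIMARY KEY,
             alpha REAL DEFAULT 1.0,    -- success count + prior
             beta REAL DEFAULT 1.0,     -- failure count + prior
             updated_at_epoch INTEGER
           )"""
    )
    return conn

def record_outcome(conn: sqlite3.Connection, arm_id: str,
                   success: bool, now_epoch: int) -> None:
    # Create the arm lazily with the uniform Beta(1, 1) prior...
    conn.execute(
        "INSERT INTO bandit_arms (arm_id, updated_at_epoch) VALUES (?, ?) "
        "ON CONFLICT(arm_id) DO NOTHING",
        (arm_id, now_epoch),
    )
    # ...then bump exactly one counter: reward -> alpha, penalty -> beta.
    col = "alpha" if success else "beta"
    conn.execute(
        f"UPDATE bandit_arms SET {col} = {col} + 1, updated_at_epoch = ? "
        "WHERE arm_id = ?",
        (now_epoch, arm_id),
    )
```

Starting both counters at 1.0 encodes the uniform prior, so a brand-new arm is neither favored nor penalized before any feedback arrives.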

Architecture

Observation arrives
  → Bandit selects model for this obs_type
  → SDK Agent processes with selected model
  → Observation stored

7 days later (cron or lazy evaluation):
  → Feedback signals checked for this observation
  → Was it accessed/used? → reward (alpha += 1)
  → Never accessed? → penalty (beta += 1)
  → Bandit priors updated

Inspiration

Adapted from an internal project that used Thompson Sampling for model routing in a Telegram bot, where the bandit learned an effective model selection policy over ~2 weeks with user reactions as the reward signal. The same principle applies here, with observation usage as the reward.

Streak Detection (complementary)

Different failure types need different responses:

Failure type       Pattern           Action
API timeout        ETIMEDOUT         Exponential backoff
Rate limit         429               Pause 60s + model fallback
Content policy     content_policy    Skip observation (don't retry)
Context overflow   context_window    Reduce prompt size
Model error        500, overloaded   Fallback to alternative model
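A dispatch over these failure types might be sketched as follows (the strategy names and error-string matching are illustrative; a real implementation would inspect structured error codes rather than substrings):

```python
def classify_failure(err: str) -> str:
    """Map a raw error string to a response strategy per the table above."""
    e = err.lower()
    if "etimedout" in e:
        return "exponential_backoff"
    if "429" in e or "rate limit" in e:
        return "pause_60s_then_fallback"
    if "content_policy" in e:
        return "skip_observation"       # permanent failure: don't retry
    if "context_window" in e:
        return "reduce_prompt"
    if "500" in e or "overloaded" in e:
        return "fallback_model"
    return "retry_default"              # unknown error: plain retry
```

The key distinction the table encodes is retryable vs. permanent failures: a content-policy rejection will fail identically on every retry, so it must short-circuit rather than burn backoff attempts.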

Prerequisites

Open Questions

  1. Should the bandit also optimize prompt templates, not just models?
  2. What's the right exploration rate for ~120 observations/day?
  3. Should we warm-start from known priors (e.g., summaries always prefer higher-quality models)?
  4. How to handle cold-start for new observation types?

Implementation Order

  1. Bandit Engine — core Thompson Sampling logic + bandit_arms table
  2. Reward Evaluator — cron/lazy evaluation of feedback signals
  3. Integration — replace static tier routing with bandit selection
  4. Patrol — monitoring dashboard for arm performance
  5. Streak Detection — failure-type-specific responses
