RFC: Thompson Sampling for observation quality optimization #1571
Proposal
Use Thompson Sampling (multi-armed bandit) to dynamically learn which model produces the best observations for each observation type, building on the tier routing and feedback collection infrastructure.
Context
With tier routing (#1569), observations are processed by different models based on queue complexity. But the optimal model per observation type is unknown a priori — it depends on usage patterns that vary per user and project.
The observation_feedback table (#1569) already collects usage signals. This RFC proposes using those signals as reward/penalty for a bandit algorithm.
Design
Arms: {observation_type}:{model} (e.g., discovery:haiku, completion:sonnet, summary:opus)
Reward (alpha += 1):
- Observation accessed via `get_observations` within 7 days
- Observation injected via semantic search AND Claude referenced it in the response

Penalty (beta += 1):
- Observation never accessed within 30 days
- Observation injected but Claude ignored it
Selection: Thompson Sampling — sample from Beta(alpha, beta) for each arm, select the highest.
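The selection step above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the `arms` dict stands in for rows of the `bandit_arms` table, and `select_model` is a hypothetical helper name.

```python
import random

# Stand-in for the bandit_arms table; alpha = beta = 1.0 is the uniform prior.
arms = {
    "discovery:haiku":  {"alpha": 1.0, "beta": 1.0},
    "discovery:sonnet": {"alpha": 1.0, "beta": 1.0},
    "discovery:opus":   {"alpha": 1.0, "beta": 1.0},
}

def select_model(obs_type: str) -> str:
    """Thompson Sampling: draw from Beta(alpha, beta) per arm, pick the max."""
    candidates = {
        arm_id: params
        for arm_id, params in arms.items()
        if arm_id.startswith(f"{obs_type}:")
    }
    best_arm = max(
        candidates,
        key=lambda a: random.betavariate(
            candidates[a]["alpha"], candidates[a]["beta"]
        ),
    )
    return best_arm.split(":", 1)[1]  # 'discovery:haiku' -> 'haiku'
```

Because each call re-samples from the posterior, under-explored arms keep getting occasional traffic while well-rewarded arms win most draws, which gives the exploration/exploitation trade-off for free.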
Schema addition:

```sql
CREATE TABLE bandit_arms (
  arm_id           TEXT PRIMARY KEY,  -- e.g., 'discovery:haiku'
  alpha            REAL DEFAULT 1.0,  -- success count + prior
  beta             REAL DEFAULT 1.0,  -- failure count + prior
  updated_at_epoch INTEGER
);
```

Architecture
```
Observation arrives
  → Bandit selects model for this obs_type
  → SDK Agent processes with selected model
  → Observation stored

7 days later (cron or lazy evaluation):
  → Feedback signals checked for this observation
  → Was it accessed/used? → reward (alpha += 1)
  → Never accessed?       → penalty (beta += 1)
  → Bandit priors updated
```
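The deferred-evaluation step can be sketched like this. The function name, signal flags, and in-memory `arms` dict are illustrative assumptions; a real evaluator would read from `observation_feedback` and write back to `bandit_arms`.

```python
def evaluate_observation(arm_id: str,
                         accessed_within_7d: bool,
                         referenced_when_injected: bool,
                         age_days: int,
                         arms: dict) -> None:
    """Translate feedback signals for one observation into a Beta update."""
    arm = arms[arm_id]
    if accessed_within_7d or referenced_when_injected:
        arm["alpha"] += 1.0   # reward: the observation was actually used
    elif age_days >= 30:
        arm["beta"] += 1.0    # penalty: never accessed in 30 days
    # Otherwise: too early to judge; leave the priors untouched.
```

Note the asymmetric windows from the design above: rewards can be credited after 7 days, but the penalty only fires once the 30-day never-accessed window has elapsed, so young observations are not punished prematurely.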
Inspiration
Adapted from an internal project using Thompson Sampling for model routing in a Telegram bot. There, the bandit learned optimal model selection over ~2 weeks, using user reactions as the reward signal. The same principle applies here, with observation usage as the reward.
Streak Detection (complementary)
Different failure types need different responses:
| Failure type | Pattern | Action |
|---|---|---|
| API timeout | ETIMEDOUT | Exponential backoff |
| Rate limit | 429 | Pause 60s + model fallback |
| Content policy | content_policy | Skip observation (don't retry) |
| Context overflow | context_window | Reduce prompt size |
| Model error | 500, overloaded | Fallback to alternative model |
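A dispatch table mirroring the failure types above could look like the following sketch. The pattern strings, action names, and naive substring matching are all assumptions for illustration; real classification would inspect structured error objects rather than messages.

```python
# Ordered (pattern, action) pairs: first match wins.
FAILURE_ACTIONS = [
    ("ETIMEDOUT",      "exponential_backoff"),
    ("429",            "pause_60s_model_fallback"),
    ("content_policy", "skip_observation"),
    ("context_window", "reduce_prompt_size"),
    ("500",            "model_fallback"),
    ("overloaded",     "model_fallback"),
]

def classify_failure(error_message: str) -> str:
    """Map an error message to a recovery action via substring match."""
    for pattern, action in FAILURE_ACTIONS:
        if pattern in error_message:
            return action
    return "exponential_backoff"  # default: treat unknown errors as transient
```

Keeping content-policy failures out of the retry path matters most: retrying them wastes budget and, unlike timeouts, they are deterministic.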
Prerequisites
- Tier routing merged (feat: tier routing by queue complexity + observation feedback table #1569) — provides model-per-type infrastructure
- Feedback collection active — provides usage signals
- Sufficient data: ~2 weeks of feedback signals for meaningful priors
Open Questions
- Should the bandit also optimize prompt templates, not just models?
- What's the right exploration rate for ~120 observations/day?
- Should we warm-start from known priors (e.g., summaries always prefer higher-quality models)?
- How to handle cold-start for new observation types?
Implementation Order
1. Bandit Engine — core Thompson Sampling logic + bandit_arms table
2. Reward Evaluator — cron/lazy evaluation of feedback signals
3. Integration — replace static tier routing with bandit selection
4. Patrol — monitoring dashboard for arm performance
5. Streak Detection — failure-type-specific responses