RFC: Thompson Sampling for observation quality optimization #1571

@alessandropcostabr

Proposal

Use Thompson Sampling (multi-armed bandit) to dynamically learn which model produces the best observations for each observation type, building on the tier routing and feedback collection infrastructure.

Context

With tier routing (#1569), observations are processed by different models based on queue complexity. But the optimal model per observation type is unknown a priori — it depends on usage patterns that vary per user and project.

The observation_feedback table (#1569) already collects usage signals. This RFC proposes using those signals as reward/penalty for a bandit algorithm.

Design

Arms: {observation_type}:{model} (e.g., discovery:haiku, completion:sonnet, summary:opus)

Reward (alpha += 1):

  • Observation accessed via get_observations within 7 days
  • Observation injected via semantic search AND Claude referenced it in response

Penalty (beta += 1):

  • Observation never accessed in 30 days
  • Observation injected but Claude ignored it

Selection: Thompson Sampling — for each arm, draw a sample from Beta(alpha, beta) and route to the arm with the highest draw.
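The selection step is small enough to sketch directly. A minimal Python version, assuming arms are held as an in-memory dict of (alpha, beta) pairs (the `select_arm` and `arms_for` helpers are illustrative, not part of the proposal):

```python
import random

def select_arm(arms: dict[str, tuple[float, float]]) -> str:
    """Thompson Sampling: draw one sample per arm from Beta(alpha, beta),
    pick the arm whose sample is highest."""
    samples = {arm: random.betavariate(a, b) for arm, (a, b) in arms.items()}
    return max(samples, key=samples.get)

def arms_for(obs_type: str, all_arms: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Only arms for the same observation type compete against each other."""
    prefix = obs_type + ":"
    return {k: v for k, v in all_arms.items() if k.startswith(prefix)}
```

Because selection is a random draw rather than a fixed argmax, under-explored arms (wide Beta distributions) still win occasionally, which is what gives the bandit its exploration behavior for free.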

Schema addition:

CREATE TABLE bandit_arms (
  arm_id TEXT PRIMARY KEY,         -- e.g., 'discovery:haiku'
  alpha REAL DEFAULT 1.0,          -- success count + prior
  beta REAL DEFAULT 1.0,           -- failure count + prior
  updated_at_epoch INTEGER
);
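The alpha/beta updates against this table reduce to a lazy upsert plus a one-column bump. A sketch using Python's stdlib sqlite3 (the `record_outcome` helper and in-memory connection are hypothetical; the real store would live wherever the plugin keeps its DB):

```python
import sqlite3

def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE bandit_arms (
             arm_id TEXT PRIMARY KEY,
             alpha REAL DEFAULT 1.0,    -- success count + prior
             beta REAL DEFAULT 1.0,     -- failure count + prior
             updated_at_epoch INTEGER
           )"""
    )
    return conn

def record_outcome(conn: sqlite3.Connection, arm_id: str,
                   success: bool, now_epoch: int) -> None:
    # Create the arm lazily with the uniform Beta(1, 1) prior...
    conn.execute(
        "INSERT INTO bandit_arms (arm_id, updated_at_epoch) VALUES (?, ?) "
        "ON CONFLICT(arm_id) DO NOTHING",
        (arm_id, now_epoch),
    )
    # ...then bump exactly one counter: reward -> alpha, penalty -> beta.
    col = "alpha" if success else "beta"
    conn.execute(
        f"UPDATE bandit_arms SET {col} = {col} + 1, updated_at_epoch = ? "
        "WHERE arm_id = ?",
        (now_epoch, arm_id),
    )
```

Starting both counters at 1.0 encodes the uniform prior, so a brand-new arm is neither favored nor penalized before any feedback arrives.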

Architecture

Observation arrives
  → Bandit selects model for this obs_type
  → SDK Agent processes with selected model
  → Observation stored

7 days later (cron or lazy evaluation):
  → Feedback signals checked for this observation
  → Was it accessed/used? → reward (alpha += 1)
  → Never accessed? → penalty (beta += 1)
  → Bandit priors updated

Inspiration

Adapted from an internal project that used Thompson Sampling for model routing in a Telegram bot, where the bandit learned an effective model selection policy over ~2 weeks with user reactions as the reward signal. The same principle applies here, with observation usage as the reward.

Streak Detection (complementary)

Different failure types need different responses:

Failure type       Pattern           Action
API timeout        ETIMEDOUT         Exponential backoff
Rate limit         429               Pause 60s + model fallback
Content policy     content_policy    Skip observation (don't retry)
Context overflow   context_window    Reduce prompt size
Model error        500, overloaded   Fallback to alternative model
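A dispatch over these failure types might be sketched as follows (the strategy names and error-string matching are illustrative; a real implementation would inspect structured error codes rather than substrings):

```python
def classify_failure(err: str) -> str:
    """Map a raw error string to a response strategy per the table above."""
    e = err.lower()
    if "etimedout" in e:
        return "exponential_backoff"
    if "429" in e or "rate limit" in e:
        return "pause_60s_then_fallback"
    if "content_policy" in e:
        return "skip_observation"       # permanent failure: don't retry
    if "context_window" in e:
        return "reduce_prompt"
    if "500" in e or "overloaded" in e:
        return "fallback_model"
    return "retry_default"              # unknown error: plain retry
```

The key distinction the table encodes is retryable vs. permanent failures: a content-policy rejection will fail identically on every retry, so it must short-circuit rather than burn backoff attempts.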

Prerequisites

Open Questions

  1. Should the bandit also optimize prompt templates, not just models?
  2. What's the right exploration rate for ~120 observations/day?
  3. Should we warm-start from known priors (e.g., summaries always prefer higher-quality models)?
  4. How to handle cold-start for new observation types?

Implementation Order

  1. Bandit Engine — core Thompson Sampling logic + bandit_arms table
  2. Reward Evaluator — cron/lazy evaluation of feedback signals
  3. Integration — replace static tier routing with bandit selection
  4. Patrol — monitoring dashboard for arm performance
  5. Streak Detection — failure-type-specific responses
