Redis-backed bandit state for horizontal scaling #27

@evanvolgas

Description


Overview

Implement Redis-backed bandit state synchronization to enable multi-instance horizontal scaling.

Background

From STAFF_REVIEW.md: "Single-node, in-memory state breaks at scale"

Current Architecture (Single Instance)

Single instance          Problem
─────────────────────────────────────
Bandit state in memory   → Lost on restart
Local A_inv matrices     → No shared learning
Per-process middleware   → Can't scale horizontally
Semaphore concurrency    → Local-only coordination

Target Architecture (Multi-Instance)

Multiple instances       Solution
─────────────────────────────────────
Redis bandit state       → Shared across instances
Distributed A_inv        → Consistent UCB calculations
Redis rate limiting      → Global coordination
Distributed locks        → Safe concurrent updates

Implementation Tasks

1. Redis State Serialization

# Store and load bandit state in Redis
import json

import redis


class RedisBackedBandit:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client
        self.key_prefix = "conduit:bandit"

    def save_state(self, algorithm_name: str, state: dict) -> None:
        # One key per algorithm, e.g. "conduit:bandit:linucb"
        key = f"{self.key_prefix}:{algorithm_name}"
        self.redis.set(key, json.dumps(state))

    def load_state(self, algorithm_name: str) -> dict:
        key = f"{self.key_prefix}:{algorithm_name}"
        data = self.redis.get(key)
        return json.loads(data) if data else {}
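One wrinkle with `json.dumps` above: LinUCB state contains numpy matrices (`A_inv`), which JSON cannot serialize directly. A possible approach is to tag and round-trip arrays through nested lists; `encode_state`/`decode_state` below are hypothetical helper names, not part of the codebase:

```python
import json

import numpy as np


def encode_state(state: dict) -> str:
    """Serialize a bandit state dict, converting numpy arrays to nested lists."""
    def default(obj):
        if isinstance(obj, np.ndarray):
            return {"__ndarray__": obj.tolist()}
        raise TypeError(f"Cannot serialize {type(obj).__name__}")
    return json.dumps(state, default=default)


def decode_state(payload: str) -> dict:
    """Restore numpy arrays that were tagged by encode_state."""
    def object_hook(obj):
        if "__ndarray__" in obj:
            return np.asarray(obj["__ndarray__"])
        return obj
    return json.loads(payload, object_hook=object_hook)


# Round-trip example
state = {"A_inv": np.eye(3), "pulls": 42}
restored = decode_state(encode_state(state))
```

Anything JSON already handles (counts, floats, strings) passes through untouched, so the same helpers can wrap every algorithm's state dict.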

2. Atomic Matrix Updates (LinUCB)

Challenge: Multiple instances updating A_inv matrices concurrently

Solution Options:

Option A: Distributed Locks (Strong Consistency)

def update_linucb(self, arm: str, reward: float, context):
    # Serialize concurrent updates to this arm's matrix across instances
    lock = self.redis.lock(f"lock:linucb:{arm}", timeout=10)
    with lock:
        # Load current A_inv from Redis
        A_inv = self.load_matrix(arm)
        # Rank-1 update via the Sherman-Morrison/Woodbury identity
        A_inv_updated = woodbury_update(A_inv, context, reward)
        # Save back to Redis before the lock is released
        self.save_matrix(arm, A_inv_updated)

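The helper `woodbury_update` called above is not defined in this issue. A minimal Sherman-Morrison sketch follows; note that in standard LinUCB the reward updates the separate b vector, not A, so only the context enters the matrix update here:

```python
import numpy as np


def woodbury_update(A_inv: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Sherman-Morrison rank-1 update: compute (A + x x^T)^-1 from A^-1.

    O(d^2) instead of the O(d^3) cost of re-inverting A from scratch.
    """
    x = x.reshape(-1, 1)                      # column vector
    Ax = A_inv @ x
    denom = 1.0 + (x.T @ Ax).item()
    return A_inv - (Ax @ Ax.T) / denom


# Sanity check against a direct inverse
d = 4
rng = np.random.default_rng(0)
A = np.eye(d) + rng.random((d, d))
A = A @ A.T                                   # symmetric positive definite
x = rng.random(d)
direct = np.linalg.inv(A + np.outer(x, x))
via_sm = woodbury_update(np.linalg.inv(A), x)
assert np.allclose(direct, via_sm)
```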
Option B: Eventual Consistency (Accept Divergence)

# Each instance maintains local A_inv
# Periodic sync with Redis (every N updates)
# Accept that instances may have slightly different UCB scores
# Document tradeoffs in ARCHITECTURE.md

Recommendation: Start with Option B (eventual consistency) for better performance, and upgrade to Option A if consistency issues are observed.
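Option B's periodic sync could be sketched as below. All names are illustrative rather than the actual Conduit API, the store only needs get/set (a `redis.Redis` qualifies), and the averaging reconciliation is deliberately naive:

```python
import json

import numpy as np


class EventualLinUCBState:
    """Sketch: local A_inv with periodic push/pull to a shared store."""

    def __init__(self, store, arm: str, dim: int, sync_every: int = 50):
        self.store = store
        self.key = f"conduit:linucb:{arm}:A_inv"
        self.A_inv = np.eye(dim)
        self.sync_every = sync_every
        self._updates = 0

    def update(self, x: np.ndarray) -> None:
        # Sherman-Morrison rank-1 update on the local copy
        x = x.reshape(-1, 1)
        Ax = self.A_inv @ x
        self.A_inv -= (Ax @ Ax.T) / (1.0 + (x.T @ Ax).item())
        self._updates += 1
        if self._updates % self.sync_every == 0:
            self._sync()

    def _sync(self) -> None:
        # Last-writer-wins: instances may briefly diverge (Option B tradeoff)
        remote = self.store.get(self.key)
        if remote is not None:
            # Naive reconciliation: average local and remote estimates
            self.A_inv = (self.A_inv + np.asarray(json.loads(remote))) / 2.0
        self.store.set(self.key, json.dumps(self.A_inv.tolist()))


# Any dict-like store works for local testing
class DictStore:
    def __init__(self):
        self.data = {}

    def get(self, k):
        return self.data.get(k)

    def set(self, k, v):
        self.data[k] = v
```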

3. Distributed Rate Limiting

from redis import Redis


class RedisRateLimiter:
    """Fixed-window rate limiter shared across all instances."""

    def __init__(self, redis: Redis, max_qps: int = 100):
        self.redis = redis
        self.max_qps = max_qps

    def acquire(self, provider: str) -> bool:
        key = f"rate_limit:{provider}"
        current = self.redis.incr(key)
        if current == 1:
            # First request in the window starts the 1-second TTL.
            # Note: a crash between INCR and EXPIRE leaves the key
            # without a TTL; EXPIRE with NX (Redis >= 7.0) or a
            # MULTI/EXEC pipeline closes that gap.
            self.redis.expire(key, 1)
        return current <= self.max_qps
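The limiter can be exercised locally with an in-memory stub; `FakeRedis` below is a hypothetical test double, not part of redis-py, and the limiter class is repeated so the snippet runs standalone:

```python
import time


class RedisRateLimiter:
    """Same fixed-window logic as above, untyped so any client works."""

    def __init__(self, redis, max_qps=100):
        self.redis = redis
        self.max_qps = max_qps

    def acquire(self, provider):
        key = f"rate_limit:{provider}"
        current = self.redis.incr(key)
        if current == 1:
            self.redis.expire(key, 1)
        return current <= self.max_qps


class FakeRedis:
    """Test double implementing only the incr/expire subset used above."""

    def __init__(self):
        self.counts = {}
        self.deadlines = {}

    def incr(self, key):
        # Drop the counter once its window has elapsed
        if key in self.deadlines and time.monotonic() >= self.deadlines[key]:
            del self.counts[key], self.deadlines[key]
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        self.deadlines[key] = time.monotonic() + seconds


limiter = RedisRateLimiter(FakeRedis(), max_qps=3)
results = [limiter.acquire("openai") for _ in range(5)]
# → [True, True, True, False, False]
```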

4. State Compaction & Cleanup

# Leader election for cleanup tasks: only one instance runs them
def elect_leader(redis_client):
    lock = redis_client.lock("conduit:leader", timeout=30)
    if lock.acquire(blocking=False):
        # This instance is the leader for this cycle
        try:
            compact_old_state()
            cleanup_expired_keys()
        finally:
            lock.release()

Success Criteria

  • Redis state backend implemented in conduit_bench/backends/redis.py
  • All bandit algorithms support Redis persistence
  • Atomic matrix updates for LinUCB (choose consistency model)
  • Distributed rate limiting working
  • Leader election for cleanup tasks
  • Multi-instance deployment example (Docker Compose or Kubernetes)
  • Load test validates performance with 3+ instances
  • Documentation in docs/SCALING.md

Testing Strategy

# Start 3 instances with shared Redis
docker-compose up --scale conduit=3

# Run load test targeting all 3 instances
k6 run tests/load/multi_instance.js

# Verify state consistency across instances
pytest tests/integration/test_redis_sync.py

Consistency Model Decision

Document tradeoffs in docs/SCALING.md:

Model            Latency   Consistency   Complexity
───────────────────────────────────────────────────
Strong (locks)   +50ms     Perfect       High
Eventual         +5ms      ~98%          Medium
Optimistic       +2ms      ~95%          Low

Priority

MEDIUM - Important for production scale, not needed for research

Difficulty

Advanced - Requires distributed systems design expertise
