Overview
Implement Redis-backed bandit state synchronization to enable multi-instance horizontal scaling.
Background
From STAFF_REVIEW.md: "Single-node, in-memory state breaks at scale"
Current Architecture (Single Instance)

| Single instance | Problem |
|---|---|
| Bandit state in memory | Lost on restart |
| Local A_inv matrices | No shared learning |
| Per-process middleware | Can't scale horizontally |
| Semaphore concurrency | Local-only coordination |
Target Architecture (Multi-Instance)

| Multiple instances | Solution |
|---|---|
| Redis bandit state | Shared across instances |
| Distributed A_inv | Consistent UCB calculations |
| Redis rate limiting | Global coordination |
| Distributed locks | Safe concurrent updates |
Implementation Tasks
1. Redis State Serialization

```python
import json

# Store bandit state in Redis under a shared key prefix
class RedisBackedBandit:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.key_prefix = "conduit:bandit"

    def save_state(self, algorithm_name: str, state: dict):
        key = f"{self.key_prefix}:{algorithm_name}"
        self.redis.set(key, json.dumps(state))

    def load_state(self, algorithm_name: str) -> dict:
        key = f"{self.key_prefix}:{algorithm_name}"
        data = self.redis.get(key)
        return json.loads(data) if data else {}
```

2. Atomic Matrix Updates (LinUCB)
Challenge: Multiple instances updating A_inv matrices concurrently
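Option A below calls `load_matrix`/`save_matrix` helpers that this issue leaves undefined. One hedged way to round-trip dense numpy matrices through Redis as raw bytes; the key layout and identity-matrix default are assumptions, not the project's actual schema:

```python
import numpy as np

def save_matrix(redis_client, arm: str, A_inv: np.ndarray) -> None:
    # Hypothetical key layout: one binary blob per arm
    key = f"conduit:bandit:linucb:{arm}:A_inv"
    redis_client.set(key, A_inv.astype(np.float64).tobytes())

def load_matrix(redis_client, arm: str, dim: int) -> np.ndarray:
    key = f"conduit:bandit:linucb:{arm}:A_inv"
    raw = redis_client.get(key)
    if raw is None:
        # LinUCB conventionally initializes A (and thus A_inv) to the identity
        return np.eye(dim)
    return np.frombuffer(raw, dtype=np.float64).reshape(dim, dim)
```

JSON would also work but roughly doubles payload size and loses dtype fidelity; raw bytes keep the round trip exact.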
Solution Options:
Option A: Distributed Locks (Strong Consistency)

```python
def update_linucb(self, arm, reward, context):
    # redis-py distributed lock; the timeout prevents a crashed holder
    # from blocking other instances forever
    lock = self.redis.lock(f"lock:linucb:{arm}", timeout=10)
    with lock:
        # Load current A_inv from Redis
        A_inv = self.load_matrix(arm)
        # Update with the Woodbury (Sherman-Morrison) identity
        A_inv_updated = woodbury_update(A_inv, context, reward)
        # Save back to Redis
        self.save_matrix(arm, A_inv_updated)
```

Option B: Eventual Consistency (Accept Divergence)
```python
# Each instance maintains a local A_inv
# Periodic sync with Redis (every N updates)
# Accept that instances may have slightly different UCB scores
# Document tradeoffs in ARCHITECTURE.md
```

Recommendation: Start with Option B (eventual consistency) for better performance; upgrade to Option A if consistency issues are observed.
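The `woodbury_update` referenced in Option A is also left undefined here. For LinUCB's rank-1 update (A ← A + x xᵀ), the standard trick is the Sherman-Morrison special case of the Woodbury identity, which avoids a full O(d³) re-inversion. A minimal sketch; note the reward would update the separate `b` vector, which is omitted here:

```python
import numpy as np

def woodbury_update(A_inv: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return (A + x x^T)^-1 given A^-1, via Sherman-Morrison.

    Assumes A_inv is symmetric (true for LinUCB, where A starts as the
    identity and accumulates outer products of contexts).
    """
    Ax = A_inv @ x
    denom = 1.0 + x @ Ax          # 1 + x^T A^-1 x, always > 0 for PSD A
    return A_inv - np.outer(Ax, Ax) / denom
```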
3. Distributed Rate Limiting

```python
from redis import Redis

class RedisRateLimiter:
    def __init__(self, redis: Redis, max_qps: int = 100):
        self.redis = redis
        self.max_qps = max_qps

    def acquire(self, provider: str) -> bool:
        key = f"rate_limit:{provider}"
        current = self.redis.incr(key)
        if current == 1:
            self.redis.expire(key, 1)  # 1-second window
        return current <= self.max_qps
```

4. State Compaction & Cleanup
```python
# Leader election for cleanup tasks via a Redis lock
def elect_leader():
    lock = redis.lock("conduit:leader", timeout=30)
    if lock.acquire(blocking=False):
        try:
            # This instance is the leader
            compact_old_state()
            cleanup_expired_keys()
        finally:
            lock.release()
```

Success Criteria
- Redis state backend implemented in conduit_bench/backends/redis.py
- All bandit algorithms support Redis persistence
- Atomic matrix updates for LinUCB (choose consistency model)
- Distributed rate limiting working
- Leader election for cleanup tasks
- Multi-instance deployment example (Docker Compose or Kubernetes)
- Load test validates performance with 3+ instances
- Documentation in docs/SCALING.md
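For the multi-instance deployment criterion, a minimal Docker Compose sketch; the image, build context, and environment variable names are assumptions, not the project's actual configuration:

```yaml
services:
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  conduit:
    build: .
    environment:
      REDIS_URL: redis://redis:6379/0
    depends_on:
      - redis
    # Scale out with: docker-compose up --scale conduit=3
```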
Testing Strategy
```bash
# Start 3 instances with shared Redis
docker-compose up --scale conduit=3

# Run load test targeting all 3 instances
k6 run tests/load/multi_instance.js

# Verify state consistency across instances
pytest tests/integration/test_redis_sync.py
```

Consistency Model Decision
Document tradeoffs in docs/SCALING.md:
| Model | Latency | Consistency | Complexity |
|---|---|---|---|
| Strong (locks) | +50ms | Perfect | High |
| Eventual | +5ms | ~98% | Medium |
| Optimistic | +2ms | ~95% | Low |
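The Optimistic row corresponds to compare-and-set with retry; in Redis this would be WATCH/MULTI/EXEC or a short Lua script. The retry loop itself is store-agnostic, so it can be sketched against an in-memory stand-in (the `VersionedStore` class below is illustrative, not part of the codebase):

```python
import json

class VersionedStore:
    """In-memory stand-in for a versioned key-value store. With Redis,
    compare_and_set would be a WATCH/MULTI/EXEC transaction."""
    def __init__(self):
        self._data = {}  # key -> (version, serialized value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value) -> bool:
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # another writer got there first; caller retries
        self._data[key] = (version + 1, value)
        return True

def optimistic_update(store, key, update_fn, max_retries=5) -> bool:
    # Read-modify-write with bounded retries on write conflicts
    for _ in range(max_retries):
        version, raw = store.get(key)
        state = json.loads(raw) if raw else {}
        if store.compare_and_set(key, version, json.dumps(update_fn(state))):
            return True
    return False
```

Under low contention this costs one read and one conditional write per update, which is where the ~2ms latency and ~95% consistency figures in the table plausibly come from.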
Priority
MEDIUM - Important for production scale, not needed for research
Difficulty
Advanced - Requires distributed systems design expertise
Dependencies
- Redis 6.0+ (for distributed locks)
- Load testing suite (#24, "Load testing suite with locust or k6") for validation