[Feature Request]: Multi-Layer Caching System (Memory + Redis) #289

@crivetimihai

Description

🧭 Epic - Multi-Layer Caching System (Memory + Redis)

Title: Multi-Layer Caching System (Memory + Redis)
Goal: Implement dual-layer caching with fast local memory (L1) and distributed Redis (L2) to eliminate both database AND Redis network latency
Why now: Even Redis calls add 1-5ms network latency. Local memory cache provides sub-millisecond access while Redis enables sharing across instances.


πŸ™‹β€β™‚οΈ User Stories & Acceptance Criteria

Story 1 – Fast L1 Hit

As a service handler
I want frequently requested items to come from the local memory cache
So that response latency stays well below 1 ms.

✅ Acceptance Criteria
Scenario: L1 cache hit after warm-up
  Given "tools:list:false" is already in the L1 cache
  When the service calls get_or_set("tools:list:false")
  Then the cache manager returns the value in < 1 ms
  And the metrics counter l1_hits increments by 1
  And no Redis GET operation occurs

Story 2 – L2 Redis Fallback & L1 Population

As a platform engineer
I want a miss in L1 to fall back to Redis and then fill L1
So that the next request is even faster.

✅ Acceptance Criteria
Scenario: L1 miss, L2 hit, L1 warmed
  Given "servers:list:false" exists in Redis but not in L1
  When the service calls get_or_set("servers:list:false")
  Then the cache manager reads the value from Redis in < 5 ms
  And writes the same key to L1 with a 45-second TTL
  And returns the value to the caller
  And the metrics counters l1_misses and l2_hits increment by 1
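
A pytest-style sketch of how this scenario might be automated, assuming pytest-asyncio plus hypothetical cache_manager and fake_redis fixtures wired to the manager sketched later in this issue (none of this test code exists yet):

import pytest

async def fail_if_called():
    raise AssertionError("factory must not run when L2 already holds the key")

@pytest.mark.asyncio
async def test_l1_miss_l2_hit_populates_l1(cache_manager, fake_redis):
    """Story 2: value served from Redis on an L1 miss, then copied into L1."""
    key = "servers:list:false"
    await fake_redis.set(key, '{"servers": []}')   # present in L2 only

    value = await cache_manager.get_or_set(key, fail_if_called)

    assert value == {"servers": []}
    assert key in cache_manager.l1_cache            # L1 was warmed
    assert cache_manager.metrics.l1_misses == 1     # counter names from the criteria above
    assert cache_manager.metrics.l2_hits == 1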

Story 3 – Cross-Instance Invalidation

As a cluster operator
I want cache invalidations in one instance to evict the same keys from every other instance's L1
So that no stale data is served after an update.

✅ Acceptance Criteria
Scenario: Broadcast invalidation pattern
  Given two gateway instances A and B are running
  And both have "mcp:tools:list:false" in their L1 caches
  When instance A calls invalidate("mcp:tools:*")
  Then instance A deletes the matching keys from its own L1
  And instance A publishes "cache:invalidate" with the pattern
  And instance B receives the Pub/Sub message
  And instance B deletes the same keys from its L1 within 100 ms

Story 4 – Memory-Pressure Eviction

As a reliability engineer
I want the L1 cache to evict items when memory usage exceeds 80% of the configured limit
So that the container is never OOM-killed.

✅ Acceptance Criteria
Scenario: L1 exceeds memory threshold
  Given the configured L1 memory limit is 100 MB
  And current L1 usage reaches 82 MB
  When the memory-pressure monitor runs
  Then at least 1,000 low-priority keys are evicted
  And L1 usage drops below 80 MB
  And the metrics counter l1_evictions increments accordingly

Story 5 – Redis Outage Resilience

As an on-call SRE
I want the gateway to continue serving requests from L1 and fall back to the DB if Redis is unavailable
So that users experience degraded but functional service.

✅ Acceptance Criteria
Scenario: Redis GET timeout
  Given Redis is unreachable or times out
  And the requested key is not in L1
  When get_or_set("federation:roots") is called
  Then the cache manager logs "Redis cache error"
  And fetches the data from the database
  And returns the data to the caller
  And populates only L1 (skips L2)

Story 6 – Metrics & Observability

As a dashboard user
I want detailed cache metrics exposed to Prometheus
So that I can monitor hit ratios and latency per layer.

✅ Acceptance Criteria
Scenario: Metrics endpoint exposes cache stats
  When the Prometheus scraper calls /metrics
  Then the response contains:
    - cache_l1_hits_total
    - cache_l2_hits_total
    - cache_misses_total
    - cache_invalidation_total
    - cache_l1_memory_bytes
  And each metric has labels instance=<pod_name> and version=<git_sha>

πŸ–ΌοΈ Architecture

1 – Component & Data-flow

flowchart TD
    subgraph "Gateway Instance"
        API[Service / Handler] --> CM[MultiLayerCacheManager]
        CM --> L1[(L1 Memory Cache)]
        CM -- miss --> L2[(L2 Redis Cache)]
        L1 -. Pub/Sub invalidations .-> CM
    end

    L2 -->|miss| DB[(Primary Database)]
    CM -. "metrics ➜ Prometheus" .-> MON[Monitoring]

    classDef store fill:#ffd1dc,stroke:#333;
    class DB,L2,L1 store
  • A request hits the Service, which consults the CacheManager.
  • L1 (per-container) is checked first; on miss the manager tries L2 Redis; final fallback is the DB.
  • Redis Pub/Sub broadcasts invalidation patterns so every instance clears its own L1.
  • Cache metrics flow to Prometheus/Grafana.

2 – get_or_set() Sequence

sequenceDiagram
    participant SVC  as Service Code
    participant CM   as CacheManager
    participant L1   as Memory Cache
    participant L2   as Redis Cache
    participant DB   as Database

    SVC ->> CM: get_or_set(key)
    CM  ->> L1: lookup(key)
    alt L1 HIT
        L1 -->> CM: value
        CM -->> SVC: return value
    else L1 MISS
        CM ->> L2: GET key
        alt L2 HIT
            L2 -->> CM: value
            CM ->> L1: put key,value
            CM -->> SVC: return value
        else L2 MISS
            CM ->> DB: SELECT …
            DB -->> CM: value
            CM ->> L2: SETEX key,value,ttl
            CM ->> L1: put key,value
            CM -->> SVC: return value
        end
    end
  • Fast path – L1 hit: ~0.1 ms.
  • Typical path – L2 hit then populate L1: 2-5 ms.
  • Cold path – miss in both layers, query the DB, then fill both caches.

πŸ“ Enhanced Design Sketch

Dual-Layer Cache Architecture:

import asyncio
import json
import logging
from fnmatch import fnmatch
from typing import Any, Callable

from cachetools import TTLCache
from redis.asyncio import Redis

# settings (app configuration) and CacheMetrics are project-local imports.

logger = logging.getLogger(__name__)


class MultiLayerCacheManager:
    """L1 (Memory) + L2 (Redis) caching with intelligent invalidation"""

    def __init__(self):
        # L1: Fast local memory cache (per container)
        self.l1_cache = TTLCache(maxsize=10000, ttl=60)  # 10k items, 60s default TTL
        self.l1_lock = asyncio.Lock()

        # L2: Distributed Redis cache (shared across instances)
        self.redis = Redis.from_url(settings.redis_url)
        
        # Metrics tracking
        self.metrics = CacheMetrics()
        
        # Background tasks
        self._start_background_tasks()
    
    async def get_or_set(self, key: str, factory: Callable, ttl: int = 300) -> Any:
        """Multi-layer cache lookup with fallback chain"""
        
        # 🚀 L1 CHECK: Memory cache (fastest ~0.1ms)
        async with self.l1_lock:
            if key in self.l1_cache:
                self.metrics.l1_hits += 1
                logger.debug(f"L1 cache HIT: {key}")
                return self.l1_cache[key]
        
        # 🔄 L2 CHECK: Redis cache (fast ~1-5ms)
        try:
            cached = await self.redis.get(key)
            if cached:
                self.metrics.l2_hits += 1
                logger.debug(f"L2 cache HIT: {key}")
                data = json.loads(cached)
                
                # Populate L1 cache for next request
                async with self.l1_lock:
                    self.l1_cache[key] = data
                    
                return data
        except Exception as e:
            logger.warning(f"Redis cache error for {key}: {e}")
        
        # 💾 CACHE MISS: Fetch from database (~50-200ms)
        self.metrics.misses += 1
        logger.debug(f"Cache MISS: {key} - fetching from source")
        
        data = await factory()
        
        # Populate both cache layers
        await self._populate_caches(key, data, ttl)
        
        return data
    
    async def _populate_caches(self, key: str, data: Any, ttl: int):
        """Populate both L1 and L2 caches"""
        # L1: Memory cache. Note: cachetools' TTLCache applies the single TTL
        # configured at construction time; honouring a shorter per-key TTL
        # (min(ttl, settings.cache_l1_max_ttl)) would need a custom cache structure.
        async with self.l1_lock:
            self.l1_cache[key] = data
        
        # L2: Redis cache (full TTL for persistence)
        try:
            await self.redis.setex(key, ttl, json.dumps(data, default=str))
        except Exception as e:
            logger.warning(f"Failed to populate Redis cache for {key}: {e}")
    
    async def invalidate(self, pattern: str):
        """Invalidate across both cache layers"""
        # L1: Clear matching keys from memory
        async with self.l1_lock:
            keys_to_remove = [k for k in self.l1_cache.keys() if fnmatch(k, pattern)]
            for key in keys_to_remove:
                del self.l1_cache[key]
                logger.debug(f"L1 invalidated: {key}")
        
        # L2: Clear from Redis and notify other instances.
        # NOTE: KEYS blocks Redis on large keyspaces; SCAN would be preferable in production.
        redis_keys = await self.redis.keys(pattern)
        if redis_keys:
            await self.redis.delete(*redis_keys)
            # Publish invalidation event to other instances
            await self.redis.publish("cache:invalidate", pattern)
            logger.debug(f"L2 invalidated: {pattern} ({len(redis_keys)} keys)")
    
    def _start_background_tasks(self):
        """Start background maintenance tasks. Must run inside a live event loop
        (e.g. application startup), since asyncio.create_task requires one."""
        asyncio.create_task(self._listen_for_invalidations())
        asyncio.create_task(self._memory_pressure_monitor())
        asyncio.create_task(self._cache_stats_reporter())
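
Illustrative call site for the manager above (a minimal sketch; db_list_tools and the key format are placeholders for whatever query the real service performs):

cache = MultiLayerCacheManager()

async def list_tools(db, include_inactive: bool = False):
    key = f"tools:list:{include_inactive}"

    async def factory():
        # Stand-in for the real database query the service would run.
        return await db_list_tools(db, include_inactive)

    # L1 -> L2 -> factory() fallback chain; both layers are populated on a miss.
    return await cache.get_or_set(key, factory, ttl=300)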

Performance Characteristics:

Request Path                     | Latency   | Use Case
─────────────────────────────────┼───────────┼──────────────────────────
L1 Memory Hit                    | ~0.1ms    | Frequently accessed data
L2 Redis Hit → L1 Population     | ~2-5ms    | Recently accessed data
Cache Miss → DB → L1+L2 Populate | ~50-200ms | First access or expired

🔧 Enhanced Configuration

class CacheConfig:
    # === L1 (Memory) Cache Settings ===
    l1_cache_enabled: bool = True           # Enable memory caching
    l1_cache_size: int = 10000             # Max items in memory per instance
    l1_cache_ttl: int = 60                 # Default L1 TTL (seconds)
    l1_max_ttl: int = 300                  # Cap on per-entry L1 TTL (seconds)
    l1_memory_limit_mb: int = 100          # Max memory usage for L1 cache
    
    # === L2 (Redis) Cache Settings ===
    l2_cache_enabled: bool = True           # Enable Redis caching
    l2_connection_pool_size: int = 20       # Redis connection pool
    l2_timeout_ms: int = 100               # Redis operation timeout
    
    # === Entity-specific TTLs ===
    tools_cache_ttl: int = 300             # 5 minutes (L2 TTL)
    tools_cache_l1_ttl: int = 60           # 1 minute (L1 TTL)
    
    servers_cache_ttl: int = 180           # 3 minutes (L2 TTL) 
    servers_cache_l1_ttl: int = 45         # 45 seconds (L1 TTL)
    
    federation_cache_ttl: int = 300        # 5 minutes (L2 TTL)
    federation_cache_l1_ttl: int = 60      # 1 minute (L1 TTL)
    
    # === Cache Behavior ===
    cache_on_write: bool = True            # Populate on creates/updates
    l1_l2_sync_enabled: bool = True        # Sync L1 across instances via Redis
    memory_pressure_threshold: float = 0.8 # Evict L1 at 80% memory limit
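
If this config is wired through pydantic-settings (an assumption; the gateway's existing settings mechanism may differ), each value could be overridden per deployment via environment variables:

from pydantic_settings import BaseSettings, SettingsConfigDict

class CacheSettings(BaseSettings):
    """Subset of CacheConfig, loadable from the environment (illustrative only)."""
    model_config = SettingsConfigDict(env_prefix="CACHE_")  # e.g. CACHE_L1_CACHE_SIZE=20000

    l1_cache_enabled: bool = True
    l1_cache_size: int = 10000
    l1_memory_limit_mb: int = 100
    l2_cache_enabled: bool = True
    l2_timeout_ms: int = 100
    memory_pressure_threshold: float = 0.8

settings = CacheSettings()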

🚀 Updated Performance Expectations

Before Caching:

GET /v1/tools (database query):     150ms avg
GET /v1/servers (with joins):       250ms avg  
Federation call:                    400ms avg

After L2 (Redis) Only:

GET /v1/tools (Redis hit):          5ms avg
GET /v1/servers (Redis hit):        7ms avg
Federation call (cached):           3ms avg

After L1+L2 (Memory+Redis):

GET /v1/tools (L1 hit):             0.2ms avg  ⚡
GET /v1/servers (L1 hit):           0.3ms avg  ⚡
Federation call (L1 hit):           0.1ms avg  ⚡

Cache Hit Distribution (after warmup):

  • L1 Memory hits: ~85% (most frequent data)
  • L2 Redis hits: ~12% (less frequent but cached)
  • Cache misses: ~3% (new/expired data)
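
Plugging that assumed distribution into the per-layer latencies above gives a rough expected average per request (illustrative arithmetic, not a measurement):

# Back-of-envelope expected latency from the assumed hit distribution.
l1_ms, l2_ms, db_ms = 0.2, 3.5, 150.0
p_l1, p_l2, p_miss = 0.85, 0.12, 0.03

expected_ms = p_l1 * l1_ms + p_l2 * l2_ms + p_miss * db_ms
print(f"{expected_ms:.2f} ms")  # 0.17 + 0.42 + 4.50 ≈ 5.09 ms on average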

🧭 Updated Tasks

  • Phase 1: Enhanced Cache Infrastructure

    • Implement MultiLayerCacheManager with both L1 and L2

      # mcpgateway/cache/multilayer.py (public interface outline)
      class MultiLayerCacheManager:
          def __init__(self, l1_config: L1Config, l2_config: L2Config): ...
          async def get_or_set(self, key, factory, l1_ttl, l2_ttl): ...
          async def invalidate_pattern(self, pattern): ...
          async def get_cache_stats(self) -> CacheStats: ...
    • Add memory monitoring and pressure management

      async def _memory_pressure_monitor(self):
          """Monitor L1 memory usage and evict entries when the threshold is crossed"""
          while True:
              usage_mb = self._get_l1_memory_usage()  # current L1 footprint in MB
              limit_mb = self.config.l1_memory_limit_mb
              # Compare the usage ratio against the fractional threshold (e.g. 0.8)
              if usage_mb / limit_mb > self.config.memory_pressure_threshold:
                  await self._evict_l1_lru_items(count=1000)
              await asyncio.sleep(30)
    • Implement cross-instance L1 invalidation via Redis pub/sub

      async def _listen_for_invalidations(self):
          """Listen for invalidation events from other instances"""
          pubsub = self.redis.pubsub()
          await pubsub.subscribe("cache:invalidate")
          
          async for message in pubsub.listen():
              if message["type"] == "message":
                  pattern = message["data"].decode()
                  await self._invalidate_l1_pattern(pattern)
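
      The _invalidate_l1_pattern helper called above is not shown in this issue; a minimal version would mirror the L1 half of invalidate() without re-publishing:

      async def _invalidate_l1_pattern(self, pattern: str):
          """Drop every L1 key matching a broadcast pattern (no Redis delete, no re-publish)."""
          async with self.l1_lock:
              for key in [k for k in self.l1_cache.keys() if fnmatch(k, pattern)]:
                  del self.l1_cache[key]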
  • Phase 2: Service Integration with Dual TTLs

    • Update all services with L1+L2 aware caching

      @cached(
          key="mcp:tools:list:{include_inactive}",
          l1_ttl=60,    # 1 minute in memory
          l2_ttl=300    # 5 minutes in Redis
      )
      async def list_tools(self, db: Session, include_inactive: bool = False):
          return await self._fetch_tools_from_db(db, include_inactive)
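
      The @cached decorator used above is not defined anywhere in this issue; a minimal sketch, assuming a module-level cache_manager and that the key template is filled from the call's keyword arguments:

      import functools

      def cached(key: str, l1_ttl: int = 60, l2_ttl: int = 300):
          """Route an async method's result through the L1/L2 manager."""
          def decorator(func):
              @functools.wraps(func)
              async def wrapper(self, *args, **kwargs):
                  cache_key = key.format(**kwargs)  # e.g. include_inactive=False
                  return await cache_manager.get_or_set(
                      cache_key,
                      lambda: func(self, *args, **kwargs),
                      ttl=l2_ttl,  # per-key l1_ttl needs the custom L1 noted earlier
                  )
              return wrapper
          return decorator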
    • Implement smart cache warming for frequently accessed data

      async def warm_l1_cache(self):
          """Pre-populate L1 cache with hot data on startup"""
          hot_keys = [
              "mcp:tools:list:false",
              "mcp:servers:list:false", 
              "mcp:metrics:aggregated:1h"
          ]
          for key in hot_keys:
              # Bind key via a default argument so each lambda carries its own key
              # even if invocation were deferred (defensive against late binding).
              await self.cache_manager.get_or_set(key, lambda k=key: self._fetch_hot_data(k))
  • Phase 3: Memory Management & Optimization

    • Implement intelligent L1 eviction policies

      class L1EvictionManager:
          def __init__(self, l1_cache):
              self.l1_cache = l1_cache                 # the TTLCache being managed
              self.access_tracker = defaultdict(int)   # key -> access count
              self.size_tracker = defaultdict(int)     # key -> approximate size in bytes

          async def evict_candidates(self, target_count: int) -> List[str]:
              """Return keys to evict based on access frequency and entry size"""
              # Prioritize large, infrequently accessed items
              candidates = sorted(
                  self.l1_cache.keys(),
                  key=lambda k: (
                      self.access_tracker[k],   # Less frequently accessed first (ascending)
                      -self.size_tracker[k]     # Larger items first (descending)
                  )
              )
              return candidates[:target_count]
    • Add L1 cache size monitoring and alerts

    • Implement configurable L1 cache partitioning by entity type

  • Phase 4: Enhanced Monitoring

    • Comprehensive cache metrics with layer breakdown

      class DetailedCacheMetrics:
          # L1 metrics
          l1_hits: int = 0
          l1_misses: int = 0
          l1_evictions: int = 0
          l1_memory_usage_mb: float = 0
          
          # L2 metrics  
          l2_hits: int = 0
          l2_misses: int = 0
          l2_network_errors: int = 0
          
          # Performance metrics
          avg_l1_response_time_ms: float = 0.2
          avg_l2_response_time_ms: float = 3.5
          avg_db_response_time_ms: float = 150.0
          
          def overall_hit_ratio(self) -> float:
              total = self.l1_hits + self.l2_hits + self.l2_misses
              return (self.l1_hits + self.l2_hits) / total if total > 0 else 0.0
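
      A possible exposition of these counters with the prometheus_client library, matching the metric names required in Story 6 (registry wiring and label sourcing here are assumptions, not existing gateway code):

      from prometheus_client import Counter, Gauge

      LABELS = ["instance", "version"]  # pod name and git SHA, filled in at startup

      # prometheus_client appends the _total suffix to Counters in the exposition
      # format, yielding the names listed in Story 6 (e.g. cache_l1_hits_total).
      CACHE_L1_HITS = Counter("cache_l1_hits", "L1 memory cache hits", LABELS)
      CACHE_L2_HITS = Counter("cache_l2_hits", "L2 Redis cache hits", LABELS)
      CACHE_MISSES = Counter("cache_misses", "Full cache misses served from the DB", LABELS)
      CACHE_INVALIDATIONS = Counter("cache_invalidation", "Cache invalidation events", LABELS)
      CACHE_L1_MEMORY = Gauge("cache_l1_memory_bytes", "Approximate L1 memory usage", LABELS)

      def record_l1_hit(instance: str, version: str) -> None:
          """Called from get_or_set on an L1 hit."""
          CACHE_L1_HITS.labels(instance=instance, version=version).inc()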
    • Add cache layer visualization to Admin UI

    • Implement cache performance alerting


🎯 Updated Success Criteria

  • L1 cache hit ratio > 80% for frequently accessed endpoints after warmup
  • Combined L1+L2 hit ratio > 95% overall
  • Sub-millisecond response times for L1 cache hits
  • L1 memory usage < 100MB per container instance
  • Cross-instance cache invalidation < 100ms propagation time
  • Cache system survives Redis outages (L1 continues working)
  • Zero cache inconsistency during normal operations

🧩 Additional Notes

  • 🚀 Performance: L1 cache provides 100-500x faster access than Redis for hot data
  • 💾 Memory Efficiency: Smart eviction prevents memory bloat while maximizing hit rates
  • 🔄 High Availability: L1 cache continues working even if Redis is down
  • 📊 Scalability: Each container instance has its own L1 cache + shared L2 Redis
  • 🔧 Operational: Can disable L1 or L2 independently for debugging
  • 📈 Monitoring: Detailed metrics for both cache layers and performance optimization
  • ⚡ Zero Network: L1 hits require zero network calls (Redis or database)

Metadata

Labels: enhancement (New feature or request), performance (Performance related items), python (Python / backend development, FastAPI), triage (Issues / Features awaiting triage)