
2643 - add Redis-backed metrics cache for multi-instance deployments #2857

Draft

gcgoncalves wants to merge 5 commits into main from 2643-metrics-aggregation

Conversation

@gcgoncalves
Collaborator

🐛 Bug-fix PR

📌 Summary

Implements a centralized metrics aggregation cache using Redis to solve the metric-fluctuation issue in multi-instance deployments behind load balancers. The change is gated behind the METRICS_CACHE_USE_REDIS feature flag.

The downside of this approach is that metrics in the UI only update after the cache TTL has expired. The TTL is set via the METRICS_CACHE_TTL_SECONDS environment setting.
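Both knobs are plain environment variables. A minimal illustration of how a deployment might read them (not the project's settings code, which lives in mcpgateway/config.py):

```python
import os

# Defaults shown here match this PR: Redis cache enabled, 60-second TTL.
use_redis = os.getenv("METRICS_CACHE_USE_REDIS", "true").lower() == "true"
ttl_seconds = int(os.getenv("METRICS_CACHE_TTL_SECONDS", "60"))
```

Lowering the TTL makes the UI fresher at the cost of more aggregation queries against PostgreSQL.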

Closes #2643

🔁 Reproduction Steps

  1. Navigate to http://localhost:8080/admin/#metrics
  2. Note the "Total Executions" value
  3. Refresh the page (F5 or Ctrl+R)
  4. Observe the value changes significantly (up or down)
  5. Repeat several times; the values appear to change at random

See #2643 for the original report.

🐞 Root Cause

  • Each gateway instance had its own in-memory cache of PostgreSQL aggregates (sketched after this list)
  • Load balancer routing to different instances showed different cached values
  • Metrics fluctuated non-monotonically on page refresh (e.g., 32075 → 21858)
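A simplified sketch of the old behavior (illustrative names, not the actual gateway code) shows why two instances disagree:

```python
import time

class LocalMetricsCache:
    """Per-process cache of an aggregate, as each instance used to keep."""

    def __init__(self, ttl_seconds: float = 60.0):
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get_total_executions(self, query_db) -> int:
        now = time.time()
        if self._value is None or now - self._fetched_at > self._ttl:
            # Each instance runs the aggregate query on its own schedule...
            self._value = query_db()
            self._fetched_at = now
        return self._value

# ...so instance A and instance B hold snapshots taken at different moments,
# and a page refresh routed to the other instance can show an older, smaller
# total -- the non-monotonic jumps described above.
```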

💡 Fix Description

  • Shared Redis cache for aggregated query results across all instances
  • Cache invalidation after buffer flush ensures consistency
  • Automatic fallback to the local cache if Redis is unavailable (see the sketch after this list)
  • Default enabled (METRICS_CACHE_USE_REDIS=true)
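A rough sketch of the shared-cache path with local fallback, using hypothetical key and function names (the real code is in mcpgateway/cache/metrics_cache.py and additionally records Prometheus hit/miss counters):

```python
import json
import time

import redis.asyncio as redis           # pip install redis
from redis.exceptions import RedisError

CACHE_KEY = "metrics:aggregates"         # hypothetical key name
TTL_SECONDS = 60

_local_value = None                      # per-process fallback cache
_local_fetched_at = 0.0


async def get_aggregates(client: redis.Redis, compute_from_db):
    """Return aggregates from the shared Redis cache, recomputing on a miss."""
    global _local_value, _local_fetched_at
    try:
        raw = await client.get(CACHE_KEY)
        if raw is not None:
            return json.loads(raw)                            # shared hit
        value = compute_from_db()                             # shared miss
        await client.set(CACHE_KEY, json.dumps(value), ex=TTL_SECONDS)
        return value
    except RedisError:
        # Redis unreachable: fall back to the per-process cache.
        now = time.time()
        if _local_value is None or now - _local_fetched_at > TTL_SECONDS:
            _local_value = compute_from_db()
            _local_fetched_at = now
        return _local_value


async def invalidate_after_flush(client: redis.Redis):
    """Drop the shared key after the metrics buffer flushes to PostgreSQL."""
    try:
        await client.delete(CACHE_KEY)
    except RedisError:
        pass
```

Because every instance reads the same key, a page refresh routed to a different instance returns the same number until the TTL expires or the key is invalidated by a flush.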

🧪 Verification

| Check | Command | Status |
| --- | --- | --- |
| Lint suite | `make lint` | |
| Unit tests | `make test` | |
| Coverage ≥ 80 % | `make coverage` | |
| Manual regression no longer fails | steps / screenshots | |

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • No secrets/credentials committed

@crivetimihai added this to the Release 1.0.0-GA milestone on Feb 12, 2026
@gcgoncalves force-pushed the 2643-metrics-aggregation branch 2 times, most recently from 3488f7f to 7f4f2ac on February 12, 2026 at 11:11
@crivetimihai
Member

Thanks @gcgoncalves. Solid implementation — good use of dual-write (Redis + local) with automatic fallback, Prometheus hit/miss counters, and one-time warning for sync callers when Redis is active.

A few observations from the diff:

  1. Thread safety: _total_hit_count, _redis_hit_count, etc. are incremented outside the lock in get_async. These are stats counters so races aren't critical, but consider threading.Lock or atomic counters if precision matters.
  2. JSON serialization: Redis values go through json.loads/json.dumps, while the local cache stores Python dicts directly. Ensure the serialized forms stay consistent (e.g., datetime values won't round-trip through JSON; a quick illustration follows this list).
  3. Docs quality: The Compose and Kubernetes deployment docs are thorough, though the Compose example includes PLATFORM_ADMIN_PASSWORD: "changeme" and JWT_SECRET_KEY: "your-secret-key" — consider adding a comment warning not to use these in production.
  4. Feature flag fallback: When METRICS_CACHE_USE_REDIS=true but Redis is unreachable at startup, does it fall back gracefully to local, or fail? The code shows redis_client is not None and REDIS_AVAILABLE — looks like it handles the import-missing case but not connection-refused at init.
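To make point 2 concrete (field names below are hypothetical, just to show the shape problem):

```python
import json
from datetime import datetime, timezone

entry = {
    "total_executions": 32075,
    "last_execution": datetime(2026, 2, 12, 11, 11, tzinfo=timezone.utc),
}

# json.dumps(entry) raises:
#   TypeError: Object of type datetime is not JSON serializable

# One conventional fix is to serialize datetimes as ISO 8601 strings on the
# way into Redis and parse them back on the way out, so Redis hits and local
# hits return identically shaped values.
def dumps(value: dict) -> str:
    return json.dumps(
        value,
        default=lambda o: o.isoformat() if isinstance(o, datetime) else str(o),
    )

def loads(raw: str) -> dict:
    value = json.loads(raw)
    if isinstance(value.get("last_execution"), str):
        value["last_execution"] = datetime.fromisoformat(value["last_execution"])
    return value

assert loads(dumps(entry)) == entry
```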

Implements centralized metrics aggregation cache using Redis to solve
metric fluctuation issue in multi-instance deployments behind load balancers.

Problem:
- Each gateway instance had its own in-memory cache of PostgreSQL aggregates
- Load balancer routing to different instances showed different cached values
- Metrics fluctuated non-monotonically on page refresh (e.g., 32075 → 21858)

Solution:
- Shared Redis cache for aggregated query results across all instances
- Cache invalidation after buffer flush ensures consistency
- Automatic fallback to local cache if Redis unavailable
- Default enabled (METRICS_CACHE_USE_REDIS=true)

Changes:
- mcpgateway/cache/metrics_cache.py: Add Redis backend with async methods,
  Prometheus metrics, automatic fallback to local cache
- mcpgateway/config.py: Add metrics_cache_use_redis setting (default: true)
- mcpgateway/services/metrics_buffer_service.py: Invalidate cache after flush
- docs/docs/manage/configuration.md: Document METRICS_CACHE_USE_REDIS
- docs/docs/deployment/container.md: Add Docker Compose Redis example
- docs/docs/deployment/kubernetes.md: Add multi-instance deployment guide
- docs/docs/architecture/performance-architecture.md: Add to caching architecture

Technical Details:
- PostgreSQL aggregation (func.sum/count) was already correct
- Issue was per-instance caching, not aggregation
- Redis provides shared invalidation signal for all instances
- Cache TTL: 60 seconds (configurable via METRICS_CACHE_TTL_SECONDS)

Closes #2643

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
@gcgoncalves force-pushed the 2643-metrics-aggregation branch 2 times, most recently from 32b0078 to 3488f7f on February 13, 2026 at 16:10
@gcgoncalves force-pushed the 2643-metrics-aggregation branch from 3488f7f to 31d098d on February 13, 2026 at 17:18
@crivetimihai self-assigned this on Feb 15, 2026

Development

Successfully merging this pull request may close this issue:

[BUG][UI]: Total Executions metric fluctuates randomly on page refresh
