
2643 - add Redis-backed metrics cache for multi-instance deployments #2857

Draft

gcgoncalves wants to merge 5 commits into main from 2643-metrics-aggregation

Conversation

@gcgoncalves
Collaborator

🐛 Bug-fix PR

📌 Summary

Implements a centralized metrics aggregation cache using Redis to solve the metric-fluctuation issue in multi-instance deployments behind load balancers. The change is gated behind the METRICS_CACHE_USE_REDIS feature flag.

The downside of this approach is that metrics in the UI only update after the cache TTL has expired. The TTL is set via the METRICS_CACHE_TTL_SECONDS environment setting.
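Both knobs are plain environment variables. A minimal illustration of how a deployment might read them (not the project's settings code, which lives in mcpgateway/config.py):

```python
import os

# Defaults shown here match this PR: Redis cache enabled, 60-second TTL.
use_redis = os.getenv("METRICS_CACHE_USE_REDIS", "true").lower() == "true"
ttl_seconds = int(os.getenv("METRICS_CACHE_TTL_SECONDS", "60"))
```

Lowering the TTL makes the UI fresher at the cost of more aggregation queries against PostgreSQL.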

Closes #2643

🔁 Reproduction Steps

  1. Navigate to http://localhost:8080/admin/#metrics
  2. Note the "Total Executions" value
  3. Refresh the page (F5 or Ctrl+R)
  4. Observe the value changes significantly (up or down)
  5. Repeat several times; the values appear to change at random

See #2643 for the original report.

🐞 Root Cause

  • Each gateway instance had its own in-memory cache of PostgreSQL aggregates (sketched after this list)
  • Load balancer routing to different instances showed different cached values
  • Metrics fluctuated non-monotonically on page refresh (e.g., 32075 → 21858)
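A simplified sketch of the old behavior (illustrative names, not the actual gateway code) shows why two instances disagree:

```python
import time

class LocalMetricsCache:
    """Per-process cache of an aggregate, as each instance used to keep."""

    def __init__(self, ttl_seconds: float = 60.0):
        self._ttl = ttl_seconds
        self._value = None
        self._fetched_at = 0.0

    def get_total_executions(self, query_db) -> int:
        now = time.time()
        if self._value is None or now - self._fetched_at > self._ttl:
            # Each instance runs the aggregate query on its own schedule...
            self._value = query_db()
            self._fetched_at = now
        return self._value

# ...so instance A and instance B hold snapshots taken at different moments,
# and a page refresh routed to the other instance can show an older, smaller
# total -- the non-monotonic jumps described above.
```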

💡 Fix Description

  • Shared Redis cache for aggregated query results across all instances
  • Cache invalidation after buffer flush ensures consistency
  • Automatic fallback to the local cache if Redis is unavailable (see the sketch after this list)
  • Default enabled (METRICS_CACHE_USE_REDIS=true)
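A rough sketch of the shared-cache path with local fallback, using hypothetical key and function names (the real code is in mcpgateway/cache/metrics_cache.py and additionally records Prometheus hit/miss counters):

```python
import json
import time

import redis.asyncio as redis           # pip install redis
from redis.exceptions import RedisError

CACHE_KEY = "metrics:aggregates"         # hypothetical key name
TTL_SECONDS = 60

_local_value = None                      # per-process fallback cache
_local_fetched_at = 0.0


async def get_aggregates(client: redis.Redis, compute_from_db):
    """Return aggregates from the shared Redis cache, recomputing on a miss."""
    global _local_value, _local_fetched_at
    try:
        raw = await client.get(CACHE_KEY)
        if raw is not None:
            return json.loads(raw)                            # shared hit
        value = compute_from_db()                             # shared miss
        await client.set(CACHE_KEY, json.dumps(value), ex=TTL_SECONDS)
        return value
    except RedisError:
        # Redis unreachable: fall back to the per-process cache.
        now = time.time()
        if _local_value is None or now - _local_fetched_at > TTL_SECONDS:
            _local_value = compute_from_db()
            _local_fetched_at = now
        return _local_value


async def invalidate_after_flush(client: redis.Redis):
    """Drop the shared key after the metrics buffer flushes to PostgreSQL."""
    try:
        await client.delete(CACHE_KEY)
    except RedisError:
        pass
```

Because every instance reads the same key, a page refresh routed to a different instance returns the same number until the TTL expires or the key is invalidated by a flush.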

🧪 Verification

| Check | Command | Status |
| --- | --- | --- |
| Lint suite | `make lint` | |
| Unit tests | `make test` | |
| Coverage ≥ 80 % | `make coverage` | |
| Manual regression no longer fails | steps / screenshots | |

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • No secrets/credentials committed

@crivetimihai added this to the Release 1.0.0-GA milestone on Feb 12, 2026
@gcgoncalves force-pushed the 2643-metrics-aggregation branch 2 times, most recently from 3488f7f to 7f4f2ac on February 12, 2026 at 11:11
@crivetimihai
Member

Thanks @gcgoncalves. Solid implementation — good use of dual-write (Redis + local) with automatic fallback, Prometheus hit/miss counters, and one-time warning for sync callers when Redis is active.

A few observations from the diff:

  1. Thread safety: _total_hit_count, _redis_hit_count, etc. are incremented outside the lock in get_async. These are stats counters so races aren't critical, but consider threading.Lock or atomic counters if precision matters.
  2. JSON serialization: Redis values go through json.loads/json.dumps, while the local cache stores Python dicts directly. Ensure the serialized forms stay consistent (e.g., datetime values won't round-trip through JSON; a quick illustration follows this list).
  3. Docs quality: The Compose and Kubernetes deployment docs are thorough, though the Compose example includes PLATFORM_ADMIN_PASSWORD: "changeme" and JWT_SECRET_KEY: "your-secret-key" — consider adding a comment warning not to use these in production.
  4. Feature flag fallback: When METRICS_CACHE_USE_REDIS=true but Redis is unreachable at startup, does it fall back gracefully to local, or fail? The code shows redis_client is not None and REDIS_AVAILABLE — looks like it handles the import-missing case but not connection-refused at init.
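To make point 2 concrete (field names below are hypothetical, just to show the shape problem):

```python
import json
from datetime import datetime, timezone

entry = {
    "total_executions": 32075,
    "last_execution": datetime(2026, 2, 12, 11, 11, tzinfo=timezone.utc),
}

# json.dumps(entry) raises:
#   TypeError: Object of type datetime is not JSON serializable

# One conventional fix is to serialize datetimes as ISO 8601 strings on the
# way into Redis and parse them back on the way out, so Redis hits and local
# hits return identically shaped values.
def dumps(value: dict) -> str:
    return json.dumps(
        value,
        default=lambda o: o.isoformat() if isinstance(o, datetime) else str(o),
    )

def loads(raw: str) -> dict:
    value = json.loads(raw)
    if isinstance(value.get("last_execution"), str):
        value["last_execution"] = datetime.fromisoformat(value["last_execution"])
    return value

assert loads(dumps(entry)) == entry
```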

Implements centralized metrics aggregation cache using Redis to solve
metric fluctuation issue in multi-instance deployments behind load balancers.

Problem:
- Each gateway instance had its own in-memory cache of PostgreSQL aggregates
- Load balancer routing to different instances showed different cached values
- Metrics fluctuated non-monotonically on page refresh (e.g., 32075 → 21858)

Solution:
- Shared Redis cache for aggregated query results across all instances
- Cache invalidation after buffer flush ensures consistency
- Automatic fallback to local cache if Redis unavailable
- Default enabled (METRICS_CACHE_USE_REDIS=true)

Changes:
- mcpgateway/cache/metrics_cache.py: Add Redis backend with async methods,
  Prometheus metrics, automatic fallback to local cache
- mcpgateway/config.py: Add metrics_cache_use_redis setting (default: true)
- mcpgateway/services/metrics_buffer_service.py: Invalidate cache after flush
- docs/docs/manage/configuration.md: Document METRICS_CACHE_USE_REDIS
- docs/docs/deployment/container.md: Add Docker Compose Redis example
- docs/docs/deployment/kubernetes.md: Add multi-instance deployment guide
- docs/docs/architecture/performance-architecture.md: Add to caching architecture

Technical Details:
- PostgreSQL aggregation (func.sum/count) was already correct
- Issue was per-instance caching, not aggregation
- Redis provides shared invalidation signal for all instances
- Cache TTL: 60 seconds (configurable via METRICS_CACHE_TTL_SECONDS)

Closes #2643

Signed-off-by: Gabriel Costa <gabrielcg@proton.me>
@gcgoncalves force-pushed the 2643-metrics-aggregation branch 2 times, most recently from 32b0078 to 3488f7f on February 13, 2026 at 16:10
@gcgoncalves force-pushed the 2643-metrics-aggregation branch from 3488f7f to 31d098d on February 13, 2026 at 17:18
@crivetimihai self-assigned this on Feb 15, 2026

Development

Successfully merging this pull request may close this issue:

[BUG][UI]: Total Executions metric fluctuates randomly on page refresh
