Explainer: Memory Cache Metrics API with Eviction Tracking #32

@rjmurillo

Description

Introduction/Overview

The Memory Cache Metrics API with Eviction Tracking addresses performance problems in containerized environments, where memory pressure can cause cache thrashing and degrade application performance. This feature modernizes cache usage patterns by providing visibility into cache behavior and enabling proactive handling of memory pressure.

The primary goal is to prevent performance degradation by tracking cache evictions, providing comprehensive metrics, and supporting both global default caches and component-specific caches in a backward-compatible manner.

Goals

  1. Provide eviction visibility: Track and report cache eviction counts with an acceptable overhead of roughly 100 ns per cache operation
  2. Enable proactive monitoring: Support comprehensive cache metrics (hits, misses, eviction reasons, memory usage)
  3. Support modern deployment patterns: Design for container-friendly memory management with dynamic sizing capabilities
  4. Maintain backward compatibility: Ensure existing applications using MemoryCacheStatistics continue to work without modification
  5. Enable external integration: Support export to monitoring systems (Prometheus, Application Insights, OpenTelemetry)
  6. Provide flexible registration: Support both automatic DI-based registration and explicit component-specific cache registration

Non-Goals (Out of Scope)

  • Automatic cache size adjustment based on memory pressure (future enhancement)
  • Custom eviction policy implementation
  • Cache data persistence or recovery mechanisms
  • Real-time alerting or notification systems
  • Performance optimization beyond the ~100ns overhead target
  • Migration tools for existing custom monitoring solutions
  • Distributed, nested, or hierarchical caching; use HybridCache for those scenarios
  • Recommendations for cache size limits based on container memory constraints

User Stories

  1. As a service developer, I want to track eviction counts in my application's default memory cache so that I can identify when memory pressure is causing performance issues.
  2. As a library author, I want to register my component's cache with a metrics system so that service owners can monitor my library's cache behavior alongside their application caches.
  3. As a DevOps engineer, I want to export cache metrics to Prometheus so that I can create dashboards and alerts for cache performance in containerized environments.
  4. As a performance engineer, I want to distinguish between different eviction reasons (memory pressure vs expiration) so that I can identify true performance problems versus normal cache operation.
  5. As an application architect, I want to configure cache metrics collection with different sampling rates so that I can balance monitoring granularity with performance overhead.

Functional Requirements

  1. The system must extend MemoryCacheStatistics to include TotalEvictedEntries property without breaking existing applications.
  2. The system must provide a MemoryCacheMetrics service that can register and track multiple named caches through dependency injection (see the interface sketch after this list).
  3. The system must support eviction tracking with configurable overhead, allowing sampling rates from real-time collection to periodic collection at 5-30 second intervals.
  4. The system must distinguish between eviction reasons including memory pressure, expiration, and manual removal.
  5. The system must prevent duplicate cache registration by maintaining weak references to registered cache instances.
  6. The system must handle naming conflicts by either throwing exceptions for duplicates or using automatic resolution strategies.
  7. The system must integrate with OpenTelemetry/IMeterFactory to enable export to external monitoring systems.
  8. The system must provide extension methods for easy service registration in ASP.NET Core applications.
  9. The system must implement circuit breaker functionality to reduce metrics collection if overhead exceeds configurable thresholds.
  10. The system must support both opt-in statistics tracking and automatic discovery of DI-registered caches.
  11. The system must provide comprehensive metrics including hit/miss ratios, cache size, item lifecycle data, and operation latency.
  12. The system must use weak references to prevent memory leaks when clients forget to unregister caches.
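
The sketch below illustrates what the proposed contract and an opt-in usage pattern might look like. IMemoryCacheMetrics and its members are taken from the requirements above but the exact shape is illustrative only; CatalogLookup and the cache name are hypothetical examples, while MemoryCache, MemoryCacheOptions.TrackStatistics, and MemoryCacheStatistics are existing .NET types.

```csharp
using Microsoft.Extensions.Caching.Memory;

// Proposed contract (names from this explainer; final API may differ):
// a central service that tracks multiple named caches.
public interface IMemoryCacheMetrics
{
    // Registers a cache under a unique name (requirements 2, 5, 6).
    void RegisterCache(string name, IMemoryCache cache);

    // Returns a statistics snapshot for a registered cache, or null if the
    // cache was collected or statistics tracking is disabled.
    MemoryCacheStatistics? GetStatistics(string name);
}

// Hypothetical library component registering its private cache (user story 2).
public sealed class CatalogLookup
{
    private readonly MemoryCache _cache = new(new MemoryCacheOptions
    {
        TrackStatistics = true // opt-in statistics tracking (requirement 10)
    });

    public CatalogLookup(IMemoryCacheMetrics metrics)
    {
        metrics.RegisterCache("MyLibrary.CatalogLookup", _cache);
    }
}
```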

Design Considerations

  • API Surface: Extend existing MemoryCacheStatistics with nullable TotalEvictedEntries to maintain backward compatibility (sketched after this list)
  • Registration Pattern: Use explicit registration for component-specific caches with potential future automatic discovery
  • Configuration Tiers: Provide no-config defaults, simple predefined profiles, and advanced fine-grained control
  • Memory Safety: Implement weak reference patterns to prevent memory leaks from unregistered caches
  • Performance: Design for minimal overhead with configurable sampling and circuit breaker patterns
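
As a rough illustration of the backward-compatible surface, the snapshot below pairs the four properties MemoryCacheStatistics exposes today with the proposed nullable eviction counter. ExtendedMemoryCacheStatistics is a hypothetical name used only for this sketch.

```csharp
// Hypothetical snapshot shape: the existing MemoryCacheStatistics properties
// plus a nullable eviction counter that stays null when eviction tracking is
// not enabled, so existing callers are unaffected.
public sealed record ExtendedMemoryCacheStatistics(
    long CurrentEntryCount,
    long? CurrentEstimatedSize,
    long TotalHits,
    long TotalMisses,
    long? TotalEvictedEntries);
```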

Technical Considerations

  • Integration with existing DI container: Leverage IMeterFactory for OpenTelemetry compatibility (see the sketch after this list)
  • Thread safety: Ensure metrics collection is thread-safe for high-concurrency scenarios
  • Weak reference management: Implement proper cleanup of disposed cache references
  • Sampling strategies: Support configurable sampling rates to balance accuracy with performance
  • Export mechanisms: Design pluggable exporters for different monitoring backends
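
The sketch below shows one way the IMeterFactory integration could surface eviction counts so that OpenTelemetry (or any metrics listener) can export them. The meter and instrument names are placeholders, not a finalized naming scheme; IMeterFactory, Meter, and Counter<long> are existing System.Diagnostics.Metrics types.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Minimal sketch: publish eviction counts as a counter with cache-name and
// eviction-reason tags, so exporters can split memory-pressure evictions
// from normal expiration.
public sealed class CacheEvictionInstrumentation
{
    private readonly Counter<long> _evictions;

    public CacheEvictionInstrumentation(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("Example.MemoryCache.Metrics"); // placeholder name
        _evictions = meter.CreateCounter<long>(
            name: "cache.evictions",
            unit: "{entry}",
            description: "Entries evicted from tracked memory caches.");
    }

    // Called from a cache's post-eviction callback.
    public void RecordEviction(string cacheName, string reason) =>
        _evictions.Add(1,
            new KeyValuePair<string, object?>("cache.name", cacheName),
            new KeyValuePair<string, object?>("eviction.reason", reason));
}
```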

Success Metrics

  1. Performance overhead: Maintain <100ns per cache operation when metrics are enabled
  2. Adoption rate: Achieve integration in existing applications without requiring code changes (for basic scenarios)
  3. Diagnostic value: Enable identification of cache thrashing patterns that were previously invisible
  4. Container efficiency: Reduce memory-related performance issues in containerized deployments by 20%
  5. Monitoring integration: Support export to at least 3 major monitoring platforms (Prometheus, Application Insights, Datadog)

Open Questions

  1. Automatic cache sizing documentation (future enhancement): Should the documentation include recommendations for cache configuration based on container memory constraints?
  2. Metric retention: How long should in-memory metrics be retained before aggregation/export, and should this be configurable?
  3. Performance testing scope: What specific performance benchmarks should be established to validate the <100ns overhead target across different cache usage patterns?
  4. Migration documentation: What level of detail is needed in migration guides for applications currently using custom cache monitoring solutions?

Parent Tasks for Memory Cache Metrics API with Eviction Tracking

1. Extend MemoryCacheStatistics with Eviction Tracking

  • Enhance Microsoft's MemoryCacheStatistics to include TotalEvictedEntries property while maintaining backward compatibility

Sub-tasks:

  • Create MemoryCacheStatisticsExtensions class with nullable TotalEvictedEntries property
  • Implement backward-compatible extension methods for existing MemoryCacheStatistics
  • Add thread-safe eviction counting mechanism
  • Create mapping between EvictionReason values (reported via post-eviction callbacks) and eviction categories (see the sketch after this list)
  • Implement statistics aggregation for multiple cache instances
  • Add validation and error handling for statistics collection
  • Write comprehensive unit tests for statistics extensions
  • Add integration tests with existing cache implementations
  • Create performance benchmarks to validate <100ns overhead target
  • Update API documentation and usage examples
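
A minimal sketch of thread-safe eviction counting follows, built on the existing RegisterPostEvictionCallback hook and EvictionReason enum. The category split shown (capacity vs. expiration vs. explicit removal) is one possible mapping, not a finalized design.

```csharp
using System.Threading;
using Microsoft.Extensions.Caching.Memory;

// Illustrative per-cache eviction counter using the existing post-eviction hook.
public sealed class EvictionCounter
{
    private long _capacityEvictions;   // size limit / memory pressure
    private long _expirationEvictions; // absolute, sliding, or token expiration
    private long _manualRemovals;      // Remove() or Set() replacing an entry

    public long TotalEvictedEntries =>
        Interlocked.Read(ref _capacityEvictions) +
        Interlocked.Read(ref _expirationEvictions) +
        Interlocked.Read(ref _manualRemovals);

    // Attach to every entry the component creates.
    public MemoryCacheEntryOptions Track(MemoryCacheEntryOptions options)
    {
        options.RegisterPostEvictionCallback(OnEvicted);
        return options;
    }

    private void OnEvicted(object key, object? value, EvictionReason reason, object? state)
    {
        switch (reason)
        {
            case EvictionReason.Capacity:
                Interlocked.Increment(ref _capacityEvictions);
                break;
            case EvictionReason.Expired:
            case EvictionReason.TokenExpired:
                Interlocked.Increment(ref _expirationEvictions);
                break;
            case EvictionReason.Removed:
            case EvictionReason.Replaced:
                Interlocked.Increment(ref _manualRemovals);
                break;
        }
    }
}
```

A component would then attach the counter to each entry it creates, for example cache.Set(key, value, counter.Track(new MemoryCacheEntryOptions())).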

2. Create MemoryCacheMetrics Service Infrastructure

  • Develop a centralized service for registering and tracking multiple named caches with weak reference management and conflict resolution

Sub-tasks:

  • Design IMemoryCacheMetrics interface with registration and tracking methods
  • Implement MemoryCacheMetrics service with weak reference management (a registration sketch follows this list)
  • Create cache registration system with naming conflict resolution
  • Implement automatic cleanup of disposed cache references
  • Add thread-safe concurrent access patterns for multi-cache scenarios
  • Create cache discovery mechanism for DI-registered caches
  • Implement metrics aggregation across multiple named caches
  • Add cache lifecycle management (registration, tracking, cleanup)
  • Write comprehensive unit tests for service functionality
  • Create integration tests for multi-cache scenarios
  • Add service registration extensions for dependency injection
  • Document service usage patterns and best practices
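
The registration side of the proposed service could look roughly like the sketch below: named registration, duplicate detection, weak references so a forgotten Unregister call cannot keep a cache alive, and opportunistic cleanup of entries whose cache has been collected. MemoryCacheRegistry is a hypothetical name for this sketch.

```csharp
using System;
using System.Collections.Concurrent;
using Microsoft.Extensions.Caching.Memory;

// Minimal weak-reference registry for named caches.
public sealed class MemoryCacheRegistry
{
    private readonly ConcurrentDictionary<string, WeakReference<IMemoryCache>> _caches = new();

    public void RegisterCache(string name, IMemoryCache cache)
    {
        var added = _caches.TryAdd(name, new WeakReference<IMemoryCache>(cache));
        if (!added)
        {
            // One conflict policy from the requirements: fail fast on duplicates.
            throw new InvalidOperationException($"A cache named '{name}' is already registered.");
        }
    }

    public bool TryGetCache(string name, out IMemoryCache? cache)
    {
        cache = null;
        if (_caches.TryGetValue(name, out var weakRef) && weakRef.TryGetTarget(out var target))
        {
            cache = target;
            return true;
        }

        // The cache was collected (or never registered); drop any stale entry.
        _caches.TryRemove(name, out _);
        return false;
    }
}
```

In this sketch TryGetCache doubles as the cleanup path, so no background timer is needed for the common case; a periodic sweep could still be added for caches that are never queried again.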

3. Implement Circuit Breaker and Sampling Mechanisms

  • Add configurable overhead protection with circuit breaker functionality and sampling rates to maintain the ~100ns performance target

Sub-tasks:

  • Design configurable sampling strategies covering real-time collection through periodic collection at 5-30 second intervals
  • Implement circuit breaker functionality that reduces or suspends metrics collection when overhead exceeds configurable thresholds (a sketch follows this list)
  • Add lightweight overhead measurement to drive circuit breaker decisions
  • Expose sampling rates and thresholds through the configuration tiers (no-config defaults, predefined profiles, fine-grained control)
  • Create performance benchmarks validating the <100ns overhead target across sampling modes
  • Write unit tests for circuit breaker state transitions and sampling behavior
  • Create integration tests covering high-concurrency collection scenarios
  • Document sampling trade-offs and recommended profiles
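
One possible shape for the overhead circuit breaker is sketched below: time a sampled subset of collection calls and suspend collection when the observed average exceeds a configurable budget. The default budget and sampling interval shown are placeholders, not recommended values.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Illustrative circuit breaker around metrics collection.
public sealed class MetricsCircuitBreaker
{
    private readonly long _budgetNanoseconds;
    private readonly int _sampleEvery;   // time 1 of every N collection calls
    private long _operations;
    private long _sampledNanoseconds;
    private int _open;                   // 1 = collection suspended

    public MetricsCircuitBreaker(long budgetNanoseconds = 100, int sampleEvery = 1000)
    {
        _budgetNanoseconds = budgetNanoseconds;
        _sampleEvery = sampleEvery;
    }

    public bool CollectionEnabled => Volatile.Read(ref _open) == 0;

    // Wraps a single metrics-collection step; trips the breaker if the running
    // average cost of sampled calls exceeds the budget.
    public void Collect(Action record)
    {
        if (!CollectionEnabled) return;

        var count = Interlocked.Increment(ref _operations);
        if (count % _sampleEvery != 0)
        {
            record();
            return;
        }

        var start = Stopwatch.GetTimestamp();
        record();
        var elapsedNs = (Stopwatch.GetTimestamp() - start) * 1_000_000_000 / Stopwatch.Frequency;

        var totalNs = Interlocked.Add(ref _sampledNanoseconds, elapsedNs);
        var samples = count / _sampleEvery;

        // Require a handful of samples before judging, then suspend collection.
        if (samples >= 10 && totalNs / samples > _budgetNanoseconds)
        {
            Interlocked.Exchange(ref _open, 1);
        }
    }
}
```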

4. Add Advanced Cache Management Features

  • Implement component-specific cache registration, automatic discovery of DI-registered caches, and enhanced export mechanisms

Sub-tasks:

  • Implement explicit registration API for component-specific caches
  • Add automatic discovery of DI-registered caches (one possible shape is sketched after this list)
  • Implement naming conflict handling (throw on duplicates or apply an automatic resolution strategy)
  • Extend weak reference management and cleanup to cover discovered caches
  • Enhance export mechanisms to surface per-component cache metrics
  • Write unit and integration tests covering registration, discovery, and conflict scenarios
  • Document registration patterns for library authors and service owners
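
One possible shape for automatic discovery is a hosted service that picks up the application's DI-registered default IMemoryCache at startup and registers it with the metrics service under a well-known name. DefaultCacheDiscoveryService and the "default" name are hypothetical, and IMemoryCacheMetrics refers to the interface sketched under Functional Requirements.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Hosting;

// Registers the application's default IMemoryCache with the metrics service
// when the host starts, so basic scenarios need no code changes.
public sealed class DefaultCacheDiscoveryService : IHostedService
{
    private readonly IMemoryCache _defaultCache;
    private readonly IMemoryCacheMetrics _metrics;

    public DefaultCacheDiscoveryService(IMemoryCache defaultCache, IMemoryCacheMetrics metrics)
    {
        _defaultCache = defaultCache;
        _metrics = metrics;
    }

    public Task StartAsync(CancellationToken cancellationToken)
    {
        _metrics.RegisterCache("default", _defaultCache);
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
```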

5. Integrate OpenTelemetry and External Monitoring

  • Enhance existing OpenTelemetry integration with support for multiple monitoring backends (Prometheus, Application Insights) and comprehensive metrics export

Sub-tasks:

  • Enhance existing OpenTelemetry integration with new metrics (a minimal wiring sketch follows this list)
  • Create Prometheus metrics exporter with proper label handling
  • Implement Application Insights integration with custom metrics
  • Add support for Grafana dashboard configuration
  • Create pluggable exporter architecture for extensibility
  • Implement metric transformation and aggregation for different backends
  • Add health checks and monitoring for export pipeline
  • Create configuration system for multiple export destinations
  • Implement retry logic and error handling for export failures
  • Write integration tests for all supported monitoring backends
  • Create example configurations for popular monitoring setups
  • Document monitoring setup and troubleshooting guides
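
A minimal wiring sketch for Prometheus export follows. It assumes the OpenTelemetry.Extensions.Hosting and OpenTelemetry.Exporter.Prometheus.AspNetCore packages; the meter name matches the placeholder used in the IMeterFactory sketch above, not a finalized name.

```csharp
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

// Subscribe to the cache meter and export it in Prometheus format.
builder.Services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("Example.MemoryCache.Metrics")
        .AddPrometheusExporter());

var app = builder.Build();

// Expose /metrics for Prometheus scraping.
app.MapPrometheusScrapingEndpoint();

app.Run();
```

Application Insights export would follow the same pattern by swapping in the corresponding exporter package.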

Metadata

Labels

enhancement (New feature or request), help wanted (Extra attention is needed), question (Further information is requested)
