Commit 8fdb4b0

RS: Metrics stream engine best practices for monitoring (#2256)
* DOC-5822 RS: Metrics stream engine best practices for monitoring
* Relref fix
1 parent 3c3d3d6 commit 8fdb4b0

2 files changed, 58 additions, 0 deletions

content/operate/rs/monitoring/metrics_stream_engine.md

Lines changed: 56 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -59,3 +59,59 @@ If you are already using the existing scraping endpoint for integration, do the
5959
1. Use the metrics tables in [this guide]({{<relref "/operate/rs/references/metrics/prometheus-metrics-v1-to-v2">}}) to transition from v1 metrics to equivalent v2 PromQL.
6060
6161
You can scrape both the existing and new endpoints simultaneously, which lets you prepare dashboards in advance and transition smoothly.
62+
63+
## Best practices for monitoring
64+
65+
Follow these best practices when monitoring your Redis Enterprise Software cluster using the metrics stream engine.
66+
67+
### Monitor host-level metrics
68+
69+
For cluster health, resources, and node stability, monitor these metrics:
70+
71+
| Group | Metric | Why monitor | Unit |
72+
|-------|--------|-------------|------|
73+
| CPU utilization | `node_cpu_user`,<br />`node_cpu_system` | Detect CPU saturation from Redis or the OS that results in higher latency and queueing. | Seconds (counter) |
74+
| Memory (freeable) | <span class="break-all">`node_memory_MemTotal_bytes`</span>,<br /><span class="break-all">`node_memory_MemFree_bytes`</span>,<br /><span class="break-all">`node_memory_Buffers_bytes`</span>,<br /><span class="break-all">`node_memory_Cached_bytes`</span> | Detect memory pressure early. Low free memory or cache can precede swapping or out-of-memory errors. | Bytes (gauge) |
75+
| Swap usage | <span class="break-all">`node_ephemeral_storage_free`</span> | Monitor memory and disk pressure in your setup. Sustained pressure leads to latency spikes. | Bytes (gauge) |
76+
| Network traffic | <span class="break-all">`node_ingress_bytes`</span>,<br /><span class="break-all">`node_egress_bytes`</span> | Ensure the network interface is not saturated. Protects replication and client responsiveness. | Bytes (counter) |
77+
| Disk space | <span class="break-all">`node_filesystem_avail_bytes`</span>,<br /><span class="break-all">`node_filesystem_size_bytes`</span> | Prevent persistence and logging outages from low disk space. | Bytes (gauge) |
78+
| Cluster state | `has_quorum{…}` | Monitor whether quorum is maintained (1) or lost (0). | Boolean |
79+
| | `node_metrics_up` | Monitor whether the node is connected and reporting to the cluster. | Gauge |
80+
| Licensing | `license_shards_limit` | Track shard capacity limits by type (RAM or flash). | Count |
81+
| Certificates | <span class="break-all">`node_cert_expires_in_seconds`</span> | Avoid downtime from expired node certificates. | Seconds (gauge) |
82+
| Services – CPU | <span class="break-all">`namedprocess_namegroup_cpu_seconds_total`</span> | Identify abnormal CPU usage by platform services that can starve Redis, such as `alert_mgr`, `redis_mgr`, `dmc_proxy`. | Seconds (counter) |
83+
| Services – memory | <span class="break-all">`namedprocess_namegroup_memory_bytes`</span> | Detect memory leaks or outliers in platform services, such as `alert_mgr`, `redis_mgr`, `dmc_proxy`. | Bytes (gauge) |
84+
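As a rough illustration of how the counters and gauges in the table above are usually turned into values you can alert on, the following Python sketch derives a per-second rate from two samples of a counter, a freeable-memory estimate, and a disk usage fraction. The function names, the sample values, and the free + buffers + cached formula are assumptions made for illustration; they are not part of Redis Enterprise Software or its metrics API.

```python
def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second increase of a counter between two scrapes."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter that went backwards was reset (for example, after a restart);
    # in that case treat the current value as the increase since the reset.
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s


def freeable_memory_bytes(mem_free: float, buffers: float, cached: float) -> float:
    """Assumed estimate: free memory plus reclaimable buffers and page cache."""
    return mem_free + buffers + cached


def disk_used_fraction(avail_bytes: float, size_bytes: float) -> float:
    """Fraction of the filesystem in use, from available and total bytes."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return 1.0 - avail_bytes / size_bytes


if __name__ == "__main__":
    # Hypothetical samples of node_cpu_user taken 30 seconds apart.
    print(counter_rate(1200.0, 1215.0, 30.0))
    # Hypothetical node_filesystem_avail_bytes and node_filesystem_size_bytes.
    print(disk_used_fraction(25e9, 100e9))
```

The same `counter_rate` helper applies to any metric the table marks as a counter, such as `node_ingress_bytes` or `namedprocess_namegroup_cpu_seconds_total`. Gauges can be compared against thresholds directly.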
85+
### Monitor database-level metrics
86+
87+
For database performance, availability, and efficiency, monitor the following metrics:
88+
89+
| Group | Metric | Why monitor | Unit |
90+
|-------|--------|-------------|------|
91+
| Memory | <span class="break-all">`redis_server_used_memory`</span> | Track actual data memory to prevent out-of-memory errors and evictions. | Bytes |
92+
| Memory | `allocator_allocated` | Monitor bytes allocated by the allocator (includes internal fragmentation). | Bytes |
93+
| Memory | `allocator_active` | Monitor bytes in active pages (includes external fragmentation). Use delta/ratio versus allocated to infer defraggable memory. | Bytes |
94+
| Memory | <span class="break-all">`active_defrag_running`</span> | Monitor whether active defragmentation is running and its intended CPU percentage. High values can affect performance. | % (gauge) |
95+
| Latency | <span class="break-all">`endpoint_read_requests_latency_histogram`</span>,<br /><span class="break-all">`endpoint_write_requests_latency_histogram`</span>,<br /><span class="break-all">`endpoint_other_requests_latency_histogram`</span> | Monitor server-side command latency. | Microseconds |
96+
| High availability | <span class="break-all">`redis_server_master_repl_offset`</span> | Compute replica throughput and lag using deltas over time. | Bytes (counter) |
97+
| High availability | <span class="break-all">`redis_server_master_link_status`</span> | Monitor replica link status (up or down) for early warning of high availability risk. | Status |
98+
| Active-Active | <span class="break-all">`database_syncer_dst_lag`</span>,<br /><span class="break-all">`database_syncer_lag_ms`</span> | Detect cross-region synchronization delays that impact consistency and SLAs. | Milliseconds (gauge) |
99+
| Active-Active | <span class="break-all">`database_syncer_state`</span> | Monitor operational state for troubleshooting synchronization issues. | Gauge |
100+
| Traffic – requests | <span class="break-all">`endpoint_read_requests`</span>,<br /><span class="break-all">`endpoint_write_requests`</span>,<br /><span class="break-all">`endpoint_other_requests`</span> | Monitor workload mix and spikes that drive capacity and latency. Total equals the sum of all three. | Counter |
101+
| Traffic – responses | <span class="break-all">`endpoint_read_responses`</span>,<br /><span class="break-all">`endpoint_write_responses`</span>,<br /><span class="break-all">`endpoint_other_responses`</span> | Validate service responsiveness and symmetry with requests. | Counter |
102+
| Traffic – bytes | <span class="break-all">`endpoint_ingress`</span>,<br /><span class="break-all">`endpoint_egress`</span> | Monitor size trends and watch for sudden growth that impacts egress costs or bandwidth. | Bytes (counter) |
103+
| Egress queue | <span class="break-all">`endpoint_egress_pending`</span>,<br /><span class="break-all">`endpoint_egress_pending_discarded`</span> | Monitor back-pressure and drops that indicate network or client issues. | Bytes (counter) |
104+
| Connections | <span class="break-all">`endpoint_client_connection`</span> | Monitor accepted connections over time and match against client rollouts or spikes. | Counter |
105+
| Connections | <span class="break-all">`endpoint_client_connection_expired`</span> | Monitor connections closed due to TTL expiry, which can indicate idle policy or client issues. | Counter |
106+
| Connections | <span class="break-all">`endpoint_longest_pipeline_histogram`</span> | Monitor long pipelines that can amplify latency bursts and detect misbehaving clients. | Histogram (count) |
107+
| Connections | <span class="break-all">`endpoint_client_connections`</span>,<br /><span class="break-all">`endpoint_client_disconnections`</span>,<br /><span class="break-all">`endpoint_proxy_disconnections`</span> | Monitor connection churn and identify who closed the socket (client versus proxy). Current connections ≈ connections − disconnections. | Counter |
108+
| Cache efficiency | <span class="break-all">`total_keys`</span>,<br /><span class="break-all">`total_volatile_keys`</span> | Monitor key inventory and TTL coverage to inform eviction strategy. | Counter |
109+
| Cache efficiency | <span class="break-all">`total_evicted_keys`</span>,<br /><span class="break-all">`total_expired_keys`</span> | Monitor eviction and expiry rates. Frequent evictions indicate memory pressure or poor sizing. | Counter |
110+
| Cache efficiency | `cache_hits`,<br /><span class="break-all">`cache_hit_rate`</span> | Monitor hit rate, which drives read latency and cost. Cache hit rate equals <span class="break-all">cache_hits/(cache_hits+cache_misses)</span>. | Count / Ratio (%) |
111+
| Cache efficiency | <span class="break-all">`endpoint_client_tracking_on_requests`</span>,<br /><span class="break-all">`endpoint_client_tracking_off_requests`</span>,<br /><span class="break-all">`endpoint_disposed_commands_after_client_caching`</span> | Track client-side caching usage and misuse. | Counter |
112+
| Big / complex keys | <span class="break-all">`redis_server_<data_type>_<size_or_items>_<bucket>`</span> | Monitor oversized keys and cardinality that cause fragmentation, slow replication, and CPU spikes. Track to prevent incidents. Examples:<br /><span class="break-all">`strings_sizes_over_512M`</span>,<br /><span class="break-all">`zsets_items_over_8M`</span> | Gauge |
113+
| Security – clients | <span class="break-all">`endpoint_client_expiration_refresh`</span>,<br /><span class="break-all">`endpoint_client_establishment_failures`</span> | Monitor unstable clients or problems with authentication or setup. | Counter |
114+
| Security – LDAP | <span class="break-all">`endpoint_successful_ldap_authentication`</span>,<br /><span class="break-all">`endpoint_failed_ldap_authentication`</span>,<br /><span class="break-all">`endpoint_disconnected_ldap_client`</span> | Monitor authentication health and detect brute-force attacks or misconfigurations. | Counter |
115+
| Security – cert-based | <span class="break-all">`endpoint_successful_cba_authentication`</span>,<br /><span class="break-all">`endpoint_failed_cba_authentication`</span>,<br /><span class="break-all">`endpoint_disconnected_cba_client`</span> | Monitor certificate authentication status and failures. | Counter |
116+
| Security – password | <span class="break-all">`endpoint_disconnected_user_password_client`</span> | Monitor password-authentication client disconnects and correlate with policy changes. | Counter |
117+
| Security – ACL | <span class="break-all">`acl_access_denied_auth`</span>,<br /><span class="break-all">`acl_access_denied_cmd`</span>,<br /><span class="break-all">`acl_access_denied_key`</span>,<br /><span class="break-all">`acl_access_denied_channel`</span> | Monitor unauthorized access attempts and incorrectly scoped ACLs. | Counter |
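
The relationships stated in the table above (total requests as the sum of the three request counters, current connections as connections minus disconnections, and the cache hit rate formula) can be sketched in Python as follows. This is a minimal sketch with hypothetical function names and sample values; which disconnection counters you subtract, and how you obtain the sampled values, depends on your monitoring setup.

```python
from typing import Optional


def total_requests(read: int, write: int, other: int) -> int:
    """Total requests equals the sum of read, write, and other requests."""
    return read + write + other


def current_connections(connections: int, disconnections: int) -> int:
    """Approximate open connections: connections minus disconnections."""
    return connections - disconnections


def cache_hit_rate(hits: int, misses: int) -> Optional[float]:
    """cache_hits / (cache_hits + cache_misses), or None with no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return None  # The ratio is undefined until at least one lookup occurs.
    return hits / lookups


if __name__ == "__main__":
    # Hypothetical sampled values, not real cluster output.
    print(total_requests(100, 40, 10))
    print(current_connections(500, 480))
    print(cache_hit_rate(90, 10))
```

Returning `None` for an idle database avoids reporting a misleading 0% hit rate when no lookups have happened yet.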

content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -138,6 +138,8 @@ The [metrics stream engine]({{<relref "/operate/rs/monitoring/metrics_stream_eng
138138
139139
- As part of the transition to the metrics stream engine, some internal cluster manager alerts were deprecated in favor of external monitoring solutions. See the [alerts transition plan]({{<relref "/operate/rs/references/alerts/alerts-v1-to-v2">}}) for guidance.
140140
141+
- See [Best practices for monitoring]({{<relref "/operate/rs/monitoring/metrics_stream_engine#best-practices-for-monitoring">}}) for a list of recommended metrics to monitor.
142+
141143
### Enhancements
142144
143145
- Module management enhancements:
