content/operate/rs/monitoring/metrics_stream_engine.md
56 additions & 0 deletions
@@ -59,3 +59,59 @@ If you are already using the existing scraping endpoint for integration, do the
1. Use the metrics tables in [this guide]({{<relref "/operate/rs/references/metrics/prometheus-metrics-v1-to-v2">}}) to transition from v1 metrics to equivalent v2 PromQL.

It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition.

## Best practices for monitoring

Follow these best practices when monitoring your Redis Enterprise Software cluster using the metrics stream engine.

### Monitor host-level metrics

For cluster health, resources, and node stability, monitor these metrics:

| Group | Metric | Why monitor | Unit |
|-------|--------|-------------|------|
| CPU utilization | `node_cpu_user`,<br />`node_cpu_system` | Detect CPU saturation from Redis or the OS that results in higher latency and queueing. | Seconds (counter) |
| Swap usage | <span class="break-all">`node_ephemeral_storage_free`</span> | Monitor memory and disk pressure in your setup. Sustained pressure leads to latency spikes. | Bytes (gauge) |
| Network traffic | <span class="break-all">`node_ingress_bytes`</span>,<br /><span class="break-all">`node_egress_bytes`</span> | Ensure the network interface is not saturated. Protects replication and client responsiveness. | Bytes (counter) |
| Disk space | <span class="break-all">`node_filesystem_avail_bytes`</span>,<br /><span class="break-all">`node_filesystem_size_bytes`</span> | Prevent persistence and logging outages from low disk space. | Bytes (gauge) |
| Cluster state | `has_quorum{…}` | Monitor whether quorum is maintained (1) or lost (0). | Boolean |
| | `node_metrics_up` | Monitor whether the node is connected and reporting to the cluster. | Gauge |
| Licensing | `license_shards_limit` | Track shard capacity limits by type (RAM or flash). | Count |
| Services – CPU | <span class="break-all">`namedprocess_namegroup_cpu_seconds_total`</span> | Identify abnormal CPU usage by platform services that can starve Redis, such as `alert_mgr`, `redis_mgr`, `dmc_proxy`. | Seconds (counter) |
| Services – memory | <span class="break-all">`namedprocess_namegroup_memory_bytes`</span> | Detect memory leaks or outliers in platform services, such as `alert_mgr`, `redis_mgr`, `dmc_proxy`. | Bytes (gauge) |
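
Several host-level metrics in the table above are counters or raw byte gauges, so they only become meaningful once converted to a rate or a ratio. The following is a minimal Python sketch of that arithmetic, not part of Redis Enterprise Software. It assumes you have already exported `(timestamp, value)` samples from your monitoring system; the helper names `counter_increase`, `counter_rate`, and `free_ratio` and all sample values are hypothetical. How `node_cpu_user` aggregates across cores is not stated here, so interpret the resulting rate accordingly.

```python
from typing import Sequence, Tuple

# One scraped sample: (unix_timestamp_seconds, counter_value)
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Sum the increases between consecutive samples of a counter.

    A drop in value is treated as a counter reset (for example, after a
    process restart), so the new value itself counts as the increase.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second rate of a counter over the sampled window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    return counter_increase(samples) / elapsed


def free_ratio(avail_bytes: float, size_bytes: float) -> float:
    """Fraction of a filesystem that is still available."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return avail_bytes / size_bytes


# Hypothetical samples of node_cpu_user (CPU seconds) taken 30 seconds apart.
cpu_user_samples = [(0.0, 100.0), (30.0, 112.0), (60.0, 124.0)]
cpu_seconds_per_second = counter_rate(cpu_user_samples)

# Hypothetical node_filesystem_avail_bytes and node_filesystem_size_bytes values.
disk_free = free_ratio(25 * 2**30, 100 * 2**30)

print(f"CPU user rate: {cpu_seconds_per_second:.2f} CPU seconds per second")
print(f"Disk free: {disk_free:.0%}")
```

The same rate calculation applies to the other counters in the table, such as `node_ingress_bytes` and `node_egress_bytes`. In a Prometheus-based setup you would typically let the query layer compute these rates instead of doing it by hand.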
### Monitor database-level metrics

For database performance, availability, and efficiency, monitor the following metrics:

| Group | Metric | Why monitor | Unit |
|-------|--------|-------------|------|
| Memory | <span class="break-all">`redis_server_used_memory`</span> | Track actual data memory to prevent out-of-memory errors and evictions. | Bytes |
| Memory | `allocator_active` | Monitor bytes in active pages, which includes external fragmentation. Compare with allocated memory (as a difference or ratio) to estimate how much memory defragmentation could reclaim. | Bytes |
| Memory | <span class="break-all">`active_defrag_running`</span> | Monitor whether active defragmentation is running and the CPU percentage it intends to use. High values can affect performance. | % (gauge) |
| High availability | <span class="break-all">`redis_server_master_repl_offset`</span> | Compute replica throughput and lag using deltas over time. | Bytes (counter) |
| High availability | <span class="break-all">`redis_server_master_link_status`</span> | Monitor replica link status (up or down) for early warning of high availability risk. | Status |
| Active-Active | <span class="break-all">`database_syncer_state`</span> | Monitor operational state for troubleshooting synchronization issues. | Gauge |
| Traffic – requests | <span class="break-all">`endpoint_read_requests`</span>,<br /><span class="break-all">`endpoint_write_requests`</span>,<br /><span class="break-all">`endpoint_other_requests`</span> | Monitor workload mix and spikes that drive capacity and latency. Total equals the sum of all three. | Counter |
| Traffic – responses | <span class="break-all">`endpoint_read_responses`</span>,<br /><span class="break-all">`endpoint_write_responses`</span>,<br /><span class="break-all">`endpoint_other_responses`</span> | Validate service responsiveness and symmetry with requests. | Counter |
| Traffic – bytes | <span class="break-all">`endpoint_ingress`</span>,<br /><span class="break-all">`endpoint_egress`</span> | Monitor size trends and watch for sudden growth that impacts egress costs or bandwidth. | Bytes (counter) |
| Egress queue | <span class="break-all">`endpoint_egress_pending`</span>,<br /><span class="break-all">`endpoint_egress_pending_discarded`</span> | Monitor back-pressure and drops that indicate network or client issues. | Bytes (counter) |
| Connections | <span class="break-all">`endpoint_client_connection`</span> | Monitor accepted connections over time and match against client rollouts or spikes. | Counter |
| Connections | <span class="break-all">`endpoint_client_connection_expired`</span> | Monitor connections closed due to TTL expiry, which can indicate idle policy or client issues. | Counter |
| Connections | <span class="break-all">`endpoint_longest_pipeline_histogram`</span> | Monitor long pipelines that can amplify latency bursts and detect misbehaving clients. | Histogram (count) |
| Connections | <span class="break-all">`endpoint_client_connections`</span>,<br /><span class="break-all">`endpoint_client_disconnections`</span>,<br /><span class="break-all">`endpoint_proxy_disconnections`</span> | Monitor connection churn and identify who closed the socket (client versus proxy). Current connections ≈ connections − disconnections. | Counter |
| Big / complex keys | <span class="break-all">`redis_server_<data_type>_<size_or_items>_<bucket>`</span> | Monitor oversized keys and cardinality that cause fragmentation, slow replication, and CPU spikes. Track to prevent incidents. Examples:<br /><span class="break-all">`strings_sizes_over_512M`</span>,<br /><span class="break-all">`zsets_items_over_8M`</span> | Gauge |
| Security – clients | <span class="break-all">`endpoint_client_expiration_refresh`</span>,<br /><span class="break-all">`endpoint_client_establishment_failures`</span> | Monitor unstable clients or problems with authentication or setup. | Counter |
| Security – LDAP | <span class="break-all">`endpoint_successful_ldap_authentication`</span>,<br /><span class="break-all">`endpoint_failed_ldap_authentication`</span>,<br /><span class="break-all">`endpoint_disconnected_ldap_client`</span> | Monitor authentication health and detect brute-force attacks or misconfigurations. | Counter |
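
Two rows in the table above describe derived values: total requests as the sum of the read, write, and other request counters, and current connections as roughly connections minus disconnections. The Python sketch below illustrates that arithmetic only. The `EndpointSample` class, its field names, and the sample values are hypothetical, and it assumes that both client-initiated and proxy-initiated disconnections count toward the disconnection total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSample:
    """Counter values read at one point in time (hypothetical container)."""

    read_requests: float          # endpoint_read_requests
    write_requests: float         # endpoint_write_requests
    other_requests: float         # endpoint_other_requests
    client_connections: float     # endpoint_client_connections
    client_disconnections: float  # endpoint_client_disconnections
    proxy_disconnections: float   # endpoint_proxy_disconnections


def total_requests(sample: EndpointSample) -> float:
    """Total requests: the sum of read, write, and other requests."""
    return sample.read_requests + sample.write_requests + sample.other_requests


def estimated_open_connections(sample: EndpointSample) -> float:
    """Approximate current connections as connections minus disconnections.

    Assumption: disconnections closed by the client and by the proxy are
    both subtracted. The result is clamped at zero because counters that
    are scraped at slightly different moments can briefly disagree.
    """
    closed = sample.client_disconnections + sample.proxy_disconnections
    return max(0.0, sample.client_connections - closed)


# Hypothetical counter values for one endpoint.
sample = EndpointSample(
    read_requests=1200,
    write_requests=300,
    other_requests=50,
    client_connections=500,
    client_disconnections=420,
    proxy_disconnections=30,
)

print("total requests:", total_requests(sample))
print("estimated open connections:", estimated_open_connections(sample))
```

Because the inputs are counters, the estimate is only an approximation and drifts if any of the counters resets, so treat it as a trend indicator rather than an exact count.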
content/operate/rs/release-notes/rs-8-0-releases/rs-8-0-tba.md
2 additions & 0 deletions
@@ -138,6 +138,8 @@ The [metrics stream engine]({{<relref "/operate/rs/monitoring/metrics_stream_eng
- As part of the transition to the metrics stream engine, some internal cluster manager alerts were deprecated in favor of external monitoring solutions. See the [alerts transition plan]({{<relref "/operate/rs/references/alerts/alerts-v1-to-v2">}}) for guidance.
- See [Best practices for monitoring]({{<relref "/operate/rs/monitoring/metrics_stream_engine#best-practices-for-monitoring">}}) for a list of recommended metrics to monitor.