From 8919f69ed5e8eb6e71104881cce1a21927d4ae4c Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Wed, 26 Feb 2025 15:40:45 -0600 Subject: [PATCH 1/9] DOC-4800 Started top-level RS monitoring section --- .../rs-prometheus-metrics-transition-plan.md | 260 +++++++ content/embeds/rs-prometheus-metrics-v2.md | 199 ++++++ .../_index.md | 1 + .../prometheus-metrics-definitions.md | 201 +----- .../prometheus-metrics-v1-to-v2.md | 262 +------- .../rs/{clusters => }/monitoring/_index.md | 4 +- .../metrics_stream_engine/_index.md | 15 + .../prometheus-metrics-v1-to-v2.md | 21 + .../prometheus-metrics-v2.md | 25 + .../prometheus_and_grafana.md | 173 +++++ .../rs/monitoring/v1_monitoring/_index.md | 15 + .../monitoring/v1_monitoring/observability.md | 632 ++++++++++++++++++ .../v1_monitoring/prometheus-metrics-v1.md | 282 ++++++++ .../v1_monitoring/prometheus_and_grafana.md | 172 +++++ 14 files changed, 1802 insertions(+), 460 deletions(-) create mode 100644 content/embeds/rs-prometheus-metrics-transition-plan.md create mode 100644 content/embeds/rs-prometheus-metrics-v2.md rename content/operate/rs/{clusters => }/monitoring/_index.md (98%) create mode 100644 content/operate/rs/monitoring/metrics_stream_engine/_index.md create mode 100644 content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md create mode 100644 content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md create mode 100644 content/operate/rs/monitoring/metrics_stream_engine/prometheus_and_grafana.md create mode 100644 content/operate/rs/monitoring/v1_monitoring/_index.md create mode 100644 content/operate/rs/monitoring/v1_monitoring/observability.md create mode 100644 content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md create mode 100644 content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md diff --git a/content/embeds/rs-prometheus-metrics-transition-plan.md b/content/embeds/rs-prometheus-metrics-transition-plan.md new file mode 
100644 index 0000000000..536646e9c4 --- /dev/null +++ b/content/embeds/rs-prometheus-metrics-transition-plan.md @@ -0,0 +1,260 @@ +## Database metrics + +| V1 metric | Equivalent V2 PromQL | Description | +| --------- | :------------------- | :---------- | +| bdb_avg_latency | `sum by (db) (irate(endpoint_acc_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Average latency of operations on the database (seconds); returned only when there is traffic | +| bdb_avg_latency_max | `sum by (db) (irate(endpoint_acc_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Highest value of average latency of operations on the database (seconds); returned only when there is traffic | +| bdb_avg_read_latency | `sum by (db) (irate(endpoint_acc_read_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Average latency of read operations (seconds); returned only when there is traffic | +| bdb_avg_read_latency_max | `sum by (db) (irate(endpoint_acc_read_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Highest value of average latency of read operations (seconds); returned only when there is traffic | +| bdb_avg_write_latency | `sum by (db) (irate(endpoint_acc_write_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Average latency of write operations (seconds); returned only when there is traffic | +| bdb_avg_write_latency_max | `sum by (db) (irate(endpoint_acc_write_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Highest value of average latency of write operations (seconds); returned only when there is traffic | +| bdb_bigstore_shard_count | `sum((sum(label_replace(label_replace(namedprocess_namegroup_thread_count{groupname=~"redis-\d+", threadname=~"(speedb\|rocksdb).*"}, "redis", "$1", "groupname", "redis-(\d+)"), "driver", "$1", "threadname", "(speedb\|rocksdb).*")) by (redis, driver) > bool 0) * on 
(redis) group_left(db) redis_server_up) by (db, driver)` | Shard count by database and by storage engine (driver - rocksdb / speedb); Only for databases with Auto Tiering enabled | +| bdb_conns | `sum by(db) (endpoint_client_connections)` | Number of client connections to database | +| bdb_egress_bytes | `sum by(db) (irate(endpoint_egress_bytes[1m]))` | Rate of outgoing network traffic from the database (bytes/sec) | +| bdb_egress_bytes_max | `sum by(db) (irate(endpoint_egress_bytes[1m]))` | Highest value of the rate of outgoing network traffic from the database (bytes/sec) | +| bdb_evicted_objects | `sum by (db) (irate(redis_server_evicted_keys{role="master"}[1m]))` | Rate of key evictions from database (evictions/sec) | +| bdb_evicted_objects_max | `sum by (db) (irate(redis_server_evicted_keys{role="master"}[1m]))` | Highest value of the rate of key evictions from database (evictions/sec) | +| bdb_expired_objects | `sum by (db) (irate(redis_server_expired_keys{role="master"}[1m]))` | Rate of keys expired in database (expirations/sec) | +| bdb_expired_objects_max | `sum by (db) (irate(redis_server_expired_keys{role="master"}[1m]))` | Highest value of the rate of keys expired in database (expirations/sec) | +| bdb_fork_cpu_system | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system"}[1m]))` | % cores utilization in system mode for all Redis shard fork child processes of this database | +| bdb_fork_cpu_system_max | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system"}[1m]))` | Highest value of % cores utilization in system mode for all Redis shard fork child processes of this database | +| bdb_fork_cpu_user | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user"}[1m]))` | % cores utilization in user mode for all Redis shard fork child processes of this database | +| bdb_fork_cpu_user_max | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user"}[1m]))` | Highest
value of % cores utilization in user mode for all Redis shard fork child processes of this database | +| bdb_ingress_bytes | `sum by(db) (irate(endpoint_ingress_bytes[1m]))` | Rate of incoming network traffic to database (bytes/sec) | +| bdb_ingress_bytes_max | `sum by(db) (irate(endpoint_ingress_bytes[1m]))` | Highest value of the rate of incoming network traffic to database (bytes/sec) | +| bdb_instantaneous_ops_per_sec | `sum by(db) (redis_server_instantaneous_ops_per_sec)` | Request rate handled by all shards of database (ops/sec) | +| bdb_main_thread_cpu_system | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", threadname=~"redis-server.*"}[1m]))` | % cores utilization in system mode for all Redis shard main threads of this database | +| bdb_main_thread_cpu_system_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", threadname=~"redis-server.*"}[1m]))` | Highest value of % cores utilization in system mode for all Redis shard main threads of this database | +| bdb_main_thread_cpu_user | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", threadname=~"redis-server.*"}[1m]))` | % cores utilization in user mode for all Redis shard main threads of this database | +| bdb_main_thread_cpu_user_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", threadname=~"redis-server.*"}[1m]))` | Highest value of % cores utilization in user mode for all Redis shard main threads of this database | +| bdb_mem_frag_ratio | `avg(redis_server_mem_fragmentation_ratio)` | RAM fragmentation ratio (RSS / allocated RAM) | +| bdb_mem_size_lua | `sum by(db) (redis_server_used_memory_lua)` | Redis Lua scripting heap size (bytes) | +| bdb_memory_limit | `sum by(db) (redis_server_maxmemory)` | Configured RAM limit for the database | +| bdb_monitor_sessions_count | `sum by(db) (endpoint_monitor_sessions_count)` | Number of clients connected in monitor mode to the database | +|
bdb_no_of_keys | `sum by (db) (redis_server_db_keys{role="master"})` | Number of keys in database | +| bdb_other_req | `sum by(db) (irate(endpoint_other_req[1m]))` | Rate of other (non read/write) requests on the database (ops/sec) | +| bdb_other_req_max | `sum by(db) (irate(endpoint_other_req[1m]))` | Highest value of the rate of other (non read/write) requests on the database (ops/sec) | +| bdb_other_res | `sum by(db) (irate(endpoint_other_res[1m]))` | Rate of other (non read/write) responses on the database (ops/sec) | +| bdb_other_res_max | `sum by(db) (irate(endpoint_other_res[1m]))` | Highest value of the rate of other (non read/write) responses on the database (ops/sec) | +| bdb_pubsub_channels | `sum by(db) (redis_server_pubsub_channels)` | Count of pub/sub channels with subscribed clients | +| bdb_pubsub_channels_max | `sum by(db) (redis_server_pubsub_channels)` | Highest value of the count of pub/sub channels with subscribed clients | +| bdb_pubsub_patterns | `sum by(db) (redis_server_pubsub_patterns)` | Count of pub/sub patterns with subscribed clients | +| bdb_pubsub_patterns_max | `sum by(db) (redis_server_pubsub_patterns)` | Highest value of the count of pub/sub patterns with subscribed clients | +| bdb_read_hits | `sum by (db) (irate(redis_server_keyspace_read_hits{role="master"}[1m]))` | Rate of read operations accessing an existing key (ops/sec) | +| bdb_read_hits_max | `sum by (db) (irate(redis_server_keyspace_read_hits{role="master"}[1m]))` | Highest value of the rate of read operations accessing an existing key (ops/sec) | +| bdb_read_misses | `sum by (db) (irate(redis_server_keyspace_read_misses{role="master"}[1m]))` | Rate of read operations accessing a non-existing key (ops/sec) | +| bdb_read_misses_max | `sum by (db) (irate(redis_server_keyspace_read_misses{role="master"}[1m]))` | Highest value of the rate of read operations accessing a non-existing key (ops/sec) | +| bdb_read_req | `sum by (db) (irate(endpoint_read_req[1m]))` | Rate of read
requests on the database (ops/sec) | +| bdb_read_req_max | `sum by (db) (irate(endpoint_read_req[1m]))` | Highest value of the rate of read requests on the database (ops/sec) | +| bdb_read_res | `sum by(db) (irate(endpoint_read_res[1m]))` | Rate of read responses on the database (ops/sec) | +| bdb_read_res_max | `sum by(db) (irate(endpoint_read_res[1m]))` | Highest value of the rate of read responses on the database (ops/sec) | +| bdb_shard_cpu_system | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", role="master"}[1m]))` | % cores utilization in system mode for all Redis shard processes of this database | +| bdb_shard_cpu_system_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", role="master"}[1m]))` | Highest value of % cores utilization in system mode for all Redis shard processes of this database | +| bdb_shard_cpu_user | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", role="master"}[1m]))` | % cores utilization in user mode for the Redis shard process | +| bdb_shard_cpu_user_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", role="master"}[1m]))` | Highest value of % cores utilization in user mode for the Redis shard process | +| bdb_shards_used | `sum((sum(label_replace(label_replace(label_replace(namedprocess_namegroup_thread_count{groupname=~"redis-\d+"}, "redis", "$1", "groupname", "redis-(\d+)"), "shard_type", "flash", "threadname", "(bigstore).*"), "shard_type", "ram", "shard_type", "")) by (redis, shard_type) > bool 0) * on (redis) group_left(db) redis_server_up) by (db, shard_type)` | Used shard count by database and by shard type (ram / flash) | +| bdb_total_connections_received | `sum by(db) (irate(endpoint_total_connections_received[1m]))` | Rate of new client connections to database (connections/sec) | +| bdb_total_connections_received_max | `sum by(db) (irate(endpoint_total_connections_received[1m]))` | Highest 
value of the rate of new client connections to database (connections/sec) | +| bdb_total_req | `sum by (db) (irate(endpoint_total_req[1m]))` | Rate of all requests on the database (ops/sec) | +| bdb_total_req_max | `sum by (db) (irate(endpoint_total_req[1m]))` | Highest value of the rate of all requests on the database (ops/sec) | +| bdb_total_res | `sum by(db) (irate(endpoint_total_res[1m]))` | Rate of all responses on the database (ops/sec) | +| bdb_total_res_max | `sum by(db) (irate(endpoint_total_res[1m]))` | Highest value of the rate of all responses on the database (ops/sec) | +| bdb_up | `min by(db) (redis_up)` | Database is up and running | +| bdb_used_memory | `sum by (db) (redis_server_used_memory)` | Memory used by database (in BigRedis this includes flash) (bytes) | +| bdb_write_hits | `sum by (db) (irate(redis_server_keyspace_write_hits{role="master"}[1m]))` | Rate of write operations accessing an existing key (ops/sec) | +| bdb_write_hits_max | `sum by (db) (irate(redis_server_keyspace_write_hits{role="master"}[1m]))` | Highest value of the rate of write operations accessing an existing key (ops/sec) | +| bdb_write_misses | `sum by (db) (irate(redis_server_keyspace_write_misses{role="master"}[1m]))` | Rate of write operations accessing a non-existing key (ops/sec) | +| bdb_write_misses_max | `sum by (db) (irate(redis_server_keyspace_write_misses{role="master"}[1m]))` | Highest value of the rate of write operations accessing a non-existing key (ops/sec) | +| bdb_write_req | `sum by (db) (irate(endpoint_write_requests[1m]))` | Rate of write requests on the database (ops/sec) | +| bdb_write_req_max | `sum by (db) (irate(endpoint_write_requests[1m]))` | Highest value of the rate of write requests on the database (ops/sec) | +| bdb_write_res | `sum by(db) (irate(endpoint_write_responses[1m]))` | Rate of write responses on the database (ops/sec) | +| bdb_write_res_max | `sum by(db) (irate(endpoint_write_responses[1m]))` | Highest value of the rate of write 
responses on the database (ops/sec) | +| no_of_expires | `sum by(db) (redis_server_db_expires{role="master"})` | Current number of volatile keys in the database | + +## Node metrics + +| V1 metric | Equivalent V2 PromQL | Description | +| --------- | :------------------- | :---------- | +| node_available_flash | `node_available_flash_bytes` | Available flash in the node (bytes) | +| node_available_flash_no_overbooking | `node_available_flash_no_overbooking_bytes` | Available flash in the node (bytes), without taking into account overbooking | +| node_available_memory | `node_available_memory_bytes` | Amount of free memory in the node (bytes) that is available for database provisioning | +| node_available_memory_no_overbooking | `node_available_memory_no_overbooking_bytes` | Available RAM in the node (bytes) without taking into account overbooking | +| node_avg_latency | `sum by (proxy) (irate(endpoint_acc_latency[1m])) / sum by (proxy) (irate(endpoint_total_started_res[1m]))` | Average latency of requests handled by endpoints on the node in milliseconds; returned only when there is traffic | +| node_bigstore_free | `node_bigstore_free_bytes` | Sum of free space of back-end flash (used by flash database's [BigRedis]) on all cluster nodes (bytes); returned only when BigRedis is enabled | +| node_bigstore_iops | `node_flash_reads_total + node_flash_writes_total` | Rate of I/O operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled | +| node_bigstore_kv_ops | `sum by (node) (irate(redis_server_big_io_dels[1m]) + irate(redis_server_big_io_reads[1m]) + irate(redis_server_big_io_writes[1m]))` | Rate of value read/write operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled | +| node_bigstore_throughput | `sum by (node) 
(irate(redis_server_big_io_read_bytes[1m]) + irate(redis_server_big_io_write_bytes[1m]))` | Throughput I/O operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (bytes/sec); returned only when BigRedis is enabled | +| node_cert_expiration_seconds | `node_cert_expires_in_seconds` | Certificate expiration (in seconds) per given node; read more about [certificates in Redis Enterprise]({{< relref "/operate/rs/security/certificates" >}}) and [monitoring certificates]({{< relref "/operate/rs/security/certificates/monitor-certificates" >}}) | +| node_conns | `sum by (node) (endpoint_client_connections)` | Number of clients connected to endpoints on node | +| node_cpu_idle | `avg by (node) (irate(node_cpu_seconds_total{mode="idle"}[1m]))` | CPU idle time portion (0-1, multiply by 100 to get percent) | +| node_cpu_idle_max | N/A | Highest value of CPU idle time portion (0-1, multiply by 100 to get percent) | +| node_cpu_idle_median | N/A | Average value of CPU idle time portion (0-1, multiply by 100 to get percent) | +| node_cpu_idle_min | N/A | Lowest value of CPU idle time portion (0-1, multiply by 100 to get percent) | +| node_cpu_system | `avg by (node) (irate(node_cpu_seconds_total{mode="system"}[1m]))` | CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | +| node_cpu_system_max | N/A | Highest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | +| node_cpu_system_median | N/A | Average value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | +| node_cpu_system_min | N/A | Lowest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | +| node_cpu_user | `avg by (node) (irate(node_cpu_seconds_total{mode="user"}[1m]))` | CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | +| node_cpu_user_max | N/A | Highest value of CPU time portion spent by user-space 
processes (0-1, multiply by 100 to get percent) | +| node_cpu_user_median | N/A | Average value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | +| node_cpu_user_min | N/A | Lowest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | +| node_cur_aof_rewrites | `sum by (cluster, node) (redis_server_aof_rewrite_in_progress)` | Number of AOF rewrites that are currently performed by shards on this node | +| node_egress_bytes | `irate(node_network_transmit_bytes_total{device=""}[1m])` | Rate of outgoing network traffic to node (bytes/sec) | +| node_egress_bytes_max | N/A | Highest value of the rate of outgoing network traffic to node (bytes/sec) | +| node_egress_bytes_median | N/A | Average value of the rate of outgoing network traffic to node (bytes/sec) | +| node_egress_bytes_min | N/A | Lowest value of the rate of outgoing network traffic to node (bytes/sec) | +| node_ephemeral_storage_avail | `node_ephemeral_storage_avail_bytes` | Disk space available to RLEC processes on configured ephemeral disk (bytes) | +| node_ephemeral_storage_free | `node_ephemeral_storage_free_bytes` | Free disk space on configured ephemeral disk (bytes) | +| node_free_memory | `node_memory_MemFree_bytes` | Free memory in the node (bytes) | +| node_ingress_bytes | `irate(node_network_receive_bytes_total{device=""}[1m])` | Rate of incoming network traffic to node (bytes/sec) | +| node_ingress_bytes_max | N/A | Highest value of the rate of incoming network traffic to node (bytes/sec) | +| node_ingress_bytes_median | N/A | Average value of the rate of incoming network traffic to node (bytes/sec) | +| node_ingress_bytes_min | N/A | Lowest value of the rate of incoming network traffic to node (bytes/sec) | +| node_persistent_storage_avail | `node_persistent_storage_avail_bytes` | Disk space available to RLEC processes on configured persistent disk (bytes) | +| node_persistent_storage_free | 
`node_persistent_storage_free_bytes` | Free disk space on configured persistent disk (bytes) | +| node_provisional_flash | `node_provisional_flash_bytes` | Amount of flash available for new shards on this node, taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) | +| node_provisional_flash_no_overbooking | `node_provisional_flash_no_overbooking_bytes` | Amount of flash available for new shards on this node, without taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) | +| node_provisional_memory | `node_provisional_memory_bytes` | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases | +| node_provisional_memory_no_overbooking | `node_provisional_memory_no_overbooking_bytes` | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases, without taking into account overbooking | +| node_total_req | `sum by (cluster, node) (irate(endpoint_total_req[1m]))` | Request rate handled by endpoints on node (ops/sec) | +| node_up | `node_metrics_up` | Node is part of the cluster and is connected | + +## Cluster metrics + +| V1 metric | Equivalent V2 PromQL | Description | +| --------- | :------------------- | :---------- | +| cluster_shards_limit | `license_shards_limit` | Total shard limit by the license by shard type (ram / flash) | + +## Proxy metrics + +| V1 metric | Equivalent V2 PromQL | Description | +| --------- | :------------------- | :---------- | +| listener_acc_latency | N/A | Accumulative latency (sum of the latencies) of all types of commands on the database. 
For the average latency, divide this value by listener_total_res | +| listener_acc_latency_max | N/A | Highest value of accumulative latency of all types of commands on the database | +| listener_acc_other_latency | N/A | Accumulative latency (sum of the latencies) of commands of type "other" on the database. For the average latency, divide this value by listener_other_res | +| listener_acc_other_latency_max | N/A | Highest value of accumulative latency of commands of type "other" on the database | +| listener_acc_read_latency | N/A | Accumulative latency (sum of the latencies) of commands of type "read" on the database. For the average latency, divide this value by listener_read_res | +| listener_acc_read_latency_max | N/A | Highest value of accumulative latency of commands of type "read" on the database | +| listener_acc_write_latency | N/A | Accumulative latency (sum of the latencies) of commands of type "write" on the database. For the average latency, divide this value by listener_write_res | +| listener_acc_write_latency_max | N/A | Highest value of accumulative latency of commands of type "write" on the database | +| listener_auth_cmds | N/A | Number of memcached AUTH commands sent to the database | +| listener_auth_cmds_max | N/A | Highest value of the number of memcached AUTH commands sent to the database | +| listener_auth_errors | N/A | Number of error responses to memcached AUTH commands | +| listener_auth_errors_max | N/A | Highest value of the number of error responses to memcached AUTH commands | +| listener_cmd_flush | N/A | Number of memcached FLUSH_ALL commands sent to the database | +| listener_cmd_flush_max | N/A | Highest value of the number of memcached FLUSH_ALL commands sent to the database | +| listener_cmd_get | N/A | Number of memcached GET commands sent to the database | +| listener_cmd_get_max | N/A | Highest value of the number of memcached GET commands sent to the database | +|
listener_cmd_set | N/A | Number of memcached SET commands sent to the database | +| listener_cmd_set_max | N/A | Highest value of the number of memcached SET commands sent to the database | +| listener_cmd_touch | N/A | Number of memcached TOUCH commands sent to the database | +| listener_cmd_touch_max | N/A | Highest value of the number of memcached TOUCH commands sent to the database | +| listener_conns | N/A | Number of clients connected to the endpoint | +| listener_egress_bytes | N/A | Rate of outgoing network traffic to the endpoint (bytes/sec) | +| listener_egress_bytes_max | N/A | Highest value of the rate of outgoing network traffic to the endpoint (bytes/sec) | +| listener_ingress_bytes | N/A | Rate of incoming network traffic to the endpoint (bytes/sec) | +| listener_ingress_bytes_max | N/A | Highest value of the rate of incoming network traffic to the endpoint (bytes/sec) | +| listener_last_req_time | N/A | Time of last command sent to the database | +| listener_last_res_time | N/A | Time of last response sent from the database | +| listener_max_connections_exceeded | `irate(endpoint_maximal_connections_exceeded[1m])` | Number of times the number of clients connected to the database at the same time has exceeded the max limit | +| listener_max_connections_exceeded_max | N/A | Highest value of the number of times the number of clients connected to the database at the same time has exceeded the max limit | +| listener_monitor_sessions_count | N/A | Number of clients connected in monitor mode to the endpoint | +| listener_other_req | N/A | Rate of other (non-read/write) requests on the endpoint (ops/sec) | +| listener_other_req_max | N/A | Highest value of the rate of other (non-read/write) requests on the endpoint (ops/sec) | +| listener_other_res | N/A | Rate of other (non-read/write) responses on the endpoint (ops/sec) | +| listener_other_res_max | N/A | Highest value of the rate of other (non-read/write) responses on the endpoint (ops/sec) | +| 
listener_other_started_res | N/A | Number of responses sent from the database of type "other" | +| listener_other_started_res_max | N/A | Highest value of the number of responses sent from the database of type "other" | +| listener_read_req | `irate(endpoint_read_requests[1m])` | Rate of read requests on the endpoint (ops/sec) | +| listener_read_req_max | N/A | Highest value of the rate of read requests on the endpoint (ops/sec) | +| listener_read_res | `irate(endpoint_read_responses[1m])` | Rate of read responses on the endpoint (ops/sec) | +| listener_read_res_max | N/A | Highest value of the rate of read responses on the endpoint (ops/sec) | +| listener_read_started_res | N/A | Number of responses sent from the database of type "read" | +| listener_read_started_res_max | N/A | Highest value of the number of responses sent from the database of type "read" | +| listener_total_connections_received | `irate(endpoint_total_connections_received[1m])` | Rate of new client connections to the endpoint (connections/sec) | +| listener_total_connections_received_max | N/A | Highest value of the rate of new client connections to the endpoint (connections/sec) | +| listener_total_req | N/A | Request rate handled by the endpoint (ops/sec) | +| listener_total_req_max | N/A | Highest value of the rate of all requests on the endpoint (ops/sec) | +| listener_total_res | N/A | Rate of all responses on the endpoint (ops/sec) | +| listener_total_res_max | N/A | Highest value of the rate of all responses on the endpoint (ops/sec) | +| listener_total_started_res | N/A | Number of responses sent from the database of all types | +| listener_total_started_res_max | N/A | Highest value of the number of responses sent from the database of all types | +| listener_write_req | `irate(endpoint_write_requests[1m])` | Rate of write requests on the endpoint (ops/sec) | +| listener_write_req_max | N/A | Highest value of the rate of write requests on the endpoint (ops/sec) | +| listener_write_res | 
`irate(endpoint_write_responses[1m])` | Rate of write responses on the endpoint (ops/sec) | +| listener_write_res_max | N/A | Highest value of the rate of write responses on the endpoint (ops/sec) | +| listener_write_started_res | N/A | Number of responses sent from the database of type "write" | +| listener_write_started_res_max | N/A | Highest value of the number of responses sent from the database of type "write" | + +## Replication metrics + +| V1 metric | Equivalent V2 PromQL | Description | +| --------- | :------------------- | :---------- | +| bdb_replicaof_syncer_ingress_bytes | `rate(replica_src_ingress_bytes[1m])` | Rate of compressed incoming network traffic to a Replica Of database (bytes/sec) | +| bdb_replicaof_syncer_ingress_bytes_decompressed | `rate(replica_src_ingress_bytes_decompressed[1m])` | Rate of decompressed incoming network traffic to a Replica Of database (bytes/sec) | +| bdb_replicaof_syncer_local_ingress_lag_time | `database_syncer_lag_ms{syncer_type="replicaof"}` | Lag time between the source and the destination for Replica Of traffic (ms) | +| bdb_replicaof_syncer_status | `database_syncer_current_status{syncer_type="replicaof"}` | Syncer status for Replica Of traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | +| bdb_crdt_syncer_ingress_bytes | `rate(crdt_src_ingress_bytes[1m])` | Rate of compressed incoming network traffic to CRDB (bytes/sec) | +| bdb_crdt_syncer_ingress_bytes_decompressed | `rate(crdt_src_ingress_bytes_decompressed[1m])` | Rate of decompressed incoming network traffic to CRDB (bytes/sec) | +| bdb_crdt_syncer_local_ingress_lag_time | `database_syncer_lag_ms{syncer_type="crdt"}` | Lag time between the source and the destination (ms) for CRDB traffic | +| bdb_crdt_syncer_status | `database_syncer_current_status{syncer_type="crdt"}` | Syncer status for CRDB traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | + +## Shard metrics + +| V1 metric | Equivalent V2 PromQL | Description | +| --------- | :------------------- | 
:---------- | +| redis_active_defrag_running | `redis_server_active_defrag_running` | Automatic memory defragmentation current aggressiveness (% cpu) | +| redis_allocator_active | `redis_server_allocator_active` | Total used memory, including external fragmentation | +| redis_allocator_allocated | `redis_server_allocator_allocated` | Total allocated memory | +| redis_allocator_resident | `redis_server_allocator_resident` | Total resident memory (RSS) | +| redis_aof_last_cow_size | `redis_server_aof_last_cow_size` | Last AOFR, CopyOnWrite memory | +| redis_aof_rewrite_in_progress | `redis_server_aof_rewrite_in_progress` | The number of simultaneous AOF rewrites that are in progress | +| redis_aof_rewrites | `redis_server_aof_rewrites` | Number of AOF rewrites this process executed | +| redis_aof_delayed_fsync | `redis_server_aof_delayed_fsync` | Number of times an AOF fsync caused delays in the main Redis thread (inducing latency); this can indicate that the disk is slow or overloaded | +| redis_blocked_clients | `redis_server_blocked_clients` | Count the clients waiting on a blocking call | +| redis_connected_clients | `redis_server_connected_clients` | Number of client connections to the specific shard | +| redis_connected_slaves | `redis_server_connected_slaves` | Number of connected replicas | +| redis_db0_avg_ttl | `redis_server_db0_avg_ttl` | Average TTL of all volatile keys | +| redis_db0_expires | `redis_server_expired_keys` | Total count of volatile keys | +| redis_db0_keys | `redis_server_db0_keys` | Total key count | +| redis_evicted_keys | `redis_server_evicted_keys` | Keys evicted so far (since restart) | +| redis_expire_cycle_cpu_milliseconds | `redis_server_expire_cycle_cpu_milliseconds` | The cumulative amount of time spent on active expiry cycles | +| redis_expired_keys | `redis_server_expired_keys` | Keys expired so far (since restart) | +| redis_forwarding_state | `redis_server_forwarding_state` | Shard forwarding state (on or off) | +| 
redis_keys_trimmed | `redis_server_keys_trimmed` | The number of keys that were trimmed in the current or last resharding process | +| redis_keyspace_read_hits | `redis_server_keyspace_read_hits` | Number of read operations accessing an existing keyspace | +| redis_keyspace_read_misses | `redis_server_keyspace_read_misses` | Number of read operations accessing a non-existing keyspace | +| redis_keyspace_write_hits | `redis_server_keyspace_write_hits` | Number of write operations accessing an existing keyspace | +| redis_keyspace_write_misses | `redis_server_keyspace_write_misses` | Number of write operations accessing a non-existing keyspace | +| redis_master_link_status | `redis_server_master_link_status` | Indicates if the replica is connected to its master | +| redis_master_repl_offset | `redis_server_master_repl_offset` | Number of bytes sent to replicas by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_master_sync_in_progress | `redis_server_master_sync_in_progress` | The master shard is synchronizing (1 true; 0 false) | +| redis_max_process_mem | `redis_server_max_process_mem` | Current memory limit configured by redis_mgr according to node free memory | +| redis_maxmemory | `redis_server_maxmemory` | Current memory limit configured by redis_mgr according to database memory limits | +| redis_mem_aof_buffer | `redis_server_mem_aof_buffer` | Current size of AOF buffer | +| redis_mem_clients_normal | `redis_server_mem_clients_normal` | Current memory used for input and output buffers of non-replica clients | +| redis_mem_clients_slaves | `redis_server_mem_clients_slaves` | Current memory used for input and output buffers of replica clients | +| redis_mem_fragmentation_ratio | `redis_server_mem_fragmentation_ratio` | Memory fragmentation ratio (1.3 means 30% overhead) | +| redis_mem_not_counted_for_evict | `redis_server_mem_not_counted_for_evict` | Portion of used_memory (in bytes) that's not counted for 
eviction and OOM error | +| redis_mem_replication_backlog | `redis_server_mem_replication_backlog` | Size of replication backlog | +| redis_module_fork_in_progress | `redis_server_module_fork_in_progress` | A binary value that indicates if there is an active fork spawned by a module (1) or not (0) | +| redis_process_cpu_system_seconds_total | `namedprocess_namegroup_cpu_seconds_total{mode="system"}` | Shard process system CPU time spent in seconds | +| redis_process_cpu_usage_percent | `namedprocess_namegroup_cpu_seconds_total{mode=~"system\|user"}` | Shard process CPU usage percentage | +| redis_process_cpu_user_seconds_total | `namedprocess_namegroup_cpu_seconds_total{mode="user"}` | Shard user CPU time spent in seconds | +| redis_process_main_thread_cpu_system_seconds_total | `namedprocess_namegroup_thread_cpu_seconds_total{mode="system",threadname="redis-server"}` | Shard main thread system CPU time spent in seconds | +| redis_process_main_thread_cpu_user_seconds_total | `namedprocess_namegroup_thread_cpu_seconds_total{mode="user",threadname="redis-server"}` | Shard main thread user CPU time spent in seconds | +| redis_process_max_fds | `max(namedprocess_namegroup_open_filedesc)` | Shard maximum number of open file descriptors | +| redis_process_open_fds | `namedprocess_namegroup_open_filedesc` | Shard number of open file descriptors | +| redis_process_resident_memory_bytes | `namedprocess_namegroup_memory_bytes{memtype="resident"}` | Shard resident memory size in bytes | +| redis_process_start_time_seconds | `namedprocess_namegroup_oldest_start_time_seconds` | Shard start time of the process since unix epoch in seconds | +| redis_process_virtual_memory_bytes | `namedprocess_namegroup_memory_bytes{memtype="virtual"}` | Shard virtual memory in bytes | +| redis_rdb_bgsave_in_progress | `redis_server_rdb_bgsave_in_progress` | Indication if bgsave is currently in progress | +| redis_rdb_last_cow_size | `redis_server_rdb_last_cow_size` | Last bgsave (or SYNC fork) 
used CopyOnWrite memory | +| redis_rdb_saves | `redis_server_rdb_saves` | Total count of bgsaves since the process was restarted (including replica fullsync and persistence) | +| redis_repl_touch_bytes | `redis_server_repl_touch_bytes` | Number of bytes sent to replicas as TOUCH commands by the shard as a result of a READ command that was processed; calculate the throughput for a time period by comparing the value at different times | +| redis_total_commands_processed | `redis_server_total_commands_processed` | Number of commands processed by the shard; calculate the number of commands for a time period by comparing the value at different times | +| redis_total_connections_received | `redis_server_total_connections_received` | Number of connections received by the shard; calculate the number of connections for a time period by comparing the value at different times | +| redis_total_net_input_bytes | `redis_server_total_net_input_bytes` | Number of bytes received by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_total_net_output_bytes | `redis_server_total_net_output_bytes` | Number of bytes sent by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_up | `redis_server_up` | Shard is up and running | +| redis_used_memory | `redis_server_used_memory` | Memory used by shard (in BigRedis this includes flash) (bytes) | diff --git a/content/embeds/rs-prometheus-metrics-v2.md b/content/embeds/rs-prometheus-metrics-v2.md new file mode 100644 index 0000000000..344ae8c85d --- /dev/null +++ b/content/embeds/rs-prometheus-metrics-v2.md @@ -0,0 +1,199 @@ +## Database metrics + +| Metric | Type | Description | +| :-------- | :--- | :---------- | +| endpoint_client_connections | counter | Number of client connection establishment events | +| endpoint_client_disconnections | counter | Number of client disconnections initiated by the client | +| 
endpoint_client_connection_expired | counter | Total number of client connections with expired TTL (Time To Live) | +| endpoint_client_establishment_failures | counter | Number of client connections that failed to establish properly | +| endpoint_client_expiration_refresh | counter | Number of expiration time changes of clients | +| endpoint_client_tracking_off_requests | counter | Total number of `CLIENT TRACKING OFF` requests | +| endpoint_client_tracking_on_requests | counter | Total number of `CLIENT TRACKING ON` requests | +| endpoint_disconnected_cba_client | counter | Number of certificate-based clients disconnected | +| endpoint_disconnected_ldap_client | counter | Number of LDAP clients disconnected | +| endpoint_disconnected_user_password_client | counter | Number of user&password clients disconnected | +| endpoint_disposed_commands_after_client_caching | counter | Total number of client caching commands that were disposed due to misuse | +| endpoint_egress | counter | Number of egress bytes | +| endpoint_egress_pending | counter | Number of send-pending bytes | +| endpoint_egress_pending_discarded | counter | Number of send-pending bytes that were discarded due to disconnection | +| endpoint_failed_cba_authentication | counter | Number of clients that failed certificate-based authentication | +| endpoint_failed_ldap_authentication | counter | Number of clients that failed LDAP authentication | +| endpoint_failed_user_password_authentication | counter | Number of clients that failed user password authentication | +| endpoint_ingress | counter | Number of ingress bytes | +| endpoint_longest_pipeline_histogram | counter | Tracks the distribution of longest observed pipeline lengths, where a pipeline is a sequence of client commands sent without waiting for responses. 
| +| endpoint_other_requests | counter | Number of other requests | +| endpoint_other_requests_latency_histogram | histogram | Latency (in µs) histogram of other commands | +| endpoint_other_requests_latency_histogram_bucket | histogram | Latency histograms for commands other than read or write commands. Can be used to represent different latency percentiles.
p99.9 example:
`histogram_quantile(0.999, sum(rate(endpoint_other_requests_latency_histogram_bucket{cluster="$cluster", db="$db"}[$__rate_interval]) ) by (le, db))` | +| endpoint_other_responses | counter | Number of other responses | +| endpoint_proxy_disconnections | counter | Number of client disconnections initiated by the proxy | +| endpoint_read_requests | counter | Number of read requests | +| endpoint_read_requests_latency_histogram | histogram | Latency (in µs) histogram of read commands | +| endpoint_read_requests_latency_histogram_bucket | histogram | Latency histograms for read commands. Can be used to represent different latency percentiles.
p99.9 example:
`histogram_quantile(0.999, sum(rate(endpoint_read_requests_latency_histogram_bucket{cluster="$cluster", db="$db"}[$__rate_interval]) ) by (le, db))` | +| endpoint_read_responses | counter | Number of read responses | +| endpoint_successful_cba_authentication | counter | Number of clients that successfully authenticated with certificate-based authentication | +| endpoint_successful_ldap_authentication | counter | Number of clients that successfully authenticated with LDAP | +| endpoint_successful_user_password_authentication | counter | Number of clients that successfully authenticated with user&password | +| endpoint_write_requests | counter | Number of write requests | +| endpoint_write_requests_latency_histogram | histogram | Latency (in µs) histogram of write commands | +| endpoint_write_requests_latency_histogram_bucket | histogram | Latency histograms for write commands. Can be used to represent different latency percentiles.
p99.9 example:
`histogram_quantile(0.999, sum(rate(endpoint_write_requests_latency_histogram_bucket{cluster="$cluster", db="$db"}[$__rate_interval]) ) by (le, db))` | +| endpoint_write_responses | counter | Number of write responses | + +## Node metrics + +| Metric | Type |Description | +| :-------- | :--- | :---------- | +| node_available_flash_bytes | gauge | Available flash in the node (bytes) | +| node_available_flash_no_overbooking_bytes | gauge | Available flash in the node (bytes), without taking into account overbooking | +| node_available_memory_bytes | gauge | Amount of free memory in the node (bytes) that is available for database provisioning | +| node_available_memory_no_overbooking_bytes | gauge | Available RAM in the node (bytes) without taking into account overbooking | +| node_bigstore_free_bytes | gauge | Sum of free space of back-end flash (used by flash database's [BigRedis]) on all cluster nodes (bytes); returned only when BigRedis is enabled | +| node_cert_expires_in_seconds | gauge | Certificate expiration (in seconds) per given node; read more about [certificates in Redis Enterprise]({{< relref "/operate/rs/security/certificates" >}}) and [monitoring certificates]({{< relref "/operate/rs/security/certificates/monitor-certificates" >}}) | +| node_ephemeral_storage_avail_bytes | gauge | Disk space available to RLEC processes on configured ephemeral disk (bytes) | +| node_ephemeral_storage_free_bytes | gauge | Free disk space on configured ephemeral disk (bytes) | +| node_memory_MemFree_bytes | gauge | Free memory in the node (bytes) | +| node_persistent_storage_avail_bytes | gauge | Disk space available to RLEC processes on configured persistent disk (bytes) | +| node_persistent_storage_free_bytes | gauge | Free disk space on configured persistent disk (bytes) | +| node_provisional_flash_bytes | gauge | Amount of flash available for new shards on this node, taking into account overbooking, max Redis servers, reserved flash, and provision and migration 
thresholds (bytes) | +| node_provisional_flash_no_overbooking_bytes | gauge | Amount of flash available for new shards on this node, without taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) | +| node_provisional_memory_bytes | gauge | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases | +| node_provisional_memory_no_overbooking_bytes | gauge | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases, without taking into account overbooking | +| node_metrics_up | gauge | Node is part of the cluster and is connected | + +## Cluster metrics + +| Metric | Type | Description | +| :-------- | :--- | :---------- | +| generation{cluster_wd=} | gauge| Generation number of the specific cluster_wd| +| has_qourum{cluster_wd=, has_witness_disk=BOOL} | gauge| Has quorum = 1
No quorum = 0 | +| is_primary{cluster_wd=} | gauge| primary = 1
secondary = 0 | +| license_shards_limit | gauge | Total shard limit by the license by shard type (ram / flash) | +| total_live_nodes_count{cluster_wd=} | gauge| Number of live nodes| +| total_node_count{cluster_wd=} | gauge| Number of nodes | +| total_primary_selection_ended{cluster_wd=} | counter | Monotonic counter for each selection process that ended | +| total_primary_selections{cluster_wd=} | counter | Monotonic counter for each selection process that started| + +## Replication metrics + +| Metric | Type | Description | +| :-------- | :--- | :---------- | +| database_syncer_config | gauge | Used as a placeholder for configuration labels | +| database_syncer_current_status | gauge | Syncer status for traffic; 0 = in-sync, 2 = out of sync | +| database_syncer_dst_connectivity_state | gauge | Destination connectivity state | +| database_syncer_dst_connectivity_state_ms | gauge | Destination connectivity state duration | +| database_syncer_dst_lag | gauge | Lag in milliseconds between the syncer and the destination | +| database_syncer_dst_repl_offset | gauge | Offset of the last command acknowledged | +| database_syncer_flush_counter | gauge | Number of destination flushes | +| database_syncer_ingress_bytes | gauge | Number of bytes read from source shard | +| database_syncer_ingress_bytes_decompressed | gauge | Number of bytes read from source shard | +| database_syncer_internal_state | gauge | Internal state of the syncer | +| database_syncer_lag_ms | gauge | Lag time between the source and the destination for traffic in milliseconds | +| database_syncer_rdb_size | gauge | The source's RDB size in bytes to be transferred during the syncing phase | +| database_syncer_rdb_transferred | gauge | Number of bytes transferred from the source's RDB during the syncing phase | +| database_syncer_src_connectivity_state | gauge | Source connectivity state | +| database_syncer_src_connectivity_state_ms | gauge | Source connectivity state duration | +| 
database_syncer_src_repl_offset | gauge | Last known source offset | +| database_syncer_state | gauge | Internal state of the shard syncer | +| database_syncer_syncer_repl_offset | gauge | Offset of the last command handled by the syncer | +| database_syncer_total_requests | gauge | Number of destination writes | +| database_syncer_total_responses | gauge | Number of destination writes acknowledged | + +## Shard metrics + +| Metric | Description | +| :-------- | :---------- | +| redis_server_active_defrag_running | Automatic memory defragmentation current aggressiveness (% cpu) | +| redis_server_allocator_active | Total used memory, including external fragmentation | +| redis_server_allocator_allocated | Total allocated memory | +| redis_server_allocator_resident | Total resident memory (RSS) | +| redis_server_aof_last_cow_size | Last AOFR, CopyOnWrite memory | +| redis_server_aof_rewrite_in_progress | The number of simultaneous AOF rewrites that are in progress | +| redis_server_aof_rewrites | Number of AOF rewrites this process executed | +| redis_server_aof_delayed_fsync | Number of times an AOF fsync caused delays in the main Redis thread (inducing latency); this can indicate that the disk is slow or overloaded | +| redis_server_blocked_clients | Count the clients waiting on a blocking call | +| redis_server_connected_clients | Number of client connections to the specific shard | +| redis_server_connected_slaves | Number of connected replicas | +| redis_server_db0_avg_ttl | Average TTL of all volatile keys | +| redis_server_db0_expires | Total count of volatile keys | +| redis_server_db0_keys | Total key count | +| redis_server_evicted_keys | Keys evicted so far (since restart) | +| redis_server_expire_cycle_cpu_milliseconds | The cumulative amount of time spent on active expiry cycles | +| redis_server_expired_keys | Keys expired so far (since restart) | +| redis_server_forwarding_state | Shard forwarding state (on or off) | +| redis_server_keys_trimmed | The 
number of keys that were trimmed in the current or last resharding process | +| redis_server_keyspace_read_hits | Number of read operations accessing an existing keyspace | +| redis_server_keyspace_read_misses | Number of read operations accessing a non-existing keyspace | +| redis_server_keyspace_write_hits | Number of write operations accessing an existing keyspace | +| redis_server_keyspace_write_misses | Number of write operations accessing a non-existing keyspace | +| redis_server_master_link_status | Indicates if the replica is connected to its master | +| redis_server_master_repl_offset | Number of bytes sent to replicas by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_server_master_sync_in_progress | The primary shard is synchronizing (1 true; 0 false) | +| redis_server_max_process_mem | Current memory limit configured by redis_mgr according to node free memory | +| redis_server_maxmemory | Current memory limit configured by redis_mgr according to database memory limits | +| redis_server_mem_aof_buffer | Current size of AOF buffer | +| redis_server_mem_clients_normal | Current memory used for input and output buffers of non-replica clients | +| redis_server_mem_clients_slaves | Current memory used for input and output buffers of replica clients | +| redis_server_mem_fragmentation_ratio | Memory fragmentation ratio (1.3 means 30% overhead) | +| redis_server_mem_not_counted_for_evict | Portion of used_memory (in bytes) that's not counted for eviction and OOM error | +| redis_server_mem_replication_backlog | Size of replication backlog | +| redis_server_module_fork_in_progress | A binary value that indicates if there is an active fork spawned by a module (1) or not (0) | +| namedprocess_namegroup_cpu_seconds_total | Shard process CPU usage in seconds | +| namedprocess_namegroup_thread_cpu_seconds_total | Shard main thread CPU time spent in seconds | +| namedprocess_namegroup_open_filedesc | Shard 
number of open file descriptors | +| namedprocess_namegroup_memory_bytes | Shard memory size in bytes | +| namedprocess_namegroup_oldest_start_time_seconds | Shard start time of the process since unix epoch in seconds | +| redis_server_rdb_bgsave_in_progress | Indication if bgsave is currently in progress | +| redis_server_rdb_last_cow_size | Last bgsave (or SYNC fork) used CopyOnWrite memory | +| redis_server_rdb_saves | Total count of bgsaves since the process was restarted (including replica fullsync and persistence) | +| redis_server_repl_touch_bytes | Number of bytes sent to replicas as TOUCH commands by the shard as a result of a READ command that was processed; calculate the throughput for a time period by comparing the value at different times | +| redis_server_total_commands_processed | Number of commands processed by the shard; calculate the number of commands for a time period by comparing the value at different times | +| redis_server_total_connections_received | Number of connections received by the shard; calculate the number of connections for a time period by comparing the value at different times | +| redis_server_total_net_input_bytes | Number of bytes received by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_server_total_net_output_bytes | Number of bytes sent by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_server_up | Shard is up and running | +| redis_server_used_memory | Memory used by shard (in BigRedis this includes flash) (bytes) | +| redis_server_search_number_of_indexes | Total number of indexes in the shard [1](#tnote-1) | +| redis_server_search_number_of_active_indexes | The total number of indexes running a background indexing and/or background query processing operation. Background indexing refers to vector ingestion process, or in-progress background indexer. 
[1](#tnote-1) | +| redis_server_search_number_of_active_indexes_running_queries | Total count of indexes currently running a background query process. [1](#tnote-1) | +| redis_server_search_number_of_active_indexes_indexing | Total count of indexes currently undergoing a background indexing process. Background indexing refers to vector ingestion process, or in-progress background indexer. This metric is limited by the number of WORKER threads allocated for writing operations + the number of indexes. [1](#tnote-1) | +| redis_server_search_total_active_write_threads | Total count of background write (indexing) processes currently running in the shard. Background indexing refers to vector ingestion process, or in-progress background indexer. This metric is limited by the number of threads allocated for writing operations. [1](#tnote-1) | +| redis_server_search_fields_text_Text | The total number of `TEXT` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_text_Sortable | The total number of `SORTABLE TEXT` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_text_NoIndex | The total number of `NOINDEX TEXT` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_numeric_Numeric | The total number of `NUMERIC` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_numeric_Sortable | The total number of `SORTABLE NUMERIC` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_numeric_NoIndex | The total number of `NOINDEX NUMERIC` fields across all indexes in the shard, which are used for sorting only but not indexed. This field appears only if its value is larger than 0. 
[1](#tnote-1) | +| redis_server_search_fields_tag_Tag | The total number of `TAG` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_tag_Sortable | The total number of `SORTABLE TAG` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_tag_NoIndex | The total number of `NOINDEX TAG` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_tag_CaseSensitive | The total number of `CASESENSITIVE TAG` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_geo_Geo | The total number of `GEO` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_geo_Sortable | The total number of `SORTABLE GEO` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_geo_NoIndex | The total number of `NOINDEX GEO` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) | +| redis_server_search_fields_vector_Vector | The total number of `VECTOR` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_vector_Flat | The total number of `FLAT VECTOR` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_vector_HNSW | The total number of `HNSW VECTOR` fields across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_fields_geoshape_Geoshape | The total number of `GEOSHAPE` fields across all indexes in the shard. [2](#tnote-2) | +| redis_server_search_fields_geoshape_Sortable | The total number of `SORTABLE GEOSHAPE` fields across all indexes in the shard. 
This field appears only if its value is larger than 0. [2](#tnote-2) | +| redis_server_search_fields_geoshape_NoIndex | The total number of `NOINDEX GEOSHAPE` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [2](#tnote-2) | +| `redis_server_search_fields_<field_type>_IndexErrors` | The total number of indexing failures caused by attempts to index a document containing a `<field_type>` field. [1](#tnote-1) | +| redis_server_search_used_memory_indexes | The total memory allocated by all indexes in the shard in bytes. [1](#tnote-1) | +| redis_server_search_smallest_memory_index | The memory usage of the index with the smallest memory usage in the shard in bytes. [1](#tnote-1) | +| redis_server_search_largest_memory_index | The memory usage of the index with the largest memory usage in the shard in bytes. [1](#tnote-1) | +| redis_server_search_total_indexing_time | The total time spent on indexing operations, excluding the background indexing of vectors in the `HNSW` graph. [1](#tnote-1) | +| redis_server_search_used_memory_vector_index | The total memory usage of all vector indexes in the shard. [1](#tnote-1) | +| redis_server_search_global_idle | The total number of user and internal cursors currently holding pending results in the shard. [1](#tnote-1) | +| redis_server_search_global_total | The total number of user and internal cursors in the shard, either holding pending results or actively executing `FT.CURSOR READ`. [1](#tnote-1) | +| redis_server_search_bytes_collected | The total amount of memory freed by the garbage collectors from indexes in the shard memory in bytes. [1](#tnote-1) | +| redis_server_search_total_cycles | The total number of garbage collection cycles executed. [1](#tnote-1) | +| redis_server_search_total_ms_run | The total duration of all garbage collection cycles in the shard, measured in milliseconds. 
[1](#tnote-1) | +| redis_server_search_total_docs_not_collected_by_gc | The number of documents marked as deleted whose memory has not yet been freed by the garbage collector. [1](#tnote-1) | +| redis_server_search_marked_deleted_vectors | The number of vectors marked as deleted in the vector indexes that have not yet been cleaned. [1](#tnote-1) | +| redis_server_search_total_queries_processed | The total number of successful query executions (when using cursors, not counting reads from existing cursors) in the shard. [1](#tnote-1) | +| redis_server_search_total_query_commands | The total number of successful query command executions (including `FT.SEARCH`, `FT.AGGREGATE`, and `FT.CURSOR READ`). [1](#tnote-1) | +| redis_server_search_total_query_execution_time_ms | The cumulative execution time of all query commands, including `FT.SEARCH`, `FT.AGGREGATE`, and `FT.CURSOR READ`, measured in ms. [1](#tnote-1) | +| redis_server_search_total_active_queries | The total number of background queries currently being executed in the shard, excluding `FT.CURSOR READ`. [1](#tnote-1) | +| redis_server_search_errors_indexing_failures | The total number of indexing failures recorded across all indexes in the shard. [1](#tnote-1) | +| redis_server_search_errors_for_index_with_max_failures | The number of indexing failures in the index with the highest count of failures. [1](#tnote-1) | + +1. Available since RediSearch 2.6. +2. Available since RediSearch 2.8. \ No newline at end of file diff --git a/content/integrate/prometheus-with-redis-enterprise/_index.md b/content/integrate/prometheus-with-redis-enterprise/_index.md index d06cf30e9e..48fa630884 100644 --- a/content/integrate/prometheus-with-redis-enterprise/_index.md +++ b/content/integrate/prometheus-with-redis-enterprise/_index.md @@ -12,6 +12,7 @@ summary: You can use Prometheus and Grafana to collect and visualize your Redis Software metrics. 
type: integration weight: 5 +tocEmbedHeaders: true --- You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. diff --git a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md index bcb3312d0f..91d02ba868 100644 --- a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md +++ b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md @@ -11,6 +11,7 @@ linkTitle: Prometheus metrics v2 summary: V2 metrics available to Prometheus as of Redis Enterprise Software version 7.8.2. type: integration weight: 50 +tocEmbedHeaders: true --- {{}} @@ -21,202 +22,4 @@ You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}). -## Database metrics - -| Metric | Type | Description | -| :-------- | :--- | :---------- | -| endpoint_client_connections | counter | Number of client connection establishment events | -| endpoint_client_disconnections | counter | Number of client disconnections initiated by the client | -| endpoint_client_connection_expired | counter | Total number of client connections with expired TTL (Time To Live) | -| endpoint_client_establishment_failures | counter | Number of client connections that failed to establish properly | -| endpoint_client_expiration_refresh | counter | Number of expiration time changes of clients | -| endpoint_client_tracking_off_requests | counter | Total number of `CLIENT TRACKING OFF` requests | -| endpoint_client_tracking_on_requests | counter | Total number of `CLIENT TRACKING ON` requests | -| endpoint_disconnected_cba_client | counter | Number of certificate-based clients disconnected | -| endpoint_disconnected_ldap_client | counter | Number of LDAP clients disconnected | -| endpoint_disconnected_user_password_client | counter | Number of user&password clients disconnected | -| 
endpoint_disposed_commands_after_client_caching | counter | Total number of client caching commands that were disposed due to misuse | -| endpoint_egress | counter | Number of egress bytes | -| endpoint_egress_pending | counter | Number of send-pending bytes | -| endpoint_egress_pending_discarded | counter | Number of send-pending bytes that were discarded due to disconnection | -| endpoint_failed_cba_authentication | counter | Number of clients that failed certificate-based authentication | -| endpoint_failed_ldap_authentication | counter | Number of clients that failed LDAP authentication | -| endpoint_failed_user_password_authentication | counter | Number of clients that failed user password authentication | -| endpoint_ingress | counter | Number of ingress bytes | -| endpoint_longest_pipeline_histogram | counter | Tracks the distribution of longest observed pipeline lengths, where a pipeline is a sequence of client commands sent without waiting for responses. | -| endpoint_other_requests | counter | Number of other requests | -| endpoint_other_requests_latency_histogram | histogram | Latency (in µs) histogram of other commands | -| endpoint_other_requests_latency_histogram_bucket | histogram | Latency histograms for commands other than read or write commands. Can be used to represent different latency percentiles.
p99.9 example:
`histogram_quantile(0.999, sum(rate(endpoint_other_requests_latency_histogram_bucket{cluster="$cluster", db="$db"}[$__rate_interval]) ) by (le, db))` | -| endpoint_other_responses | counter | Number of other responses | -| endpoint_proxy_disconnections | counter | Number of client disconnections initiated by the proxy | -| endpoint_read_requests | counter | Number of read requests | -| endpoint_read_requests_latency_histogram | histogram | Latency (in µs) histogram of read commands | -| endpoint_read_requests_latency_histogram_bucket | histogram | Latency histograms for read commands. Can be used to represent different latency percentiles.
p99.9 example:
`histogram_quantile(0.999, sum(rate(endpoint_read_requests_latency_histogram_bucket{cluster="$cluster", db="$db"}[$__rate_interval]) ) by (le, db))` | -| endpoint_read_responses | counter | Number of read responses | -| endpoint_successful_cba_authentication | counter | Number of clients that successfully authenticated with certificate-based authentication | -| endpoint_successful_ldap_authentication | counter | Number of clients that successfully authenticated with LDAP | -| endpoint_successful_user_password_authentication | counter | Number of clients that successfully authenticated with user&password | -| endpoint_write_requests | counter | Number of write requests | -| endpoint_write_requests_latency_histogram | histogram | Latency (in µs) histogram of write commands | -| endpoint_write_requests_latency_histogram_bucket | histogram | Latency histograms for write commands. Can be used to represent different latency percentiles.
p99.9 example: `histogram_quantile(0.999, sum(rate(endpoint_write_requests_latency_histogram_bucket{cluster="$cluster", db="$db"}[$__rate_interval])) by (le, db))` |
-| endpoint_write_responses | counter | Number of write responses |
-
-## Node metrics
-
-| Metric | Type | Description |
-| :-------- | :--- | :---------- |
-| node_available_flash_bytes | gauge | Available flash in the node (bytes) |
-| node_available_flash_no_overbooking_bytes | gauge | Available flash in the node (bytes), without taking into account overbooking |
-| node_available_memory_bytes | gauge | Amount of free memory in the node (bytes) that is available for database provisioning |
-| node_available_memory_no_overbooking_bytes | gauge | Available RAM in the node (bytes), without taking into account overbooking |
-| node_bigstore_free_bytes | gauge | Sum of free space of back-end flash (used by flash database's [BigRedis]) on all cluster nodes (bytes); returned only when BigRedis is enabled |
-| node_cert_expires_in_seconds | gauge | Certificate expiration (in seconds) per given node; read more about [certificates in Redis Enterprise]({{< relref "/operate/rs/security/certificates" >}}) and [monitoring certificates]({{< relref "/operate/rs/security/certificates/monitor-certificates" >}}) |
-| node_ephemeral_storage_avail_bytes | gauge | Disk space available to RLEC processes on configured ephemeral disk (bytes) |
-| node_ephemeral_storage_free_bytes | gauge | Free disk space on configured ephemeral disk (bytes) |
-| node_memory_MemFree_bytes | gauge | Free memory in the node (bytes) |
-| node_persistent_storage_avail_bytes | gauge | Disk space available to RLEC processes on configured persistent disk (bytes) |
-| node_persistent_storage_free_bytes | gauge | Free disk space on configured persistent disk (bytes) |
-| node_provisional_flash_bytes | gauge | Amount of flash available for new shards on this node, taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) |
-| node_provisional_flash_no_overbooking_bytes | gauge | Amount of flash available for new shards on this node, without taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) |
-| node_provisional_memory_bytes | gauge | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases |
-| node_provisional_memory_no_overbooking_bytes | gauge | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases, without taking into account overbooking |
-| node_metrics_up | gauge | Node is part of the cluster and is connected |
-
-## Cluster metrics
-
-| Metric | Type | Description |
-| :-------- | :--- | :---------- |
-| generation{cluster_wd=} | gauge | Generation number of the specific cluster_wd |
-| has_qourum{cluster_wd=, has_witness_disk=BOOL} | gauge | Has quorum = 1, no quorum = 0 |
-| is_primary{cluster_wd=} | gauge | Primary = 1, secondary = 0 |
-| license_shards_limit | gauge | Total shard limit by the license, by shard type (ram / flash) |
-| total_live_nodes_count{cluster_wd=} | gauge | Number of live nodes |
-| total_node_count{cluster_wd=} | gauge | Number of nodes |
-| total_primary_selection_ended{cluster_wd=} | counter | Monotonic counter for each selection process that ended |
-| total_primary_selections{cluster_wd=} | counter | Monotonic counter for each selection process that started |
-
-## Replication metrics
-
-| Metric | Type | Description |
-| :-------- | :--- | :---------- |
-| database_syncer_config | gauge | Used as a placeholder for configuration labels |
-| database_syncer_current_status | gauge | Syncer status for traffic; 0 = in-sync, 2 = out of sync |
-| database_syncer_dst_connectivity_state | gauge | Destination connectivity state |
-| database_syncer_dst_connectivity_state_ms | gauge | Destination connectivity state duration |
-| database_syncer_dst_lag | gauge | Lag in milliseconds between the syncer and the destination |
-| database_syncer_dst_repl_offset | gauge | Offset of the last command acknowledged |
-| database_syncer_flush_counter | gauge | Number of destination flushes |
-| database_syncer_ingress_bytes | gauge | Number of bytes read from the source shard |
-| database_syncer_ingress_bytes_decompressed | gauge | Number of bytes read from the source shard, after decompression |
-| database_syncer_internal_state | gauge | Internal state of the syncer |
-| database_syncer_lag_ms | gauge | Lag time between the source and the destination for traffic, in milliseconds |
-| database_syncer_rdb_size | gauge | The source's RDB size in bytes to be transferred during the syncing phase |
-| database_syncer_rdb_transferred | gauge | Number of bytes transferred from the source's RDB during the syncing phase |
-| database_syncer_src_connectivity_state | gauge | Source connectivity state |
-| database_syncer_src_connectivity_state_ms | gauge | Source connectivity state duration |
-| database_syncer_src_repl_offset | gauge | Last known source offset |
-| database_syncer_state | gauge | Internal state of the shard syncer |
-| database_syncer_syncer_repl_offset | gauge | Offset of the last command handled by the syncer |
-| database_syncer_total_requests | gauge | Number of destination writes |
-| database_syncer_total_responses | gauge | Number of destination writes acknowledged |
-
-## Shard metrics
-
-| Metric | Description |
-| :-------- | :---------- |
-| redis_server_active_defrag_running | Automatic memory defragmentation current aggressiveness (% CPU) |
-| redis_server_allocator_active | Total used memory, including external fragmentation |
-| redis_server_allocator_allocated | Total allocated memory |
-| redis_server_allocator_resident | Total resident memory (RSS) |
-| redis_server_aof_last_cow_size | Last AOF rewrite (AOFR) CopyOnWrite memory |
-| redis_server_aof_rewrite_in_progress | The number of simultaneous AOF rewrites that are in progress |
-| redis_server_aof_rewrites | Number of AOF rewrites this process executed |
-| redis_server_aof_delayed_fsync | Number of times an AOF fsync caused delays in the main Redis thread (inducing latency); this can indicate that the disk is slow or overloaded |
-| redis_server_blocked_clients | Number of clients waiting on a blocking call |
-| redis_server_connected_clients | Number of client connections to the specific shard |
-| redis_server_connected_slaves | Number of connected replicas |
-| redis_server_db0_avg_ttl | Average TTL of all volatile keys |
-| redis_server_expired_keys | Total count of volatile keys |
-| redis_server_db0_keys | Total key count |
-| redis_server_evicted_keys | Keys evicted so far (since restart) |
-| redis_server_expire_cycle_cpu_milliseconds | The cumulative amount of time spent on active expiry cycles |
-| redis_server_expired_keys | Keys expired so far (since restart) |
-| redis_server_forwarding_state | Shard forwarding state (on or off) |
-| redis_server_keys_trimmed | The number of keys that were trimmed in the current or last resharding process |
-| redis_server_keyspace_read_hits | Number of read operations accessing an existing keyspace |
-| redis_server_keyspace_read_misses | Number of read operations accessing a non-existing keyspace |
-| redis_server_keyspace_write_hits | Number of write operations accessing an existing keyspace |
-| redis_server_keyspace_write_misses | Number of write operations accessing a non-existing keyspace |
-| redis_server_master_link_status | Indicates if the replica is connected to its master |
-| redis_server_master_repl_offset | Number of bytes sent to replicas by the shard; calculate the throughput for a time period by comparing the value at different times |
-| redis_server_master_sync_in_progress | The primary shard is synchronizing (1 true; 0 false) |
-| redis_server_max_process_mem | Current memory limit configured by redis_mgr according to node free memory |
-| redis_server_maxmemory | Current memory limit configured by redis_mgr according to database memory limits |
-| redis_server_mem_aof_buffer | Current size of AOF buffer |
-| redis_server_mem_clients_normal | Current memory used for input and output buffers of non-replica clients |
-| redis_server_mem_clients_slaves | Current memory used for input and output buffers of replica clients |
-| redis_server_mem_fragmentation_ratio | Memory fragmentation ratio (1.3 means 30% overhead) |
-| redis_server_mem_not_counted_for_evict | Portion of used_memory (in bytes) that's not counted for eviction and OOM errors |
-| redis_server_mem_replication_backlog | Size of the replication backlog |
-| redis_server_module_fork_in_progress | A binary value that indicates if there is an active fork spawned by a module (1) or not (0) |
-| namedprocess_namegroup_cpu_seconds_total | Shard process CPU usage in seconds |
-| namedprocess_namegroup_thread_cpu_seconds_total | Shard main thread CPU time spent in seconds |
-| namedprocess_namegroup_open_filedesc | Number of open file descriptors for the shard |
-| namedprocess_namegroup_memory_bytes | Shard memory size in bytes |
-| namedprocess_namegroup_oldest_start_time_seconds | Shard process start time, in seconds since the Unix epoch |
-| redis_server_rdb_bgsave_in_progress | Indicates whether a bgsave is currently in progress |
-| redis_server_rdb_last_cow_size | Last bgsave (or SYNC fork) used CopyOnWrite memory |
-| redis_server_rdb_saves | Total count of bgsaves since the process was restarted (including replica fullsync and persistence) |
-| redis_server_repl_touch_bytes | Number of bytes sent to replicas as TOUCH commands by the shard as a result of a READ command that was processed; calculate the throughput for a time period by comparing the value at different times |
-| redis_server_total_commands_processed | Number of commands processed by the shard; calculate the number of commands for a time period by comparing the value at different times |
-| redis_server_total_connections_received | Number of connections received by the shard; calculate the number of connections for a time period by comparing the value at different times |
-| redis_server_total_net_input_bytes | Number of bytes received by the shard; calculate the throughput for a time period by comparing the value at different times |
-| redis_server_total_net_output_bytes | Number of bytes sent by the shard; calculate the throughput for a time period by comparing the value at different times |
-| redis_server_up | Shard is up and running |
-| redis_server_used_memory | Memory used by the shard (bytes); in BigRedis this includes flash |
-| redis_server_search_number_of_indexes | Total number of indexes in the shard [1](#tnote-1) |
-| redis_server_search_number_of_active_indexes | The total number of indexes running a background indexing and/or background query processing operation. Background indexing refers to the vector ingestion process or an in-progress background indexer. [1](#tnote-1) |
-| redis_server_search_number_of_active_indexes_running_queries | Total count of indexes currently running a background query process. [1](#tnote-1) |
-| redis_server_search_number_of_active_indexes_indexing | Total count of indexes currently undergoing a background indexing process. Background indexing refers to the vector ingestion process or an in-progress background indexer. This metric is limited by the number of WORKER threads allocated for writing operations plus the number of indexes. [1](#tnote-1) |
-| redis_server_search_total_active_write_threads | Total count of background write (indexing) processes currently running in the shard. Background indexing refers to the vector ingestion process or an in-progress background indexer. This metric is limited by the number of threads allocated for writing operations. [1](#tnote-1) |
-| redis_server_search_fields_text_Text | The total number of `TEXT` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_text_Sortable | The total number of `SORTABLE TEXT` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_text_NoIndex | The total number of `NOINDEX TEXT` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_numeric_Numeric | The total number of `NUMERIC` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_numeric_Sortable | The total number of `SORTABLE NUMERIC` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_numeric_NoIndex | The total number of `NOINDEX NUMERIC` fields across all indexes in the shard, which are used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_tag_Tag | The total number of `TAG` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_tag_Sortable | The total number of `SORTABLE TAG` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_tag_NoIndex | The total number of `NOINDEX TAG` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_tag_CaseSensitive | The total number of `CASESENSITIVE TAG` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_geo_Geo | The total number of `GEO` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_geo_Sortable | The total number of `SORTABLE GEO` fields across all indexes in the shard. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_geo_NoIndex | The total number of `NOINDEX GEO` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [1](#tnote-1) |
-| redis_server_search_fields_vector_Vector | The total number of `VECTOR` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_vector_Flat | The total number of `FLAT VECTOR` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_vector_HNSW | The total number of `HNSW VECTOR` fields across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_fields_geoshape_Geoshape | The total number of `GEOSHAPE` fields across all indexes in the shard. [2](#tnote-2) |
-| redis_server_search_fields_geoshape_Sortable | The total number of `SORTABLE GEOSHAPE` fields across all indexes in the shard. This field appears only if its value is larger than 0. [2](#tnote-2) |
-| redis_server_search_fields_geoshape_NoIndex | The total number of `NOINDEX GEOSHAPE` fields across all indexes in the shard; i.e., used for sorting only but not indexed. This field appears only if its value is larger than 0. [2](#tnote-2) |
-| redis_server_search_fields__IndexErrors | The total number of indexing failures caused by attempts to index a document containing `` field. [1](#tnote-1) |
-| redis_server_search_used_memory_indexes | The total memory allocated by all indexes in the shard, in bytes. [1](#tnote-1) |
-| redis_server_search_smallest_memory_index | The memory usage of the index with the smallest memory usage in the shard, in bytes. [1](#tnote-1) |
-| redis_server_search_largest_memory_index | The memory usage of the index with the largest memory usage in the shard, in bytes. [1](#tnote-1) |
-| redis_server_search_total_indexing_time | The total time spent on indexing operations, excluding the background indexing of vectors in the `HNSW` graph. [1](#tnote-1) |
-| redis_server_search_used_memory_vector_index | The total memory usage of all vector indexes in the shard. [1](#tnote-1) |
-| redis_server_search_global_idle | The total number of user and internal cursors currently holding pending results in the shard. [1](#tnote-1) |
-| redis_server_search_global_total | The total number of user and internal cursors in the shard, either holding pending results or actively executing `FT.CURSOR READ`. [1](#tnote-1) |
-| redis_server_search_bytes_collected | The total amount of memory freed by the garbage collectors from indexes in the shard memory, in bytes. [1](#tnote-1) |
-| redis_server_search_total_cycles | The total number of garbage collection cycles executed. [1](#tnote-1) |
-| redis_server_search_total_ms_run | The total duration of all garbage collection cycles in the shard, measured in milliseconds. [1](#tnote-1) |
-| redis_server_search_total_docs_not_collected_by_gc | The number of documents marked as deleted whose memory has not yet been freed by the garbage collector. [1](#tnote-1) |
-| redis_server_search_marked_deleted_vectors | The number of vectors marked as deleted in the vector indexes that have not yet been cleaned. [1](#tnote-1) |
-| redis_server_search_total_queries_processed | The total number of successful query executions in the shard (when using cursors, this does not count reads from existing cursors). [1](#tnote-1) |
-| redis_server_search_total_query_commands | The total number of successful query command executions (including `FT.SEARCH`, `FT.AGGREGATE`, and `FT.CURSOR READ`). [1](#tnote-1) |
-| redis_server_search_total_query_execution_time_ms | The cumulative execution time of all query commands, including `FT.SEARCH`, `FT.AGGREGATE`, and `FT.CURSOR READ`, measured in milliseconds. [1](#tnote-1) |
-| redis_server_search_total_active_queries | The total number of background queries currently being executed in the shard, excluding `FT.CURSOR READ`. [1](#tnote-1) |
-| redis_server_search_errors_indexing_failures | The total number of indexing failures recorded across all indexes in the shard. [1](#tnote-1) |
-| redis_server_search_errors_for_index_with_max_failures | The number of indexing failures in the index with the highest count of failures. [1](#tnote-1) |
-
-1. Available since RediSearch 2.6.
-2. Available since RediSearch 2.8.
\ No newline at end of file
+{{}}
diff --git a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1-to-v2.md b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1-to-v2.md
index f5e16348b3..3c50929277 100644
--- a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1-to-v2.md
+++ b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1-to-v2.md
@@ -11,269 +11,11 @@ linkTitle: Transition from Prometheus v1 to v2
 summary: Transition from v1 metrics to v2 PromQL equivalents.
 type: integration
 weight: 49
+tocEmbedHeaders: true
 ---
 
 You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics.
 
 As of Redis Enterprise Software version 7.8.2, [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/) metrics are available. V1 metrics are deprecated but still available. You can use the following tables to transition from v1 metrics to equivalent v2 PromQL. For a list of all available v2 PromQL metrics, see [Prometheus metrics v2]({{}}).
 
-## Database metrics
-
-| V1 metric | Equivalent V2 PromQL | Description |
-| --------- | :------------------- | :---------- |
-| bdb_avg_latency | `sum by (db) (irate(endpoint_acc_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Average latency of operations on the database (seconds); returned only when there is traffic |
-| bdb_avg_latency_max | `sum by (db) (irate(endpoint_acc_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Highest value of average latency of operations on the database (seconds); returned only when there is traffic |
-| bdb_avg_read_latency | `sum by (db) (irate(endpoint_acc_read_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Average latency of read operations (seconds); returned only when there is traffic |
-| bdb_avg_read_latency_max | `sum by (db) (irate(endpoint_acc_read_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Highest value of average latency of read operations (seconds); returned only when there is traffic |
-| bdb_avg_write_latency | `sum by (db) (irate(endpoint_acc_write_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Average latency of write operations (seconds); returned only when there is traffic |
-| bdb_avg_write_latency_max | `sum by (db) (irate(endpoint_acc_write_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` | Highest value of average latency of write operations (seconds); returned only when there is traffic |
-| bdb_bigstore_shard_count | `sum((sum(label_replace(label_replace(namedprocess_namegroup_thread_count{groupname=~"redis-\d+", threadname=~"(speedb\|rocksdb).*"}, "redis", "$1", "groupname", "redis-(\d+)"), "driver", "$1", "threadname", "(speedb\|rocksdb).*")) by (redis, driver) > bool 0) * on (redis) group_left(db) redis_server_up) by (db, driver)` | Shard count by database and by storage engine (driver - rocksdb / speedb); only for databases with Auto Tiering enabled |
-| bdb_conns | `sum by(db) (endpoint_client_connections)` | Number of client connections to database |
-| bdb_egress_bytes | `sum by(db) (irate(endpoint_egress_bytes[1m]))` | Rate of outgoing network traffic from the database (bytes/sec) |
-| bdb_egress_bytes_max | `sum by(db) (irate(endpoint_egress_bytes[1m]))` | Highest value of the rate of outgoing network traffic from the database (bytes/sec) |
-| bdb_evicted_objects | `sum by (db) (irate(redis_server_evicted_keys{role="master"}[1m]))` | Rate of key evictions from database (evictions/sec) |
-| bdb_evicted_objects_max | `sum by (db) (irate(redis_server_evicted_keys{role="master"}[1m]))` | Highest value of the rate of key evictions from database (evictions/sec) |
-| bdb_expired_objects | `sum by (db) (irate(redis_server_expired_keys{role="master"}[1m]))` | Rate of keys expired in the database (expirations/sec) |
-| bdb_expired_objects_max | `sum by (db) (irate(redis_server_expired_keys{role="master"}[1m]))` | Highest value of the rate of keys expired in the database (expirations/sec) |
-| bdb_fork_cpu_system | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system"}[1m]))` | % cores utilization in system mode for all Redis shard fork child processes of this database |
-| bdb_fork_cpu_system_max | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system"}[1m]))` | Highest value of % cores utilization in system mode for all Redis shard fork child processes of this database |
-| bdb_fork_cpu_user | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user"}[1m]))` | % cores utilization in user mode for all Redis shard fork child processes of this database |
-| bdb_fork_cpu_user_max | `sum by (db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user"}[1m]))` | Highest value of % cores utilization in user mode for all Redis shard fork child processes of this database |
-| bdb_ingress_bytes | `sum by(db) (irate(endpoint_ingress_bytes[1m]))` | Rate of incoming network traffic to database (bytes/sec) |
-| bdb_ingress_bytes_max | `sum by(db) (irate(endpoint_ingress_bytes[1m]))` | Highest value of the rate of incoming network traffic to database (bytes/sec) |
-| bdb_instantaneous_ops_per_sec | `sum by(db) (redis_server_instantaneous_ops_per_sec)` | Request rate handled by all shards of database (ops/sec) |
-| bdb_main_thread_cpu_system | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", threadname=~"redis-server.*"}[1m]))` | % cores utilization in system mode for all Redis shard main threads of this database |
-| bdb_main_thread_cpu_system_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", threadname=~"redis-server.*"}[1m]))` | Highest value of % cores utilization in system mode for all Redis shard main threads of this database |
-| bdb_main_thread_cpu_user | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", threadname=~"redis-server.*"}[1m]))` | % cores utilization in user mode for all Redis shard main threads of this database |
-| bdb_main_thread_cpu_user_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", threadname=~"redis-server.*"}[1m]))` | Highest value of % cores utilization in user mode for all Redis shard main threads of this database |
-| bdb_mem_frag_ratio | `avg(redis_server_mem_fragmentation_ratio)` | RAM fragmentation ratio (RSS / allocated RAM) |
-| bdb_mem_size_lua | `sum by(db) (redis_server_used_memory_lua)` | Redis Lua scripting heap size (bytes) |
-| bdb_memory_limit | `sum by(db) (redis_server_maxmemory)` | Configured RAM limit for the database |
-| bdb_monitor_sessions_count | `sum by(db) (endpoint_monitor_sessions_count)` | Number of clients connected in monitor mode to the database |
-| bdb_no_of_keys | `sum by (db) (redis_server_db_keys{role="master"})` | Number of keys in database |
-| bdb_other_req | `sum by(db) (irate(endpoint_other_req[1m]))` | Rate of other (non read/write) requests on the database (ops/sec) |
-| bdb_other_req_max | `sum by(db) (irate(endpoint_other_req[1m]))` | Highest value of the rate of other (non read/write) requests on the database (ops/sec) |
-| bdb_other_res | `sum by(db) (irate(endpoint_other_res[1m]))` | Rate of other (non read/write) responses on the database (ops/sec) |
-| bdb_other_res_max | `sum by(db) (irate(endpoint_other_res[1m]))` | Highest value of the rate of other (non read/write) responses on the database (ops/sec) |
-| bdb_pubsub_channels | `sum by(db) (redis_server_pubsub_channels)` | Number of pub/sub channels with subscribed clients |
-| bdb_pubsub_channels_max | `sum by(db) (redis_server_pubsub_channels)` | Highest number of pub/sub channels with subscribed clients |
-| bdb_pubsub_patterns | `sum by(db) (redis_server_pubsub_patterns)` | Number of pub/sub patterns with subscribed clients |
-| bdb_pubsub_patterns_max | `sum by(db) (redis_server_pubsub_patterns)` | Highest number of pub/sub patterns with subscribed clients |
-| bdb_read_hits | `sum by (db) (irate(redis_server_keyspace_read_hits{role="master"}[1m]))` | Rate of read operations accessing an existing key (ops/sec) |
-| bdb_read_hits_max | `sum by (db) (irate(redis_server_keyspace_read_hits{role="master"}[1m]))` | Highest value of the rate of read operations accessing an existing key (ops/sec) |
-| bdb_read_misses | `sum by (db) (irate(redis_server_keyspace_read_misses{role="master"}[1m]))` | Rate of read operations accessing a non-existing key (ops/sec) |
-| bdb_read_misses_max | `sum by (db) (irate(redis_server_keyspace_read_misses{role="master"}[1m]))` | Highest value of the rate of read operations accessing a non-existing key (ops/sec) |
-| bdb_read_req | `sum by (db) (irate(endpoint_read_req[1m]))` | Rate of read requests on the database (ops/sec) |
-| bdb_read_req_max | `sum by (db) (irate(endpoint_read_req[1m]))` | Highest value of the rate of read requests on the database (ops/sec) |
-| bdb_read_res | `sum by(db) (irate(endpoint_read_res[1m]))` | Rate of read responses on the database (ops/sec) |
-| bdb_read_res_max | `sum by(db) (irate(endpoint_read_res[1m]))` | Highest value of the rate of read responses on the database (ops/sec) |
-| bdb_shard_cpu_system | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", role="master"}[1m]))` | % cores utilization in system mode for all Redis shard processes of this database |
-| bdb_shard_cpu_system_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="system", role="master"}[1m]))` | Highest value of % cores utilization in system mode for all Redis shard processes of this database |
-| bdb_shard_cpu_user | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", role="master"}[1m]))` | % cores utilization in user mode for the Redis shard process |
-| bdb_shard_cpu_user_max | `sum by(db) (irate(namedprocess_namegroup_thread_cpu_seconds_total{mode="user", role="master"}[1m]))` | Highest value of % cores utilization in user mode for the Redis shard process |
-| bdb_shards_used | `sum((sum(label_replace(label_replace(label_replace(namedprocess_namegroup_thread_count{groupname=~"redis-\d+"}, "redis", "$1", "groupname", "redis-(\d+)"), "shard_type", "flash", "threadname", "(bigstore).*"), "shard_type", "ram", "shard_type", "")) by (redis, shard_type) > bool 0) * on (redis) group_left(db) redis_server_up) by (db, shard_type)` | Used shard count by database and by shard type (ram / flash) |
-| bdb_total_connections_received | `sum by(db) (irate(endpoint_total_connections_received[1m]))` | Rate of new client connections to database (connections/sec) |
-| bdb_total_connections_received_max | `sum by(db) (irate(endpoint_total_connections_received[1m]))` | Highest value of the rate of new client connections to database (connections/sec) |
-| bdb_total_req | `sum by (db) (irate(endpoint_total_req[1m]))` | Rate of all requests on the database (ops/sec) |
-| bdb_total_req_max | `sum by (db) (irate(endpoint_total_req[1m]))` | Highest value of the rate of all requests on the database (ops/sec) |
-| bdb_total_res | `sum by(db) (irate(endpoint_total_res[1m]))` | Rate of all responses on the database (ops/sec) |
-| bdb_total_res_max | `sum by(db) (irate(endpoint_total_res[1m]))` | Highest value of the rate of all responses on the database (ops/sec) |
-| bdb_up | `min by(db) (redis_up)` | Database is up and running |
-| bdb_used_memory | `sum by (db) (redis_server_used_memory)` | Memory used by database (in BigRedis this includes flash) (bytes) |
-| bdb_write_hits | `sum by (db) (irate(redis_server_keyspace_write_hits{role="master"}[1m]))` | Rate of write operations accessing an existing key (ops/sec) |
-| bdb_write_hits_max | `sum by (db) (irate(redis_server_keyspace_write_hits{role="master"}[1m]))` | Highest value of the rate of write operations accessing an existing key (ops/sec) |
-| bdb_write_misses | `sum by (db) (irate(redis_server_keyspace_write_misses{role="master"}[1m]))` | Rate of write operations accessing a non-existing key (ops/sec) |
-| bdb_write_misses_max | `sum by (db) (irate(redis_server_keyspace_write_misses{role="master"}[1m]))` | Highest value of the rate of write operations accessing a non-existing key (ops/sec) |
-| bdb_write_req | `sum by (db) (irate(endpoint_write_requests[1m]))` | Rate of write requests on the database (ops/sec) |
-| bdb_write_req_max | `sum by (db) (irate(endpoint_write_requests[1m]))` | Highest value of the rate of write requests on the database (ops/sec) |
-| bdb_write_res | `sum by(db) (irate(endpoint_write_responses[1m]))` | Rate of write responses on the database (ops/sec) |
-| bdb_write_res_max | `sum by(db) (irate(endpoint_write_responses[1m]))` | Highest value of the rate of write responses on the database (ops/sec) |
-| no_of_expires | `sum by(db) (redis_server_db_expires{role="master"})` | Current number of volatile keys in the database |
-
-## Node metrics
-
-| V1 metric | Equivalent V2 PromQL | Description |
-| --------- | :------------------- | :---------- |
-| node_available_flash | `node_available_flash_bytes` | Available flash in the node (bytes) |
-| node_available_flash_no_overbooking | `node_available_flash_no_overbooking_bytes` | Available flash in the node (bytes), without taking into account overbooking |
-| node_available_memory | `node_available_memory_bytes` | Amount of free memory in the node (bytes) that is available for database provisioning |
-| node_available_memory_no_overbooking | `node_available_memory_no_overbooking_bytes` | Available RAM in the node (bytes), without taking into account overbooking |
-| node_avg_latency | `sum by (proxy) (irate(endpoint_acc_latency[1m])) / sum by (proxy) (irate(endpoint_total_started_res[1m]))` | Average latency of requests handled by endpoints on the node in milliseconds; returned only when there is traffic |
-| node_bigstore_free | `node_bigstore_free_bytes` | Sum of free space of back-end flash (used by flash database's [BigRedis]) on all cluster nodes (bytes); returned only when BigRedis is enabled |
-| node_bigstore_iops | `node_flash_reads_total + node_flash_writes_total` | Rate of I/O operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled |
-| node_bigstore_kv_ops | `sum by (node) (irate(redis_server_big_io_dels[1m]) + irate(redis_server_big_io_reads[1m]) + irate(redis_server_big_io_writes[1m]))` | Rate of value read/write operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled |
-| node_bigstore_throughput | `sum by (node) (irate(redis_server_big_io_read_bytes[1m]) + irate(redis_server_big_io_write_bytes[1m]))` | Throughput of I/O operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (bytes/sec); returned only when BigRedis is enabled |
-| node_cert_expiration_seconds | `node_cert_expires_in_seconds` | Certificate expiration (in seconds) per given node; read more about [certificates in Redis Enterprise]({{< relref "/operate/rs/security/certificates" >}}) and [monitoring certificates]({{< relref "/operate/rs/security/certificates/monitor-certificates" >}}) |
-| node_conns | `sum by (node) (endpoint_client_connections)` | Number of clients connected to endpoints on node |
-| node_cpu_idle | `avg by (node) (irate(node_cpu_seconds_total{mode="idle"}[1m]))` | CPU idle time portion (0-1, multiply by 100 to get percent) |
-| node_cpu_idle_max | N/A | Highest value of CPU idle time portion (0-1, multiply by 100 to get percent) |
-| node_cpu_idle_median | N/A | Average value of CPU idle time portion (0-1, multiply by 100 to get percent) |
-| node_cpu_idle_min | N/A | Lowest value of CPU idle time portion (0-1, multiply by 100 to get percent) |
-| node_cpu_system | `avg by (node) (irate(node_cpu_seconds_total{mode="system"}[1m]))` | CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
-| node_cpu_system_max | N/A | Highest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
-| node_cpu_system_median | N/A | Average value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
-| node_cpu_system_min | N/A | Lowest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
-| node_cpu_user | `avg by (node) (irate(node_cpu_seconds_total{mode="user"}[1m]))` | CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
-| node_cpu_user_max | N/A | Highest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
-| node_cpu_user_median | N/A | Average value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
-| node_cpu_user_min | N/A | Lowest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
-| node_cur_aof_rewrites | `sum by (cluster, node) (redis_server_aof_rewrite_in_progress)` | Number of AOF rewrites that are currently performed by shards on this node |
-| node_egress_bytes | `irate(node_network_transmit_bytes_total{device=""}[1m])` | Rate of outgoing network traffic to node (bytes/sec) |
-| node_egress_bytes_max | N/A | Highest value of the rate of outgoing network traffic to node (bytes/sec) |
-| node_egress_bytes_median | N/A | Average value of the rate of outgoing network traffic to node (bytes/sec) |
-| node_egress_bytes_min | N/A | Lowest value of the rate of outgoing network traffic to node
(bytes/sec) | -| node_ephemeral_storage_avail | `node_ephemeral_storage_avail_bytes` | Disk space available to RLEC processes on configured ephemeral disk (bytes) | -| node_ephemeral_storage_free | `node_ephemeral_storage_free_bytes` | Free disk space on configured ephemeral disk (bytes) | -| node_free_memory | `node_memory_MemFree_bytes` | Free memory in the node (bytes) | -| node_ingress_bytes | `irate(node_network_receive_bytes_total{device=""}[1m])` | Rate of incoming network traffic to node (bytes/sec) | -| node_ingress_bytes_max | N/A | Highest value of the rate of incoming network traffic to node (bytes/sec) | -| node_ingress_bytes_median | N/A | Average value of the rate of incoming network traffic to node (bytes/sec) | -| node_ingress_bytes_min | N/A | Lowest value of the rate of incoming network traffic to node (bytes/sec) | -| node_persistent_storage_avail | `node_persistent_storage_avail_bytes` | Disk space available to RLEC processes on configured persistent disk (bytes) | -| node_persistent_storage_free | `node_persistent_storage_free_bytes` | Free disk space on configured persistent disk (bytes) | -| node_provisional_flash | `node_provisional_flash_bytes` | Amount of flash available for new shards on this node, taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) | -| node_provisional_flash_no_overbooking | `node_provisional_flash_no_overbooking_bytes` | Amount of flash available for new shards on this node, without taking into account overbooking, max Redis servers, reserved flash, and provision and migration thresholds (bytes) | -| node_provisional_memory | `node_provisional_memory_bytes` | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases | -| node_provisional_memory_no_overbooking | `node_provisional_memory_no_overbooking_bytes` | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for 
databases, without taking into account overbooking | -| node_total_req | `sum by (cluster, node) (irate(endpoint_total_req[1m]))` | Request rate handled by endpoints on node (ops/sec) | -| node_up | `node_metrics_up` | Node is part of the cluster and is connected | - -## Cluster metrics - -| V1 metric | Equivalent V2 PromQL | Description | -| --------- | :------------------- | :---------- | -| cluster_shards_limit | `license_shards_limit` | Total shard limit by the license by shard type (ram / flash) | - -## Proxy metrics - -| V1 metric | Equivalent V2 PromQL | Description | -| --------- | :------------------- | :---------- | -| listener_acc_latency | N/A | Accumulative latency (sum of the latencies) of all types of commands on the database. For the average latency, divide this value by listener_total_res | -| listener_acc_latency_max | N/A | Highest value of accumulative latency of all types of commands on the database | -| listener_acc_other_latency | N/A | Accumulative latency (sum of the latencies) of commands that are a type "other" on the database. For the average latency, divide this value by listener_other_res | -| listener_acc_other_latency_max | N/A | Highest value of accumulative latency of commands that are a type "other" on the database | -| listener_acc_read_latency | N/A | Accumulative latency (sum of the latencies) of commands that are a type "read" on the database. For the average latency, divide this value by listener_read_res | -| listener_acc_read_latency_max | N/A | Highest value of accumulative latency of commands that are a type "read" on the database | -| listener_acc_write_latency | N/A | Accumulative latency (sum of the latencies) of commands that are a type "write" on the database. 
For the average latency, divide this value by listener_write_res | -| listener_acc_write_latency_max | N/A | Highest value of accumulative latency of commands that are a type "write" on the database | -| listener_auth_cmds | N/A | Number of memcached AUTH commands sent to the database | -| listener_auth_cmds_max | N/A | Highest value of the number of memcached AUTH commands sent to the database | -| listener_auth_errors | N/A | Number of error responses to memcached AUTH commands | -| listener_auth_errors_max | N/A | Highest value of the number of error responses to memcached AUTH commands | -| listener_cmd_flush | N/A | Number of memcached FLUSH_ALL commands sent to the database | -| listener_cmd_flush_max | N/A | Highest value of the number of memcached FLUSH_ALL commands sent to the database | -| listener_cmd_get | N/A | Number of memcached GET commands sent to the database | -| listener_cmd_get_max | N/A | Highest value of the number of memcached GET commands sent to the database | -| listener_cmd_set | N/A | Number of memcached SET commands sent to the database | -| listener_cmd_set_max | N/A | Highest value of the number of memcached SET commands sent to the database | -| listener_cmd_touch | N/A | Number of memcached TOUCH commands sent to the database | -| listener_cmd_touch_max | N/A | Highest value of the number of memcached TOUCH commands sent to the database | -| listener_conns | N/A | Number of clients connected to the endpoint | -| listener_egress_bytes | N/A | Rate of outgoing network traffic to the endpoint (bytes/sec) | -| listener_egress_bytes_max | N/A | Highest value of the rate of outgoing network traffic to the endpoint (bytes/sec) | -| listener_ingress_bytes | N/A | Rate of incoming network traffic to the endpoint (bytes/sec) | -| listener_ingress_bytes_max | N/A | Highest value of the rate of incoming network traffic to the endpoint (bytes/sec) | -| listener_last_req_time | N/A | Time of last command sent to the database | -| 
listener_last_res_time | N/A | Time of last response sent from the database | -| listener_max_connections_exceeded | `irate(endpoint_maximal_connections_exceeded[1m])` | Number of times the number of clients connected to the database at the same time has exceeded the max limit | -| listener_max_connections_exceeded_max | N/A | Highest value of the number of times the number of clients connected to the database at the same time has exceeded the max limit | -| listener_monitor_sessions_count | N/A | Number of clients connected in monitor mode to the endpoint | -| listener_other_req | N/A | Rate of other (non-read/write) requests on the endpoint (ops/sec) | -| listener_other_req_max | N/A | Highest value of the rate of other (non-read/write) requests on the endpoint (ops/sec) | -| listener_other_res | N/A | Rate of other (non-read/write) responses on the endpoint (ops/sec) | -| listener_other_res_max | N/A | Highest value of the rate of other (non-read/write) responses on the endpoint (ops/sec) | -| listener_other_started_res | N/A | Number of responses sent from the database of type "other" | -| listener_other_started_res_max | N/A | Highest value of the number of responses sent from the database of type "other" | -| listener_read_req | `irate(endpoint_read_requests[1m])` | Rate of read requests on the endpoint (ops/sec) | -| listener_read_req_max | N/A | Highest value of the rate of read requests on the endpoint (ops/sec) | -| listener_read_res | `irate(endpoint_read_responses[1m])` | Rate of read responses on the endpoint (ops/sec) | -| listener_read_res_max | N/A | Highest value of the rate of read responses on the endpoint (ops/sec) | -| listener_read_started_res | N/A | Number of responses sent from the database of type "read" | -| listener_read_started_res_max | N/A | Highest value of the number of responses sent from the database of type "read" | -| listener_total_connections_received | `irate(endpoint_total_connections_received[1m])` | Rate of new client 
connections to the endpoint (connections/sec) | -| listener_total_connections_received_max | N/A | Highest value of the rate of new client connections to the endpoint (connections/sec) | -| listener_total_req | N/A | Request rate handled by the endpoint (ops/sec) | -| listener_total_req_max | N/A | Highest value of the rate of all requests on the endpoint (ops/sec) | -| listener_total_res | N/A | Rate of all responses on the endpoint (ops/sec) | -| listener_total_res_max | N/A | Highest value of the rate of all responses on the endpoint (ops/sec) | -| listener_total_started_res | N/A | Number of responses sent from the database of all types | -| listener_total_started_res_max | N/A | Highest value of the number of responses sent from the database of all types | -| listener_write_req | `irate(endpoint_write_requests[1m])` | Rate of write requests on the endpoint (ops/sec) | -| listener_write_req_max | N/A | Highest value of the rate of write requests on the endpoint (ops/sec) | -| listener_write_res | `irate(endpoint_write_responses[1m])` | Rate of write responses on the endpoint (ops/sec) | -| listener_write_res_max | N/A | Highest value of the rate of write responses on the endpoint (ops/sec) | -| listener_write_started_res | N/A | Number of responses sent from the database of type "write" | -| listener_write_started_res_max | N/A | Highest value of the number of responses sent from the database of type "write" | - -## Replication metrics - -| V1 metric | Equivalent V2 PromQL | Description | -| --------- | :------------------- | :---------- | -| bdb_replicaof_syncer_ingress_bytes | `rate(replica_src_ingress_bytes[1m])` | Rate of compressed incoming network traffic to a Replica Of database (bytes/sec) | -| bdb_replicaof_syncer_ingress_bytes_decompressed | `rate(replica_src_ingress_bytes_decompressed[1m])` | Rate of decompressed incoming network traffic to a Replica Of database (bytes/sec) | -| bdb_replicaof_syncer_local_ingress_lag_time | 
`database_syncer_lag_ms{syncer_type="replicaof"}` | Lag time between the source and the destination for Replica Of traffic (ms) | -| bdb_replicaof_syncer_status | `database_syncer_current_status{syncer_type="replicaof"}` | Syncer status for Replica Of traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | -| bdb_crdt_syncer_ingress_bytes | `rate(crdt_src_ingress_bytes[1m])` | Rate of compressed incoming network traffic to CRDB (bytes/sec) | -| bdb_crdt_syncer_ingress_bytes_decompressed | `rate(crdt_src_ingress_bytes_decompressed[1m])` | Rate of decompressed incoming network traffic to CRDB (bytes/sec) | -| bdb_crdt_syncer_local_ingress_lag_time | `database_syncer_lag_ms{syncer_type="crdt"}` | Lag time between the source and the destination (ms) for CRDB traffic | -| bdb_crdt_syncer_status | `database_syncer_current_status{syncer_type="crdt"}` | Syncer status for CRDB traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | - -## Shard metrics - -| V1 metric | Equivalent V2 PromQL | Description | -| --------- | :------------------- | :---------- | -| redis_active_defrag_running | `redis_server_active_defrag_running` | Automatic memory defragmentation current aggressiveness (% cpu) | -| redis_allocator_active | `redis_server_allocator_active` | Total used memory, including external fragmentation | -| redis_allocator_allocated | `redis_server_allocator_allocated` | Total allocated memory | -| redis_allocator_resident | `redis_server_allocator_resident` | Total resident memory (RSS) | -| redis_aof_last_cow_size | `redis_server_aof_last_cow_size` | Last AOFR, CopyOnWrite memory | -| redis_aof_rewrite_in_progress | `redis_server_aof_rewrite_in_progress` | The number of simultaneous AOF rewrites that are in progress | -| redis_aof_rewrites | `redis_server_aof_rewrites` | Number of AOF rewrites this process executed | -| redis_aof_delayed_fsync | `redis_server_aof_delayed_fsync` | Number of times an AOF fsync caused delays in the main Redis thread (inducing latency); this can 
indicate that the disk is slow or overloaded | -| redis_blocked_clients | `redis_server_blocked_clients` | Count the clients waiting on a blocking call | -| redis_connected_clients | `redis_server_connected_clients` | Number of client connections to the specific shard | -| redis_connected_slaves | `redis_server_connected_slaves` | Number of connected replicas | -| redis_db0_avg_ttl | `redis_server_db0_avg_ttl` | Average TTL of all volatile keys | -| redis_db0_expires | `redis_server_expired_keys` | Total count of volatile keys | -| redis_db0_keys | `redis_server_db0_keys` | Total key count | -| redis_evicted_keys | `redis_server_evicted_keys` | Keys evicted so far (since restart) | -| redis_expire_cycle_cpu_milliseconds | `redis_server_expire_cycle_cpu_milliseconds` | The cumulative amount of time spent on active expiry cycles | -| redis_expired_keys | `redis_server_expired_keys` | Keys expired so far (since restart) | -| redis_forwarding_state | `redis_server_forwarding_state` | Shard forwarding state (on or off) | -| redis_keys_trimmed | `redis_server_keys_trimmed` | The number of keys that were trimmed in the current or last resharding process | -| redis_keyspace_read_hits | `redis_server_keyspace_read_hits` | Number of read operations accessing an existing keyspace | -| redis_keyspace_read_misses | `redis_server_keyspace_read_misses` | Number of read operations accessing a non-existing keyspace | -| redis_keyspace_write_hits | `redis_server_keyspace_write_hits` | Number of write operations accessing an existing keyspace | -| redis_keyspace_write_misses | `redis_server_keyspace_write_misses` | Number of write operations accessing a non-existing keyspace | -| redis_master_link_status | `redis_server_master_link_status` | Indicates if the replica is connected to its master | -| redis_master_repl_offset | `redis_server_master_repl_offset` | Number of bytes sent to replicas by the shard; calculate the throughput for a time period by comparing the value at different 
times | -| redis_master_sync_in_progress | `redis_server_master_sync_in_progress` | The master shard is synchronizing (1 true; 0 false) | -| redis_max_process_mem | `redis_server_max_process_mem` | Current memory limit configured by redis_mgr according to node free memory | -| redis_maxmemory | `redis_server_maxmemory` | Current memory limit configured by redis_mgr according to database memory limits | -| redis_mem_aof_buffer | `redis_server_mem_aof_buffer` | Current size of AOF buffer | -| redis_mem_clients_normal | `redis_server_mem_clients_normal` | Current memory used for input and output buffers of non-replica clients | -| redis_mem_clients_slaves | `redis_server_mem_clients_slaves` | Current memory used for input and output buffers of replica clients | -| redis_mem_fragmentation_ratio | `redis_server_mem_fragmentation_ratio` | Memory fragmentation ratio (1.3 means 30% overhead) | -| redis_mem_not_counted_for_evict | `redis_server_mem_not_counted_for_evict` | Portion of used_memory (in bytes) that's not counted for eviction and OOM error | -| redis_mem_replication_backlog | `redis_server_mem_replication_backlog` | Size of replication backlog | -| redis_module_fork_in_progress | `redis_server_module_fork_in_progress` | A binary value that indicates if there is an active fork spawned by a module (1) or not (0) | -| redis_process_cpu_system_seconds_total | `namedprocess_namegroup_cpu_seconds_total{mode="system"}` | Shard process system CPU time spent in seconds | -| redis_process_cpu_usage_percent | `namedprocess_namegroup_cpu_seconds_total{mode=~"system\|user"}` | Shard process CPU usage percentage | -| redis_process_cpu_user_seconds_total | `namedprocess_namegroup_cpu_seconds_total{mode="user"}` | Shard user CPU time spent in seconds | -| redis_process_main_thread_cpu_system_seconds_total | `namedprocess_namegroup_thread_cpu_seconds_total{mode="system",threadname="redis-server"}` | Shard main thread system CPU time spent in seconds | -| 
redis_process_main_thread_cpu_user_seconds_total | `namedprocess_namegroup_thread_cpu_seconds_total{mode="user",threadname="redis-server"}` | Shard main thread user CPU time spent in seconds | -| redis_process_max_fds | `max(namedprocess_namegroup_open_filedesc)` | Shard maximum number of open file descriptors | -| redis_process_open_fds | `namedprocess_namegroup_open_filedesc` | Shard number of open file descriptors | -| redis_process_resident_memory_bytes | `namedprocess_namegroup_memory_bytes{memtype="resident"}` | Shard resident memory size in bytes | -| redis_process_start_time_seconds | `namedprocess_namegroup_oldest_start_time_seconds` | Shard start time of the process since unix epoch in seconds | -| redis_process_virtual_memory_bytes | `namedprocess_namegroup_memory_bytes{memtype="virtual"}` | Shard virtual memory in bytes | -| redis_rdb_bgsave_in_progress | `redis_server_rdb_bgsave_in_progress` | Indication if bgsave is currently in progress | -| redis_rdb_last_cow_size | `redis_server_rdb_last_cow_size` | Last bgsave (or SYNC fork) used CopyOnWrite memory | -| redis_rdb_saves | `redis_server_rdb_saves` | Total count of bgsaves since the process was restarted (including replica fullsync and persistence) | -| redis_repl_touch_bytes | `redis_server_repl_touch_bytes` | Number of bytes sent to replicas as TOUCH commands by the shard as a result of a READ command that was processed; calculate the throughput for a time period by comparing the value at different times | -| redis_total_commands_processed | `redis_server_total_commands_processed` | Number of commands processed by the shard; calculate the number of commands for a time period by comparing the value at different times | -| redis_total_connections_received | `redis_server_total_connections_received` | Number of connections received by the shard; calculate the number of connections for a time period by comparing the value at different times | -| redis_total_net_input_bytes | 
`redis_server_total_net_input_bytes` | Number of bytes received by the shard; calculate the throughput for a time period by comparing the value at different times | -| redis_total_net_output_bytes | `redis_server_total_net_output_bytes` | Number of bytes sent by the shard; calculate the throughput for a time period by comparing the value at different times | -| redis_up | `redis_server_up` | Shard is up and running | -| redis_used_memory | `redis_server_used_memory` | Memory used by shard (in BigRedis this includes flash) (bytes) | +{{}} diff --git a/content/operate/rs/clusters/monitoring/_index.md b/content/operate/rs/monitoring/_index.md similarity index 98% rename from content/operate/rs/clusters/monitoring/_index.md rename to content/operate/rs/monitoring/_index.md index becff8195e..0a9cc7320d 100644 --- a/content/operate/rs/clusters/monitoring/_index.md +++ b/content/operate/rs/monitoring/_index.md @@ -9,8 +9,10 @@ categories: description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases. hideListLinks: true linkTitle: Monitoring -weight: 96 +weight: 70 +aliases: /operate/rs/clusters/monitoring/ --- + You can use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to monitor the performance of your databases. In the Redis Enterprise Cluster Manager UI, you can view metrics, configure alerts, and send notifications based on alert parameters. You can also access metrics and configure alerts through the REST API. 
diff --git a/content/operate/rs/monitoring/metrics_stream_engine/_index.md b/content/operate/rs/monitoring/metrics_stream_engine/_index.md new file mode 100644 index 0000000000..7ea1e2cf41 --- /dev/null +++ b/content/operate/rs/monitoring/metrics_stream_engine/_index.md @@ -0,0 +1,15 @@ +--- +Title: Metrics stream engine preview +alwaysopen: false +categories: +- docs +- operate +- rs +- kubernetes +description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases. +hideListLinks: true +linkTitle: Metrics stream engine - v2 monitoring preview +weight: 60 +--- + +TBA diff --git a/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md b/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md new file mode 100644 index 0000000000..3c50929277 --- /dev/null +++ b/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md @@ -0,0 +1,21 @@ +--- +Title: Transition from Prometheus v1 to Prometheus v2 +alwaysopen: false +categories: +- docs +- integrate +- rs +description: Transition from v1 metrics to v2 PromQL equivalents. +group: observability +linkTitle: Transition from Prometheus v1 to v2 +summary: Transition from v1 metrics to v2 PromQL equivalents. +type: integration +weight: 49 +tocEmbedHeaders: true +--- + +You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. + +As of Redis Enterprise Software version 7.8.2, [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/) metrics are available. V1 metrics are deprecated but still available. You can use the following tables to transition from v1 metrics to equivalent v2 PromQL. For a list of all available v2 PromQL metrics, see [Prometheus metrics v2]({{}}). 
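For example, the database metrics table maps the deprecated `bdb_avg_latency` gauge to a rate-based PromQL calculation over the v2 endpoint series:

```promql
# v1: bdb_avg_latency (average operation latency per database, in seconds).
# v2 equivalent: accumulated endpoint latency divided by completed responses,
# converted from microseconds to seconds. Returns data only under traffic.
sum by (db) (irate(endpoint_acc_latency[1m]))
  / sum by (db) (irate(endpoint_total_started_res[1m]))
  / 1000000
```

Most other v1 gauges follow the same pattern: counters from the v2 endpoint are turned into rates with `irate()`, aggregated by the relevant label (`db`, `node`, or `proxy`), and scaled to the v1 metric's unit.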
+ +{{}} diff --git a/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md b/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md new file mode 100644 index 0000000000..91d02ba868 --- /dev/null +++ b/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md @@ -0,0 +1,25 @@ +--- +Title: Prometheus metrics v2 preview +alwaysopen: false +categories: +- docs +- integrate +- rs +description: V2 metrics available to Prometheus as of Redis Enterprise Software version 7.8.2. +group: observability +linkTitle: Prometheus metrics v2 +summary: V2 metrics available to Prometheus as of Redis Enterprise Software version 7.8.2. +type: integration +weight: 50 +tocEmbedHeaders: true +--- + +{{}} +While the metrics stream engine is in preview, this document provides only a partial list of v2 metrics. More metrics will be added. +{{}} + +You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. + +The v2 metrics in the following tables are available as of Redis Enterprise Software version 7.8.2. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). 
+type: integration +weight: 5 +tocEmbedHeaders: true +--- + +You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. + +Metrics are exposed at the cluster, node, database, shard, and proxy levels. + + +- [Prometheus](https://prometheus.io/) is an open source systems monitoring and alerting toolkit that aggregates metrics from different sources. +- [Grafana](https://grafana.com/) is an open source metrics visualization tool that processes Prometheus data. + +You can use Prometheus and Grafana to: +- Collect and display metrics not available in the [admin console]({{< relref "/operate/rs/references/metrics" >}}) + +- Set up automatic alerts for node or cluster events + +- Display Redis Enterprise Software metrics alongside data from other systems + +{{Graphic showing how Prometheus and Grafana collect and display data from a Redis Enterprise Cluster. Prometheus collects metrics from the Redis Enterprise cluster, and Grafana queries those metrics for visualization.}} + +In each cluster, the metrics_exporter process exposes Prometheus metrics on port 8070. +Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. + +## Quick start + +To get started with Prometheus and Grafana: + +1. Create a directory called 'prometheus' on your local machine. + +1. Within that directory, create a configuration file called `prometheus.yml`. +1. Add the following contents to the configuration file and replace `` with your Redis Enterprise cluster's FQDN: + + {{< note >}} + +We recommend running Prometheus in Docker only for development and testing. + + {{< /note >}} + + ```yml + global: + scrape_interval: 15s + evaluation_interval: 15s + + # Attach these labels to any time series or alerts when communicating with + # external systems (federation, remote storage, Alertmanager). 
+ external_labels: + monitor: "prometheus-stack-monitor" + + # Load and evaluate rules in this file every 'evaluation_interval' seconds. + #rule_files: + # - "first.rules" + # - "second.rules" + + scrape_configs: + # scrape Prometheus itself + - job_name: prometheus + scrape_interval: 10s + scrape_timeout: 5s + static_configs: + - targets: ["localhost:9090"] + + # scrape Redis Enterprise + - job_name: redis-enterprise + scrape_interval: 30s + scrape_timeout: 30s + metrics_path: / + scheme: https + tls_config: + insecure_skip_verify: true + static_configs: + - targets: [":8070"] # For v2, use [":8070/v2"] + ``` + +1. Set up your Prometheus and Grafana servers. + To set up Prometheus and Grafana on Docker: + 1. Create a _docker-compose.yml_ file: + + ```yml + version: '3' + services: + prometheus-server: + image: prom/prometheus + ports: + - 9090:9090 + volumes: + - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml + + grafana-ui: + image: grafana/grafana + ports: + - 3000:3000 + environment: + - GF_SECURITY_ADMIN_PASSWORD=secret + links: + - prometheus-server:prometheus + ``` + + 1. To start the containers, run: + + ```sh + $ docker compose up -d + ``` + + 1. To check that all of the containers are up, run: `docker ps` + 1. In your browser, sign in to Prometheus at http://localhost:9090 to make sure the server is running. + 1. Select **Status** and then **Targets** to check that Prometheus is collecting data from your Redis Enterprise cluster. + + {{The Redis Enterprise target showing that Prometheus is connected to the Redis Enterprise Cluster.}} + + If Prometheus is connected to the cluster, you can type **node_up** in the Expression field on the Prometheus home page to see the cluster metrics. + +1. Configure the Grafana datasource: + 1. Sign in to Grafana. If you installed Grafana locally, go to http://localhost:3000 and sign in with: + + - Username: admin + - Password: secret + + 1. In the Grafana configuration menu, select **Data Sources**. + + 1. 
Select **Add data source**. + + 1. Select **Prometheus** from the list of data source types. + + {{The Prometheus data source in the list of data sources on Grafana.}} + + 1. Enter the Prometheus configuration information: + + - Name: `redis-enterprise` + - URL: `http://:9090` + + {{The Prometheus connection form in Grafana.}} + + {{< note >}} + +- If the network port is not accessible to the Grafana server, select the **Browser** option from the Access menu. +- In a testing environment, you can select **Skip TLS verification**. + + {{< /note >}} + +1. Add dashboards for cluster, database, node, and shard metrics. + To add preconfigured dashboards: + 1. In the Grafana dashboards menu, select **Manage**. + 1. Click **Import**. + 1. Upload one or more [Grafana dashboards](#grafana-dashboards-for-redis-enterprise). + +### Grafana dashboards for Redis Enterprise + +Redis publishes five preconfigured dashboards for Redis Enterprise and Grafana: + +* The [cluster status dashboard](https://grafana.com/grafana/dashboards/18405-cluster-status-dashboard/) provides an overview of your Redis Enterprise clusters. +* The [database status dashboard](https://grafana.com/grafana/dashboards/18408-database-status-dashboard/) displays specific database metrics, including latency, memory usage, ops/second, and key count. +* The [node metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-node-dashboard_v9-11.json) provides metrics for each of the nodes hosting your cluster.
+* The [shard metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-shard-dashboard_v9-11.json) displays metrics for the individual Redis processes running on your cluster nodes. +* The [Active-Active dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-active-active-dashboard_v9-11.json) displays metrics specific to [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}}). + +These dashboards are open source. For additional dashboard options, or to file an issue, see the [Redis Enterprise observability GitHub repository](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana). + +For more information about configuring Grafana dashboards, see the [Grafana documentation](https://grafana.com/docs/). + diff --git a/content/operate/rs/monitoring/v1_monitoring/_index.md b/content/operate/rs/monitoring/v1_monitoring/_index.md new file mode 100644 index 0000000000..d650e53b87 --- /dev/null +++ b/content/operate/rs/monitoring/v1_monitoring/_index.md @@ -0,0 +1,15 @@ +--- +Title: Monitoring with metrics and alerts +alwaysopen: false +categories: +- docs +- operate +- rs +- kubernetes +description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases.
+hideListLinks: true +linkTitle: V1 monitoring +weight: 50 +--- + +TBA diff --git a/content/operate/rs/monitoring/v1_monitoring/observability.md b/content/operate/rs/monitoring/v1_monitoring/observability.md new file mode 100644 index 0000000000..91b55460bb --- /dev/null +++ b/content/operate/rs/monitoring/v1_monitoring/observability.md @@ -0,0 +1,632 @@ +--- +Title: Redis Enterprise Software observability and monitoring guidance +alwaysopen: false +categories: +- docs +- integrate +- rs +description: Using monitoring and observability with Redis Enterprise +group: observability +linkTitle: Observability and monitoring +summary: Observe Redis Enterprise resources and database performance indicators. +type: integration +weight: 45 +--- + +## Introduction + +This document provides observability and monitoring guidance for developers running applications +that connect to Redis Enterprise. In particular, this guide focuses on the systems +and resources that are most likely to impact the performance of your application. + +The screenshot below shows a dashboard with relevant statistics for a node: +{{< image filename="/images/node_summary.png" alt="Dashboard showing relevant statistics for a Node" >}} + +To effectively monitor a Redis Enterprise cluster, you need to observe +core cluster resources and key database performance indicators as described in the following sections of this guide. + +Core cluster resources include: + +* Memory utilization +* CPU utilization +* Database connections +* Network traffic +* Synchronization + +Key database performance indicators include: + +* Latency +* Cache hit rate +* Key eviction rate +* Proxy Performance + +Dashboard showing an overview of cluster metrics: +{{< image filename="/images/cluster_overview.png" alt="Dashboard showing an overview of cluster metrics" >}} + +In addition to manually monitoring these resources and indicators, it is best practice to set up alerts.
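As an illustration of such an alert, the following Prometheus alerting rule flags a database that stays above 80% of its memory limit. The metric names `bdb_used_memory` and `bdb_memory_limit` are assumed here from the v1 scraping endpoint; verify them against the metrics your deployment actually exposes.

```yaml
# Hypothetical Prometheus alerting rule; metric names are assumptions
# drawn from the v1 scraping endpoint and should be verified.
groups:
  - name: redis-enterprise
    rules:
      - alert: DatabaseMemoryHigh
        expr: bdb_used_memory / bdb_memory_limit > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database memory usage above 80% of its limit"
```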
+ +## Core cluster resource monitoring + +Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at https://:8070/v2. This new engine exports all time-series metrics to external monitoring tools such as Grafana, Datadog, New Relic, and Dynatrace using Prometheus. + +The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shard failovers and scaling operations. See [Monitoring with metrics and alerts]({{}}) for more details. + +If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. You can scrape both existing and new endpoints simultaneously, which lets you create advanced dashboards and transition smoothly. + +### Memory + +Every Redis Enterprise database has a maximum configured memory limit to ensure isolation +in a multi-database cluster. + +| Metric name | Definition | Unit | +| ------ | ------ | :------ | +| Memory usage percentage metric | Percentage of used memory relative to the configured memory limit for a given database | Percentage | + +Dashboard displaying high-level cluster metrics - [Cluster Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/cluster_dashboard_v9-11.json) +{{< image filename="/images/playbook_used-memory.png" alt="Dashboard displaying high-level cluster metrics" >}} + +### Thresholds + +The appropriate memory threshold depends on how the application is using Redis. + +* Caching workloads, which permit Redis to evict keys, can safely use 100% of available memory. +* Non-caching workloads do not permit key eviction and should be closely monitored as soon as memory usage reaches 80%.
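The two thresholds above can be expressed as a small decision function. This is an illustrative sketch, not Redis Enterprise code; the function name and alert strings are invented for the example.

```python
# Illustrative sketch (not Redis Enterprise code): apply the memory
# thresholds described above to a database's used-memory reading.

def memory_alert(used_bytes: int, limit_bytes: int, caching_workload: bool) -> str:
    """Return an alert level based on the thresholds discussed above."""
    pct = 100 * used_bytes / limit_bytes
    if caching_workload:
        # Caching workloads may run at 100% as long as an eviction policy is set.
        return "ok" if pct <= 100 else "over-limit"
    # Non-caching workloads should be watched closely from 80% onward.
    if pct < 80:
        return "ok"
    return "warning: review memory growth" if pct < 100 else "critical: writes will fail"

# A 90%-full non-caching database warrants a warning.
print(memory_alert(9 * 1024**3, 10 * 1024**3, caching_workload=False))
```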
+ +### Caching workloads + +For applications using Redis solely as a cache, you can safely let the memory usage +reach 100% as long as you have an [eviction policy](https://redis.io/blog/cache-eviction-strategies/) in place. This will ensure +that Redis can evict keys while continuing to accept new writes. + +**Note:** When memory usage is at 100%, eviction increases write command latency, because Redis must free memory before it can accept each new write without running out of memory. + +While your Redis database is using 100% of available memory in a caching context, +it's still important to monitor performance. The key performance indicators include: + +* Latency +* Cache hit ratio +* Evicted keys + +### Read latency + +**Latency** has two important definitions, depending on context: + +* In the context of Redis itself, latency is **the time it takes for Redis +to respond to a request**. The [Latency](#latency) section below provides a broader discussion of this metric. + + + +* In the context of your application, latency is **the time it takes for the application +to process a request**. This will include the time it takes to execute both reads and writes +to Redis, as well as calls to other databases and services. Note that it's possible for +Redis to report low latency while the application is experiencing high latency. +This may indicate a low cache hit ratio, ultimately caused by insufficient memory. + +You need to monitor both application-level and Redis-level latency to diagnose +caching performance issues in production. + +### Cache hit ratio and eviction + +**Cache hit ratio** is the percentage of read requests that Redis serves successfully. +**Eviction rate** is the rate at which Redis evicts keys from the cache. These metrics +are sometimes inversely correlated: a high eviction rate may cause a low cache hit ratio if too many frequently-used keys are being evicted. + +If the Redis server is empty, the hit ratio will be 0%.
As the application runs and fills the cache, +the hit ratio will increase. + +**When the entire cached working set fits in memory**, the cache hit ratio will reach close to 100% +while the percent of used memory will remain below 100%. + +**When the working set cannot fit in memory**, the eviction policy will start to evict keys. +It is important to choose a policy that generally evicts rarely-used keys to keep the cache hit ratio as high as possible. + +In both cases, keys may be manually invalidated by the application or evicted through +the use of TTLs (time-to-live) and an eviction policy. + +The ideal cache hit ratio depends on the application, but generally, the ratio should be greater than 50%. +Low hit ratios coupled with high numbers of object evictions may indicate that your cache is too small. +This can cause thrashing on the application side, a scenario where the cache is constantly being invalidated. + +This means that when your Redis database is using 100% of available memory, you need +to measure the rate of +[key evictions]({{< relref "/operate/rs/references/metrics/database-operations#evicted-objectssec" >}}). + +An acceptable rate of key evictions depends on the total number of keys in the database +and the measure of application-level latency. If application latency is high, +check to see that key evictions have not increased. + +### Eviction Policies + +| Name | Description | +| ------ | :------ | +|noeviction | New values aren’t saved when memory limit is reached. When a database uses replication, this applies to the primary database | +|allkeys-lru | Keeps most recently used keys; removes least recently used (LRU) keys | +|allkeys-lfu | Keeps frequently used keys; removes least frequently used (LFU) keys | +|volatile-lru | Removes least recently used keys with the expire field set to true. | +|volatile-lfu | Removes least frequently used keys with the expire field set to true.
| +|allkeys-random | Randomly removes keys to make space for the new data added. | +|volatile-random | Randomly removes keys with expire field set to true. | +|volatile-ttl | Removes keys with expire field set to true and the shortest remaining time-to-live (TTL) value. | + + +### Eviction policy guidelines + +* Use the allkeys-lru policy when you expect a power-law distribution in the popularity of your requests. That is, you expect a subset of elements will be accessed far more often than the rest. This is a good policy to choose if you are unsure. + +* Use the allkeys-random policy if you have cyclic access patterns where all keys are scanned continuously, or when you expect the distribution to be uniform. + +* Use the volatile-ttl policy if you want to be able to provide hints to Redis about what are good candidates for expiration by using different TTL values when you create your cache objects. + +The volatile-lru and volatile-random policies are mainly useful when you want to use a single instance for both caching and to have a set of persistent keys. However, it is usually a better idea to run two Redis instances to solve such a problem. + +**Note:** Setting an expire value on a key costs memory, so using a policy like allkeys-lru is more memory efficient because there is no need for an expire configuration for the key to be evicted under memory pressure. + +### Non-caching workloads + +If no eviction policy is enabled, then Redis will stop accepting writes when memory usage reaches 100%. +Therefore, for non-caching workloads, it is best practice to configure an alert at 80% memory usage. +After your database reaches this 80% threshold, you should closely review the rate of memory usage growth.
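The eviction policies above can be made concrete with a small sketch. For example, allkeys-lru behaves like this minimal cache (illustrative only; this is not Redis internals, and the entry-count limit stands in for a byte-based memory limit):

```python
from collections import OrderedDict

# Minimal illustration of allkeys-lru semantics (not Redis internals):
# when the cache is full, the least recently used key is evicted so
# new writes can still be accepted.

class LRUCache:
    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None  # a cache miss
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the LRU key

cache = LRUCache(max_entries=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.set("c", 3)  # evicts "b", the least recently used key
```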
+ +### Troubleshooting + +|Issue | Possible causes | Remediation | +| ------ | ------ | :------ | +|Redis memory usage has reached 100% |This may indicate an insufficient Redis memory limit for your application's workload | For non-caching workloads (where eviction is unacceptable), immediately increase the memory limit for the database. You can accomplish this through the Redis Enterprise console or its API. Alternatively, you can contact Redis support to assist. For caching workloads, you need to monitor performance closely. Confirm that you have an [eviction policy]({{< relref "/operate/rs/databases/memory-performance/eviction-policy" >}}) in place. If your application's performance starts to degrade, you may need to increase the memory limit, as described above. | +|Redis has stopped accepting writes | Memory is at 100% and no eviction policy is in place | Increase the database's total amount of memory. If this is for a caching workload, consider enabling an [eviction policy]({{< relref "/operate/rs/databases/memory-performance/eviction-policy" >}}). In addition, you may want to determine whether the application can set a reasonable TTL (time-to-live) on some or all of the data being written to Redis. | +|Cache hit ratio is steadily decreasing | The application's working set size may be steadily increasing. Alternatively, the application may be misconfigured (for example, generating more than one unique cache key per cached item.) | If the working set size is increasing, consider increasing the memory limit for the database. If the application is misconfigured, review the application's cache key generation logic. 
| + + + +## CPU + +Redis Enterprise provides several CPU metrics: + +| Metric name | Definition | Unit | +| ------ | ------ | :------ | +| Shard CPU | CPU time portion spent by database shards as a percentage | up to 100% per shard | +| Proxy CPU | CPU time portion spent by the cluster's proxies as a percentage | 100% per proxy thread | +| Node CPU (User and System) | CPU time portion spent by all user-space and kernel-level processes as a percentage | 100% per node CPU | + + +To understand CPU metrics, it's worth recalling how a Redis Enterprise cluster is organized. +A cluster consists of one or more nodes. Each node is a VM (or cloud compute instance) or +a bare-metal server. + +A database is a set of processes, known as shards, deployed across the nodes of a cluster. + +In the dashboard, shard CPU is the CPU utilization of the processes that make up the database. +When diagnosing performance issues, start by looking at shard CPU. + +Dashboard displaying CPU usage - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json) +{{< image filename="/images/playbook_database-cpu-shard.png" alt="Dashboard displaying CPU usage" >}} + +### Thresholds + +In general, we define high CPU as any CPU utilization above 80% of total capacity. + +Shard CPU should remain below 80%. Shards are single-threaded, so a shard CPU of 100% means that the shard is fully utilized. + +Display showing Proxy CPU usage - [Proxy Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/proxy_dashboard_v9-11.json) +{{< image filename="/images/playbook_proxy-cpu-usage.png" alt="Display showing Proxy CPU usage" >}} + +Proxy CPU should remain below 80% of total capacity. +The proxy is a multi-threaded process that handles client connections and forwards requests to the appropriate shard.
+Because the total number of proxy threads is configurable, the proxy CPU may exceed 100%. +A proxy configured with 6 threads can reach 600% CPU utilization, so in this case, +keeping utilization below 80% means keeping the total proxy CPU usage below 480%. + +Dashboard displaying an ensemble of Node CPU usage data - [Node Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/node_dashboard_v9-11.json) +{{< image filename="/images/node_cpu.png" alt="Dashboard displaying an ensemble of Node CPU usage data" >}} + +Node CPU should also remain below 80% of total capacity. As with the proxy, the node CPU is variable depending +on the CPU capacity of the node. You will need to calibrate your alerting based on the number of cores in your nodes. + +### Troubleshooting + +High CPU utilization has multiple possible causes. Common causes include an under-provisioned cluster, +an excess of inefficient Redis operations, and hot master shards. + + +| Issue | Possible causes | Remediation | +| ------ | ------ | :------ | +|High CPU utilization across all shards of a database | This usually indicates that the database is under-provisioned in terms of number of shards. A secondary cause may be that the application is running too many inefficient Redis operations. | You can detect slow Redis operations by enabling the slow log in the Redis Enterprise UI. First, rule out inefficient Redis operations as the cause of the high CPU utilization. The Latency section below includes a broader discussion of this metric in the context of your application. If inefficient Redis operations are not the cause, then increase the number of shards in the database. | +|High CPU utilization on a single shard, with the remaining shards having low CPU utilization | This usually indicates a master shard with at least one hot key.
Hot keys are keys that are accessed extremely frequently (for example, more than 1000 times per second). | Hot key issues generally cannot be resolved by increasing the number of shards. To resolve this issue, see the section on Hot keys below. | +| High Proxy CPU | There are several possible causes of high proxy CPU. First, review the behavior of connections to the database. Frequent cycling of connections, especially when TLS is enabled, can cause high proxy CPU utilization. This is especially true when you see more than 100 connections per second per thread. Such behavior is almost always a sign of a misbehaving application. Review the total number of operations per second against the cluster. If you see more than 50k operations per second per thread, you may need to increase the number of proxy threads. | In the case of high connection cycling, review the application's connection behavior. In the case of high operations per second, [increase the number of proxy threads]({{< relref "/operate/rs/references/cli-utilities/rladmin/tune#tune-proxy" >}}). | +|High Node CPU | You will typically detect high shard or proxy CPU utilization before you detect high node CPU utilization. Use the remediation steps above to address high shard and proxy CPU utilization. If you still see high node CPU utilization, you may need to increase the number of nodes in the cluster. | Consider increasing the number of nodes in the cluster and rebalancing the shards across the new nodes. This is a complex operation and you should do it with the help of Redis support. | +|High System CPU | Most of the issues above will reflect user-space CPU utilization. However, if you see high system CPU utilization, this may indicate a problem at the network or storage level. | Review network bytes in and network bytes out to rule out any unexpected spikes in network traffic. You may need to perform some deeper network diagnostics to identify the cause of the high system CPU utilization.
For example, with high rates of packet loss, you may need to review network configurations or even the network hardware. | + +## Connections + +The Redis Enterprise database dashboard indicates the total number of connections to the database. + +You should monitor this connection count metric with both a minimum and maximum number of connections in mind. +Based on the number of application instances connecting to Redis (and whether your application uses [connection pooling]({{< relref "/develop/clients/pools-and-muxing" >}})), +you should have a rough idea of the minimum and maximum number of connections you expect to see for any given database. +This number should remain relatively constant over time. + +### Troubleshooting + +| Issue | Possible causes | Remediation | +| ------ | ------ | :------ | +|Fewer connections to Redis than expected |The application may not be connecting to the correct Redis database. There may be a network partition between the application and the Redis database. | Confirm that the application can successfully connect to Redis. This may require consulting the application logs or the application's connection configuration. | +|Connection count continues to grow over time | Your application may not be releasing connections. The most common cause of such a connection leak is a manually implemented connection pool or a connection pool that is not properly configured. | Review the application's connection configuration. | +|Erratic connection counts (for example, spikes and drops) | Application misbehavior ([thundering herds](https://en.wikipedia.org/wiki/Thundering_herd_problem), connection cycling, or networking issues) | Review the application logs and network traffic to determine the cause of the erratic connection counts.
| + + +Dashboard displaying connections - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json) +{{< image filename="/images/playbook_database-used-connections.png" alt="Dashboard displaying connections" >}} + +### Network ingress/egress + +The network ingress/egress panel shows the amount of data being sent to and received from the database. +Large spikes in network traffic can indicate that the cluster is under-provisioned or that +the application is reading and/or writing unusually [large keys](#large-keys). A correlation between high network traffic +and high CPU utilization may indicate a large key scenario. + +#### Unbalanced database endpoint + +One possible cause of network traffic spikes is that the database endpoint is not located on the same node as the master shards. In addition to added network latency, if data plane internode encryption is enabled, CPU consumption can increase as well. + +One solution is to use the optimal shard placement and proxy policy to ensure endpoints are collocated on nodes hosting master shards. If you need to restore balance (for example, after a node failure), you can manually fail over shards with the `rladmin` CLI tool. + +Extreme network traffic utilization may approach the limits of the underlying network infrastructure. +In this case, the only remediation is to add more nodes to the cluster and scale the database's shards across them. + +## Synchronization + +In Redis Enterprise, geographically-distributed synchronization is based on Conflict-free replicated data types (CRDT) technology. +The Redis Enterprise implementation of CRDT is called an Active-Active database (formerly known as CRDB).
+With Active-Active databases, applications can read and write to the same data set from different geographical locations seamlessly and with low latency, without changing the way the application connects to the database. + +An Active-Active architecture is a data resiliency architecture that distributes the database information over multiple data centers using independent and geographically distributed clusters and nodes. +It is a network of separate processing nodes, each with access to a common replicated database, such that all nodes can participate in a common application. This ensures local low latency, and each region can continue to run in isolation. + +To achieve consistency between participating clusters, Redis Active-Active synchronization uses a process called the syncer. + +The syncer keeps a replication backlog, which stores changes to the dataset that the syncer sends to other participating clusters. +The syncer uses partial syncs to keep replicas up to date with changes, or a full sync in the event a replica or primary is lost. + +Dashboard displaying connection metrics between zones - [Synchronization Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/synchronization_dashboard_v9-11.json) +{{< image filename="/images/playbook_network-connectivity.png" alt="Dashboard displaying connection metrics between zones" >}} + +CRDT provides three fundamental benefits over other geo-distributed solutions: + +* It offers local latency on read and write operations, regardless of the number of geo-replicated regions and their distance from each other. +* It enables seamless conflict resolution (“conflict-free”) for simple and complex data types like those of Redis core.
+* Even if most of the geo-replicated regions in a CRDT database (for example, 3 out of 5) are down, the remaining geo-replicated regions are uninterrupted and can continue to handle read and write operations, ensuring business continuity. + +## Database performance indicators + +There are several key performance indicators that report your database's performance against your application's workload: + +* Latency +* Cache hit rate +* Key eviction rate + +### Latency + +Latency is **the time it takes for Redis to respond to a request**. +Redis Enterprise measures latency from the first byte received by the proxy to the last byte sent in the command's response. + +An adequately provisioned Redis database running efficient Redis operations will report an average latency below 1 millisecond. In fact, it's common to measure +latency in terms of microseconds. Businesses regularly achieve, and sometimes require, average latencies of 400-600 +microseconds. + +Dashboard display of latency metrics - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json) +{{< image filename="/images/playbook_database-cluster-latency.png" alt="Dashboard display of latency metrics" >}} + +The metrics distinguish between read and write latency. Understanding whether high latency is due +to reads or writes can help you to isolate the underlying issue. + +Note that these latency metrics do not include network round trip time or application-level serialization, +which is why it's essential to measure request latency at the application level as well. + +Display showing a noticeable spike in latency +{{< image filename="/images/latency_spike.png" alt="Display showing a noticeable spike in latency" >}} + +### Troubleshooting + +Here are some possible causes of high database latency. Note that high database latency is just one of the reasons +why application latency might be high.
Application latency can be caused by a variety of factors, including +a low cache hit rate. + +| Issue | Possible causes | Remediation | +| ------ | ------ | :------ | +|Slow database operations | Confirm that there are no excessive slow operations in the Redis slow log. | If possible, reduce the number of slow operations being sent to the database.
If this is not possible, consider increasing the number of shards in the database. | +|Increased traffic to the database | Review the network traffic and the database operations per second chart to determine if increased traffic is causing the latency. | If the database is underprovisioned due to increased traffic, consider increasing the number of shards in the database. | +|Insufficient CPU | Check to see if the CPU utilization is increasing. | Confirm that slow operations are not causing the high CPU utilization. If the high CPU utilization is due to increased load, consider adding shards to the database. | + +## Cache hit rate + +**Cache hit rate** is the percentage of all read operations that successfully find the requested key. **Note:** Cache hit rate is a composite statistic that is computed by dividing the number of read hits by the total number of read operations. +When an application tries to read a key that exists, this is known as a **cache hit**. +Alternatively, when an application tries to read a key that does not exist, this is known as a **cache miss**. + +For caching workloads, the cache hit rate should generally be above 50%, although +the exact ideal cache hit rate can vary greatly depending on the application and depending on whether the cache +is already populated. + +Dashboard showing the cache hit ratio along with read/write misses - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json) +{{< image filename="/images/playbook_cache-hit.png" alt="Dashboard showing the cache hit ratio along with read/write misses" >}} + +**Note:** Redis Enterprise actually reports four different cache hit / miss metrics.
+These are defined as follows: + +| Metric name | Definition | +| ------ | :------ | +| bdb_read_hits | The number of successful read operations | +| bdb_read_misses | The number of read operations returning null | +| bdb_write_hits | The number of write operations against existing keys | +| bdb_write_misses | The number of write operations that create new keys | + +### Troubleshooting + +Cache hit rate is usually only relevant for caching workloads. Eviction will begin when the database approaches its maximum memory capacity. + +A high or increasing rate of evictions will negatively affect database latency, especially +if the rate of necessary key evictions exceeds the rate of new key insertions. + +See the [Cache hit ratio and eviction](#cache-hit-ratio-and-eviction) section for tips on troubleshooting cache hit rate. + +## Key eviction rate + +The **key eviction rate** is the rate at which objects are evicted from the database. +See [eviction policy]({{< relref "/operate/rs/databases/memory-performance/eviction-policy" >}}) for a discussion of key eviction and its relationship with memory usage. + +Dashboard displaying object evictions - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json) +{{< image filename="/images/playbook_eviction-expiration.png" alt="Dashboard displaying object evictions">}} + +## Proxy Performance + +Redis Enterprise Software (RS) provides high-performance data access through a proxy process that manages and optimizes access to shards within the RS cluster. Each node contains a single proxy process. Each proxy can be active and take incoming traffic or it can be passive and wait for failovers. + +### Proxy Policies + + +| Policy | Description | +| ------ | :------ | +|Single | There is only a single proxy that is bound to the database.
This is the default database configuration and preferable in most use cases. | +|All Master Shards | There are multiple proxies that are bound to the database, one on each node that hosts a database master shard. This mode fits most use cases that require multiple proxies. | +|All Nodes | There are multiple proxies that are bound to the database, one on each node in the cluster, regardless of whether or not there is a shard from this database on the node. This mode should be used only in special cases, such as using a load balancer. | + +Dashboard displaying proxy thread activity - [Proxy Thread Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/cloud/basic/redis-cloud-proxy-dashboard_v9-11.json) +{{< image filename="/images/proxy-thread-dashboard.png" alt="Dashboard displaying proxy thread activity" >}} + +If you need to, you can tune the number of proxy threads using the [`rladmin tune proxy`]({{< relref "/operate/rs/references/cli-utilities/rladmin/tune#tune-proxy" >}}) command to make the proxy use more CPU cores. +Cores used by the proxy won't be available for Redis, so you need to take into account the number of Redis nodes on the host and the total number of available cores. + +The command has a few parameters you can use to set a new number of proxy cores: + +* `id|all` - you can either tune a specific proxy by its ID, or all proxies. + +* `mode` - determines whether or not the proxy can automatically adjust the number of threads depending on load. + +* `threads` and `max_threads` - determine the initial number of threads created on startup, and the maximum number of threads allowed. + +* `scale_threshold` - determines the CPU utilization threshold that triggers spawning new threads. This CPU utilization level needs to be maintained for at least `scale_duration` seconds before automatic scaling is performed.
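The interaction between `scale_threshold` and `scale_duration` can be sketched as follows. This is an illustrative model of the behavior described above, not the actual proxy implementation; the function name and sample values are invented for the example.

```python
# Illustrative sketch of the automatic proxy scaling logic described above
# (not the actual Redis Enterprise implementation): a new thread is spawned
# only after CPU utilization stays at or above scale_threshold for at least
# scale_duration consecutive seconds.

def should_scale(cpu_samples, scale_threshold: float, scale_duration: int) -> bool:
    """cpu_samples: one utilization reading (0-100) per second."""
    run = 0
    for cpu in cpu_samples:
        run = run + 1 if cpu >= scale_threshold else 0
        if run >= scale_duration:
            return True
    return False

# A brief spike does not trigger scaling; sustained load does.
print(should_scale([95, 95, 40, 95], scale_threshold=80, scale_duration=3))
print(should_scale([85, 90, 88, 92], scale_threshold=80, scale_duration=3))
```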
+ +The following table indicates ideal proxy thread counts for the specified environments. + + +| Total Cores | Redis (ROR) | Redis on Flash (ROF) | +| ------ | ------ | :------ | +|1|1|1 | +|4|3|3 | +|8|5|3 | +|12|8|4 | +|16|10|5 | +|32|24|10 | +|64/96|32|20 | +|128|32|32 | + + +## Data access anti-patterns + +There are three data access patterns that can limit the performance of your Redis database: + +* Slow operations +* Hot keys +* Large keys + +This section defines each of these patterns and describes how to diagnose and mitigate them. + +## Slow operations + +**Slow operations** are operations that take longer than a few milliseconds to complete. + +Not all Redis operations are equally efficient. +The most efficient Redis operations are O(1) operations; that is, they have a constant time complexity. +Examples of such operations include [GET]({{< relref "/commands/get" >}}), +[SET]({{< relref "/commands/set" >}}), [SADD]({{< relref "/commands/sadd" >}}), +and [HSET]({{< relref "/commands/hset" >}}). + +These constant time operations are unlikely to cause high CPU utilization. **Note:** Even so, +it's still possible for a high rate of constant time operations to overwhelm an underprovisioned database. + +Other Redis operations exhibit greater levels of time complexity. +O(n) (linear time) operations are more likely to cause high CPU utilization. +Examples include [HGETALL]({{< relref "/commands/hgetall" >}}), [SMEMBERS]({{< relref "/commands/smembers" >}}), +and [LREM]({{< relref "/commands/lrem" >}}). +These operations are not necessarily problematic, but they can be if executed against data structures holding +a large number of elements (for example, a list with 1 million elements).
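To build intuition for why time complexity matters, here is a toy cost model. The command names are real Redis commands, but the "step" counts are schematic and invented for illustration; they are not real timings.

```python
# Illustrative sketch: why command time complexity matters. A rough cost
# model in abstract "steps", purely for intuition (not real Redis timings).

COMMAND_COMPLEXITY = {
    "GET": lambda n: 1,       # O(1)
    "SET": lambda n: 1,       # O(1)
    "HGETALL": lambda n: n,   # O(n) in the number of hash fields
    "SMEMBERS": lambda n: n,  # O(n) in the set cardinality
}

def estimated_cost(command: str, collection_size: int) -> int:
    return COMMAND_COMPLEXITY[command](collection_size)

# O(1) commands cost the same regardless of data-structure size,
# while an O(n) command over a million-element structure does a
# million steps' worth of work on a single-threaded shard.
assert estimated_cost("GET", 1_000_000) == estimated_cost("GET", 10)
print(estimated_cost("HGETALL", 1_000_000))
```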
If you need to scan the keyspace, especially in a production cluster, always use the
+[SCAN]({{< relref "/commands/scan" >}}) command instead.
+
+### Troubleshooting
+
+The best way to discover slow operations is to view the slow log.
+The slow log is available in the Redis Enterprise and Redis Cloud consoles:
+* [Redis Enterprise slow log docs]({{< relref "/operate/rs/clusters/logging/redis-slow-log" >}})
+* [Redis Cloud slow log docs]({{< relref "/operate/rc/databases/view-edit-database#other-actions-and-info" >}})
+
+Redis Cloud dashboard showing slow database operations
+{{< image filename="/images/slow_log.png" alt="Redis Cloud dashboard showing slow database operations" >}}
+
+| Issue | Remediation |
+| ------ | :------ |
+|The KEYS command shows up in the slow log |Find the application that issues the KEYS command and replace it with a SCAN command. In an emergency situation, you can [alter the ACLs for the database user]({{< relref "/operate/rs/security/access-control/redis-acl-overview" >}}) so that Redis will reject the KEYS command altogether. |
+|The slow log shows a significant number of slow, O(n) operations | If these operations are being issued against large data structures, then the application may need to be refactored to use more efficient Redis commands. |
+|The slow log contains only O(1) commands, and these commands are taking several milliseconds or more to complete |This likely indicates that the database is underprovisioned. Consider increasing the number of shards and/or nodes. |
+
+
+## Hot keys
+
+A **hot key** is a key that is accessed extremely frequently (for example, thousands of times a second or more).
+
+Each key in Redis belongs to one, and only one, shard.
+For this reason, a hot key can cause high CPU utilization on that one shard,
+which can increase latency for all other operations.
+
+### Troubleshooting
+
+You may suspect that you have a hot key if you see high CPU utilization on a single shard.
+
+There are two main ways to identify hot keys: using the Redis CLI and sampling the operations against Redis.
+
+To use the Redis CLI to identify hot keys:
+
+1. First confirm that you have enough available memory to enable an eviction policy.
+2. Next, enable the LFU (least-frequently used) eviction policy on the database.
+3. Finally, run `redis-cli --hotkeys`.
+
+You may also identify hot keys by sampling the operations against Redis.
+You can do this by running the [MONITOR]({{< relref "/commands/monitor" >}}) command
+against the high-CPU shard. Because this is a potentially high-impact operation, you should only
+use this technique as a secondary option. For mission-critical databases, consider
+contacting Redis support for assistance.
+
+### Remediation
+
+After you discover a hot key, you need to find a way to reduce the number of operations against it.
+This means getting an understanding of the application's access pattern and the reasons for such frequent access.
+
+If the hot key operations are read-only, consider implementing an application-local cache so
+that fewer read requests are sent to Redis. For example, even a local cache that expires every 5 seconds
+can entirely eliminate a hot key issue.
+
+## Large keys
+
+**Large keys** are keys that are hundreds of kilobytes or larger.
+Large keys can cause high network traffic and high CPU utilization.
+
+### Troubleshooting
+
+To identify large keys, you can sample the keyspace using the Redis CLI.
+
+Run `redis-cli --memkeys` against your database to sample the keyspace in real time
+and potentially identify the largest keys in your database.
+
+### Remediation
+
+Addressing a large key issue requires understanding why the application is creating large keys in the first place.
+As such, it's difficult to provide general advice for solving this issue. Resolution often requires a change
+to the application's data model or the way it interacts with Redis.
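+
+The hot key and large key sampling commands described above can be run from any host that can reach the database endpoint. A sketch with placeholder connection details (`--hotkeys` only works when an LFU eviction policy is set, as noted earlier):
+
+```
+# Sample the keyspace for hot keys (requires an LFU maxmemory policy)
+redis-cli -h <host> -p <port> -a <password> --hotkeys
+
+# Sample the keyspace for the keys using the most memory
+redis-cli -h <host> -p <port> -a <password> --memkeys
+```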
+
+## Alerting
+
+The Redis Enterprise observability package includes [a suite of alerts and their associated tests for use with Prometheus](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana#alerts). **Note:** Not all the alerts are appropriate for all environments; for example, installations that do not use persistence have no need for storage alerts.
+
+The alerts are packaged with [a series of tests](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/tests)
+that validate the individual triggers. You can use these tests to validate your modifications to these alerts for specific environments and use cases.
+
+To use these alerts, install [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/configuration/).
+For a comprehensive guide to alerting with Prometheus and Grafana,
+see the [Grafana blog post on the subject](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/).
+
+## Configuring Prometheus
+
+To configure Prometheus for alerting, open the `prometheus.yml` configuration file.
+
+Uncomment the `Alertmanager` section of the file.
+The following configuration tells Prometheus to send alerts to an Alertmanager instance listening on its default port, 9093.
+
+```
+# Alertmanager configuration
+alerting:
+  alertmanagers:
+    - static_configs:
+        - targets:
+            - alertmanager:9093
+```
+
+The `rule_files` section of the config file instructs Prometheus to read specific rules files.
+If you pasted the `alerts.yml` file into `/etc/prometheus`, then the following configuration is required.
+
+```
+# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
+rule_files:
+  - "error_rules.yml"
+  - "alerts.yml"
+```
+
+After you've done this, restart Prometheus.
+
+The built-in configuration, `error_rules.yml`, has a single alert: Critical Connection Exception.
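+
+A rules file such as `alerts.yml` contains standard Prometheus alerting rules. As a hedged illustration only (this is not the shipped rule; the expression mirrors the latency trigger listed in the alerts table), an entry has this general shape:
+
+```
+groups:
+  - name: redis-enterprise-examples
+    rules:
+      - alert: DatabaseLatencyWarning
+        expr: round(bdb_avg_latency * 1000) > 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "Average database latency has reached a warning level"
+```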
+
+If you open the Prometheus console, by default located at port 9090, and select the Alerts tab,
+you will see this alert, as well as the alerts in any other file you have included as a rules file.
+
+{{< image filename="/images/playbook_prometheus-alerts.png" alt="prometheus alerts image" >}}
+
+The following is a list of alerts contained in the `alerts.yml` file. There are several points to consider:
+
+- Not all Redis Enterprise deployments export all metrics
+- Most metrics only alert if the specified trigger persists for a given duration
+
+## List of alerts
+
+| Description | Trigger |
+| ------ | :------ |
+|Average latency has reached a warning level | round(bdb_avg_latency * 1000) > 1 |
+|Average latency has reached a critical level indicating system degradation | round(bdb_avg_latency * 1000) > 4 |
+|Absence of any connection indicates improper configuration or firewall issue | bdb_conns < 1 |
+|A flood of connections has occurred that will impact normal operations | bdb_conns > 64000 |
+|Absence of any requests indicates improperly configured clients | bdb_total_req < 1 |
+|Excessive number of client requests indicates configuration and/or programmatic issues | bdb_total_req > 1000000 |
+|The database in question will soon be unable to accept new data | round((bdb_used_memory/bdb_memory_limit) * 100) > 98 |
+|The database in question will be unable to accept new data in two hours | round((bdb_used_memory/bdb_memory_limit) * 100) < 98 and (predict_linear(bdb_used_memory[15m], 2 * 3600) / bdb_memory_limit) > 0.3 and round(predict_linear(bdb_used_memory[15m], 2 * 3600)/bdb_memory_limit) > 0.98 |
+|Database read operations are failing to find entries more than 50% of the time | (100 * bdb_read_hits)/(bdb_read_hits + bdb_read_misses) < 50 |
+|In situations where TTL values are not set this indicates a problem | bdb_evicted_objects > 1 |
+|Replication between nodes is not in a satisfactory state | bdb_replicaof_syncer_status > 0 |
+|Record synchronization 
between nodes is not in a satisfactory state | bdb_crdt_syncer_status > 0 | +|The amount by which replication lags behind events is worrisome | bdb_replicaof_syncer_local_ingress_lag_time > 500 | +|The amount by which object replication lags behind events is worrisome | bdb_crdt_syncer_local_ingress_lag_time > 500 | +|The number of active nodes is less than expected | count(node_up) != 3 | +|Persistent storage will soon be exhausted | round((node_persistent_storage_free/node_persistent_storage_avail) * 100) <= 5 | +|Ephemeral storage will soon be exhausted | round((node_ephemeral_storage_free/node_ephemeral_storage_avail) * 100) <= 5 | +|The node in question is close to running out of memory | round((node_available_memory/node_free_memory) * 100) <= 15 | +|The node in question has exceeded expected levels of CPU usage | round((1 - node_cpu_idle) * 100) >= 80 | +|The shard in question is not reachable | redis_up == 0 | +|The master shard is not reachable | floor(redis_master_link_status{role="slave"}) < 1 | +|The shard in question has exceeded expected levels of CPU usage | redis_process_cpu_usage_percent >= 80 | +|The master shard has exceeded expected levels of CPU usage | redis_process_cpu_usage_percent{role="master"} > 0.75 and redis_process_cpu_usage_percent{role="master"} > on (bdb) group_left() (avg by (bdb)(redis_process_cpu_usage_percent{role="master"}) + on(bdb) 1.2 * stddev by (bdb) (redis_process_cpu_usage_percent{role="master"})) | +|The shard in question has an unhealthily high level of connections | redis_connected_clients > 500 | + +## Appendix A: Grafana Dashboards + +Grafana dashboards are available for Redis Enterprise Software and Redis Cloud deployments. + +These dashboards come in three styles, which may be used together to provide +a full picture of your deployment. + +1. Classic dashboards provide detailed information about the cluster, nodes, and individual databases. +2. 
Basic dashboards provide high-level overviews of the various cluster components.
+3. Extended dashboards require a third-party library to perform REST calls.
+
+There are also two workflow dashboards for Redis Enterprise Software that provide drill-down functionality.
+
+### Software
+- [Basic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/basic)
+- [Extended](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/extended)
+- [Classic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/classic)
+
+### Workflow
+- [Database](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/workflow/databases)
+- [Node](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/workflow/nodes)
+
+### Cloud
+- [Basic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/cloud/basic)
+- [Extended](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/cloud/extended)
+
+**Note:** The 'workflow' dashboards are intended to be used as a package. Therefore they should all be installed, as they contain links to the other dashboards in the group, permitting rapid navigation between the overview and the drill-down views.
diff --git a/content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md b/content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md new file mode 100644 index 0000000000..424fbf3844 --- /dev/null +++ b/content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md @@ -0,0 +1,282 @@ +--- +Title: Prometheus metrics v1 +alwaysopen: false +categories: +- docs +- integrate +- rs +description: V1 metrics available to Prometheus. +group: observability +linkTitle: Prometheus metrics v1 +summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. +type: integration +weight: 48 +--- + +You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. + +As of Redis Enterprise Software version 7.8.2, v1 metrics are deprecated but still available. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). + +The following tables include the v1 metrics available to Prometheus. 
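+
+For context, these metrics are collected by pointing Prometheus at the cluster's metrics exporter. A minimal scrape job might look like the following sketch; the port (8070), HTTPS scheme, and TLS settings are assumptions based on a typical Redis Enterprise setup, so confirm them for your deployment:
+
+```
+scrape_configs:
+  - job_name: redis-enterprise
+    scrape_interval: 30s
+    scheme: https
+    tls_config:
+      insecure_skip_verify: true
+    static_configs:
+      - targets: ["cluster.example.com:8070"]
+```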
+
+## Database metrics
+
+| Metric | Description |
+| ------ | :------ |
+| bdb_avg_latency | Average latency of operations on the database (seconds); returned only when there is traffic |
+| bdb_avg_latency_max | Highest value of average latency of operations on the database (seconds); returned only when there is traffic |
+| bdb_avg_read_latency | Average latency of read operations (seconds); returned only when there is traffic |
+| bdb_avg_read_latency_max | Highest value of average latency of read operations (seconds); returned only when there is traffic |
+| bdb_avg_write_latency | Average latency of write operations (seconds); returned only when there is traffic |
+| bdb_avg_write_latency_max | Highest value of average latency of write operations (seconds); returned only when there is traffic |
+| bdb_bigstore_shard_count | Shard count by database and by storage engine (driver - rocksdb / speedb); only for databases with Auto Tiering enabled |
+| bdb_conns | Number of client connections to the database |
+| bdb_egress_bytes | Rate of outgoing network traffic from the database (bytes/sec) |
+| bdb_egress_bytes_max | Highest value of the rate of outgoing network traffic from the database (bytes/sec) |
+| bdb_evicted_objects | Rate of key evictions from database (evictions/sec) |
+| bdb_evicted_objects_max | Highest value of the rate of key evictions from database (evictions/sec) |
+| bdb_expired_objects | Rate of keys expired in the database (expirations/sec) |
+| bdb_expired_objects_max | Highest value of the rate of keys expired in the database (expirations/sec) |
+| bdb_fork_cpu_system | % cores utilization in system mode for all Redis shard fork child processes of this database |
+| bdb_fork_cpu_system_max | Highest value of % cores utilization in system mode for all Redis shard fork child processes of this database |
+| bdb_fork_cpu_user | % cores utilization in user mode for all Redis shard fork child processes of this database |
+| bdb_fork_cpu_user_max | Highest 
value of % cores utilization in user mode for all Redis shard fork child processes of this database | +| bdb_ingress_bytes | Rate of incoming network traffic to the database (bytes/sec) | +| bdb_ingress_bytes_max | Highest value of the rate of incoming network traffic to the database (bytes/sec) | +| bdb_instantaneous_ops_per_sec | Request rate handled by all shards of database (ops/sec) | +| bdb_main_thread_cpu_system | % cores utilization in system mode for all Redis shard main threads of this database | +| bdb_main_thread_cpu_system_max | Highest value of % cores utilization in system mode for all Redis shard main threads of this database | +| bdb_main_thread_cpu_user | % cores utilization in user mode for all Redis shard main threads of this database | +| bdb_main_thread_cpu_user_max | Highest value of % cores utilization in user mode for all Redis shard main threads of this database | +| bdb_mem_frag_ratio | RAM fragmentation ratio (RSS / allocated RAM) | +| bdb_mem_size_lua | Redis lua scripting heap size (bytes) | +| bdb_memory_limit | Configured RAM limit for the database | +| bdb_monitor_sessions_count | Number of clients connected in monitor mode to the database | +| bdb_no_of_keys | Number of keys in database | +| bdb_other_req | Rate of other (non read/write) requests on the database (ops/sec) | +| bdb_other_req_max | Highest value of the rate of other (non read/write) requests on the database (ops/sec) | +| bdb_other_res | Rate of other (non read/write) responses on the database (ops/sec) | +| bdb_other_res_max | Highest value of the rate of other (non read/write) responses on the database (ops/sec) | +| bdb_pubsub_channels | Count the pub/sub channels with subscribed clients | +| bdb_pubsub_channels_max | Highest value of count the pub/sub channels with subscribed clients | +| bdb_pubsub_patterns | Count the pub/sub patterns with subscribed clients | +| bdb_pubsub_patterns_max | Highest value of count the pub/sub patterns with subscribed clients | +| 
bdb_read_hits | Rate of read operations accessing an existing key (ops/sec) | +| bdb_read_hits_max | Highest value of the rate of read operations accessing an existing key (ops/sec) | +| bdb_read_misses | Rate of read operations accessing a non-existing key (ops/sec) | +| bdb_read_misses_max | Highest value of the rate of read operations accessing a non-existing key (ops/sec) | +| bdb_read_req | Rate of read requests on the database (ops/sec) | +| bdb_read_req_max | Highest value of the rate of read requests on the database (ops/sec) | +| bdb_read_res | Rate of read responses on the database (ops/sec) | +| bdb_read_res_max | Highest value of the rate of read responses on the database (ops/sec) | +| bdb_shard_cpu_system | % cores utilization in system mode for all redis shard processes of this database | +| bdb_shard_cpu_system_max | Highest value of % cores utilization in system mode for all Redis shard processes of this database | +| bdb_shard_cpu_user | % cores utilization in user mode for the redis shard process | +| bdb_shard_cpu_user_max | Highest value of % cores utilization in user mode for the Redis shard process | +| bdb_shards_used | Used shard count by database and by shard type (ram / flash) | +| bdb_total_connections_received | Rate of new client connections to the database (connections/sec) | +| bdb_total_connections_received_max | Highest value of the rate of new client connections to the database (connections/sec) | +| bdb_total_req | Rate of all requests on the database (ops/sec) | +| bdb_total_req_max | Highest value of the rate of all requests on the database (ops/sec) | +| bdb_total_res | Rate of all responses on the database (ops/sec) | +| bdb_total_res_max | Highest value of the rate of all responses on the database (ops/sec) | +| bdb_up | Database is up and running | +| bdb_used_memory | Memory used by the database (in BigRedis this includes flash) (bytes) | +| bdb_write_hits | Rate of write operations accessing an existing key (ops/sec) | +| 
bdb_write_hits_max | Highest value of the rate of write operations accessing an existing key (ops/sec) | +| bdb_write_misses | Rate of write operations accessing a non-existing key (ops/sec) | +| bdb_write_misses_max | Highest value of the rate of write operations accessing a non-existing key (ops/sec) | +| bdb_write_req | Rate of write requests on the database (ops/sec) | +| bdb_write_req_max | Highest value of the rate of write requests on the database (ops/sec) | +| bdb_write_res | Rate of write responses on the database (ops/sec) | +| bdb_write_res_max | Highest value of the rate of write responses on the database (ops/sec) | +| no_of_expires | Current number of volatile keys in the database | + +## Node metrics + +| Metric | Description | +| ------ | :------ | +| node_available_flash | Available flash in the node (bytes) | +| node_available_flash_no_overbooking | Available flash in the node (bytes), without taking into account overbooking | +| node_available_memory | Amount of free memory in the node (bytes) that is available for database provisioning | +| node_available_memory_no_overbooking | Available ram in the node (bytes) without taking into account overbooking | +| node_avg_latency | Average latency of requests handled by endpoints on the node in milliseconds; returned only when there is traffic | +| node_bigstore_free | Sum of free space of back-end flash (used by flash database's BigRedis) on all cluster nodes (bytes); returned only when BigRedis is enabled | +| node_bigstore_iops | Rate of i/o operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled | +| node_bigstore_kv_ops | Rate of value read/write operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled | +| node_bigstore_throughput | Throughput i/o operations against back-end flash 
for all shards which are part of a flash-based database (BigRedis) in the cluster (bytes/sec); returned only when BigRedis is enabled |
+| node_cert_expiration_seconds | Certificate expiration (in seconds) per given node; read more about [certificates in Redis Enterprise]({{< relref "/operate/rs/security/certificates" >}}) and [monitoring certificates]({{< relref "/operate/rs/security/certificates/monitor-certificates" >}}) |
+| node_conns | Number of clients connected to endpoints on node |
+| node_cpu_idle | CPU idle time portion (0-1, multiply by 100 to get percent) |
+| node_cpu_idle_max | Highest value of CPU idle time portion (0-1, multiply by 100 to get percent) |
+| node_cpu_idle_median | Average value of CPU idle time portion (0-1, multiply by 100 to get percent) |
+| node_cpu_idle_min | Lowest value of CPU idle time portion (0-1, multiply by 100 to get percent) |
+| node_cpu_system | CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
+| node_cpu_system_max | Highest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
+| node_cpu_system_median | Average value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
+| node_cpu_system_min | Lowest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) |
+| node_cpu_user | CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
+| node_cpu_user_max | Highest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
+| node_cpu_user_median | Average value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
+| node_cpu_user_min | Lowest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) |
+| node_cur_aof_rewrites | Number of AOF rewrites that are currently performed by shards on this node |
+| node_egress_bytes | Rate of outgoing network traffic to node 
(bytes/sec) | +| node_egress_bytes_max | Highest value of the rate of outgoing network traffic to node (bytes/sec) | +| node_egress_bytes_median | Average value of the rate of outgoing network traffic to node (bytes/sec) | +| node_egress_bytes_min | Lowest value of the rate of outgoing network traffic to node (bytes/sec) | +| node_ephemeral_storage_avail | Disk space available to RLEC processes on configured ephemeral disk (bytes) | +| node_ephemeral_storage_free | Free disk space on configured ephemeral disk (bytes) | +| node_free_memory | Free memory in the node (bytes) | +| node_ingress_bytes | Rate of incoming network traffic to node (bytes/sec) | +| node_ingress_bytes_max | Highest value of the rate of incoming network traffic to node (bytes/sec) | +| node_ingress_bytes_median | Average value of the rate of incoming network traffic to node (bytes/sec) | +| node_ingress_bytes_min | Lowest value of the rate of incoming network traffic to node (bytes/sec) | +| node_persistent_storage_avail | Disk space available to RLEC processes on configured persistent disk (bytes) | +| node_persistent_storage_free | Free disk space on configured persistent disk (bytes) | +| node_provisional_flash | Amount of flash available for new shards on this node, taking into account overbooking, max Redis servers, reserved flash and provision and migration thresholds (bytes) | +| node_provisional_flash_no_overbooking | Amount of flash available for new shards on this node, without taking into account overbooking, max Redis servers, reserved flash and provision and migration thresholds (bytes) | +| node_provisional_memory | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases | +| node_provisional_memory_no_overbooking | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases, without taking into account overbooking | +| node_total_req | Request rate handled by endpoints on node 
(ops/sec) | +| node_up | Node is part of the cluster and is connected | + +## Cluster metrics + +| Metric | Description | +| ------ | :------ | +| cluster_shards_limit | Total shard limit by the license by shard type (ram / flash) | + + +## Proxy metrics + +| Metric | Description | +| ------ | :------ | +| listener_acc_latency | Accumulative latency (sum of the latencies) of all types of commands on the database. For the average latency, divide this value by listener_total_res | +| listener_acc_latency_max | Highest value of accumulative latency of all types of commands on the database | +| listener_acc_other_latency | Accumulative latency (sum of the latencies) of commands that are the type "other" on the database. For the average latency, divide this value by listener_other_res | +| listener_acc_other_latency_max | Highest value of accumulative latency of commands that are the type "other" on the database | +| listener_acc_read_latency | Accumulative latency (sum of the latencies) of commands that are the type "read" on the database. For the average latency, divide this value by listener_read_res | +| listener_acc_read_latency_max | Highest value of accumulative latency of commands that are the type "read" on the database | +| listener_acc_write_latency | Accumulative latency (sum of the latencies) of commands that are the type "write" on the database. 
For the average latency, divide this value by listener_write_res | +| listener_acc_write_latency_max | Highest value of accumulative latency of commands that are the type "write" on the database | +| listener_auth_cmds | Number of memcached AUTH commands sent to the database | +| listener_auth_cmds_max | Highest value of the number of memcached AUTH commands sent to the database | +| listener_auth_errors | Number of error responses to memcached AUTH commands | +| listener_auth_errors_max | Highest value of the number of error responses to memcached AUTH commands | +| listener_cmd_flush | Number of memcached FLUSH_ALL commands sent to the database | +| listener_cmd_flush_max | Highest value of the number of memcached FLUSH_ALL commands sent to the database | +| listener_cmd_get | Number of memcached GET commands sent to the database | +| listener_cmd_get_max | Highest value of the number of memcached GET commands sent to the database | +| listener_cmd_set | Number of memcached SET commands sent to the database | +| listener_cmd_set_max | Highest value of the number of memcached SET commands sent to the database | +| listener_cmd_touch | Number of memcached TOUCH commands sent to the database | +| listener_cmd_touch_max | Highest value of the number of memcached TOUCH commands sent to the database | +| listener_conns | Number of clients connected to the endpoint | +| listener_egress_bytes | Rate of outgoing network traffic to the endpoint (bytes/sec) | +| listener_egress_bytes_max | Highest value of the rate of outgoing network traffic to the endpoint (bytes/sec) | +| listener_ingress_bytes | Rate of incoming network traffic to the endpoint (bytes/sec) | +| listener_ingress_bytes_max | Highest value of the rate of incoming network traffic to the endpoint (bytes/sec) | +| listener_last_req_time | Time of last command sent to the database | +| listener_last_res_time | Time of last response sent from the database | +| listener_max_connections_exceeded | Number of times 
the number of clients connected to the database at the same time has exceeded the max limit |
+| listener_max_connections_exceeded_max | Highest value of the number of times the number of clients connected to the database at the same time has exceeded the max limit |
+| listener_monitor_sessions_count | Number of clients connected in monitor mode to the endpoint |
+| listener_other_req | Rate of other (non read/write) requests on the endpoint (ops/sec) |
+| listener_other_req_max | Highest value of the rate of other (non read/write) requests on the endpoint (ops/sec) |
+| listener_other_res | Rate of other (non read/write) responses on the endpoint (ops/sec) |
+| listener_other_res_max | Highest value of the rate of other (non read/write) responses on the endpoint (ops/sec) |
+| listener_other_started_res | Number of responses sent from the database of type "other" |
+| listener_other_started_res_max | Highest value of the number of responses sent from the database of type "other" |
+| listener_read_req | Rate of read requests on the endpoint (ops/sec) |
+| listener_read_req_max | Highest value of the rate of read requests on the endpoint (ops/sec) |
+| listener_read_res | Rate of read responses on the endpoint (ops/sec) |
+| listener_read_res_max | Highest value of the rate of read responses on the endpoint (ops/sec) |
+| listener_read_started_res | Number of responses sent from the database of type "read" |
+| listener_read_started_res_max | Highest value of the number of responses sent from the database of type "read" |
+| listener_total_connections_received | Rate of new client connections to the endpoint (connections/sec) |
+| listener_total_connections_received_max | Highest value of the rate of new client connections to the endpoint (connections/sec) |
+| listener_total_req | Request rate handled by the endpoint (ops/sec) |
+| listener_total_req_max | Highest value of the rate of all requests on the endpoint (ops/sec) |
+| listener_total_res | Rate of all 
responses on the endpoint (ops/sec) | +| listener_total_res_max | Highest value of the rate of all responses on the endpoint (ops/sec) | +| listener_total_started_res | Number of responses sent from the database of all types | +| listener_total_started_res_max | Highest value of the number of responses sent from the database of all types | +| listener_write_req | Rate of write requests on the endpoint (ops/sec) | +| listener_write_req_max | Highest value of the rate of write requests on the endpoint (ops/sec) | +| listener_write_res | Rate of write responses on the endpoint (ops/sec) | +| listener_write_res_max | Highest value of the rate of write responses on the endpoint (ops/sec) | +| listener_write_started_res | Number of responses sent from the database of type "write" | +| listener_write_started_res_max | Highest value of the number of responses sent from the database of type "write" | + +## Replication metrics + +| Metric | Description | +| ------ | :------ | +| bdb_replicaof_syncer_ingress_bytes | Rate of compressed incoming network traffic to a Replica Of database (bytes/sec) | +| bdb_replicaof_syncer_ingress_bytes_decompressed | Rate of decompressed incoming network traffic to a Replica Of database (bytes/sec) | +| bdb_replicaof_syncer_local_ingress_lag_time | Lag time between the source and the destination for Replica Of traffic (ms) | +| bdb_replicaof_syncer_status | Syncer status for Replica Of traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | +| bdb_crdt_syncer_ingress_bytes | Rate of compressed incoming network traffic to CRDB (bytes/sec) | +| bdb_crdt_syncer_ingress_bytes_decompressed | Rate of decompressed incoming network traffic to CRDB (bytes/sec) | +| bdb_crdt_syncer_local_ingress_lag_time | Lag time between the source and the destination (ms) for CRDB traffic | +| bdb_crdt_syncer_status | Syncer status for CRDB traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | + +## Shard metrics + +| Metric | Description | +| ------ | :------ | +| 
redis_active_defrag_running | Automatic memory defragmentation current aggressiveness (% cpu) |
+| redis_allocator_active | Total used memory, including external fragmentation |
+| redis_allocator_allocated | Total allocated memory |
+| redis_allocator_resident | Total resident memory (RSS) |
+| redis_aof_last_cow_size | Last AOFR, CopyOnWrite memory |
+| redis_aof_rewrite_in_progress | The number of simultaneous AOF rewrites that are in progress |
+| redis_aof_rewrites | Number of AOF rewrites this process executed |
+| redis_aof_delayed_fsync | Number of times an AOF fsync caused delays in the Redis main thread (inducing latency); this can indicate that the disk is slow or overloaded |
+| redis_blocked_clients | Count the clients waiting on a blocking call |
+| redis_connected_clients | Number of client connections to the specific shard |
+| redis_connected_slaves | Number of connected replicas |
+| redis_db0_avg_ttl | Average TTL of all volatile keys |
+| redis_db0_expires | Total count of volatile keys |
+| redis_db0_keys | Total key count |
+| redis_evicted_keys | Keys evicted so far (since restart) |
+| redis_expire_cycle_cpu_milliseconds | The cumulative amount of time spent on active expiry cycles |
+| redis_expired_keys | Keys expired so far (since restart) |
+| redis_forwarding_state | Shard forwarding state (on or off) |
+| redis_keys_trimmed | The number of keys that were trimmed in the current or last resharding process |
+| redis_keyspace_read_hits | Number of read operations accessing an existing keyspace |
+| redis_keyspace_read_misses | Number of read operations accessing a non-existing keyspace |
+| redis_keyspace_write_hits | Number of write operations accessing an existing keyspace |
+| redis_keyspace_write_misses | Number of write operations accessing a non-existing keyspace |
+| redis_master_link_status | Indicates if the replica is connected to its master |
+| redis_master_repl_offset | Number of bytes sent to replicas by the shard; 
calculate the throughput for a time period by comparing the value at different times | +| redis_master_sync_in_progress | The primary shard is synchronizing (1 = true; 0 = false) | +| redis_max_process_mem | Current memory limit configured by redis_mgr according to node free memory | +| redis_maxmemory | Current memory limit configured by redis_mgr according to database memory limits | +| redis_mem_aof_buffer | Current size of AOF buffer | +| redis_mem_clients_normal | Current memory used for input and output buffers of non-replica clients | +| redis_mem_clients_slaves | Current memory used for input and output buffers of replica clients | +| redis_mem_fragmentation_ratio | Memory fragmentation ratio (1.3 means 30% overhead) | +| redis_mem_not_counted_for_evict | Portion of used_memory (in bytes) that's not counted for eviction and OOM error | +| redis_mem_replication_backlog | Size of replication backlog | +| redis_module_fork_in_progress | A binary value that indicates if there is an active fork spawned by a module (1) or not (0) | +| redis_process_cpu_system_seconds_total | Shard process system CPU time spent in seconds | +| redis_process_cpu_usage_percent | Shard process CPU usage percentage | +| redis_process_cpu_user_seconds_total | Shard process user CPU time spent in seconds | +| redis_process_main_thread_cpu_system_seconds_total | Shard main thread system CPU time spent in seconds | +| redis_process_main_thread_cpu_user_seconds_total | Shard main thread user CPU time spent in seconds | +| redis_process_max_fds | Shard maximum number of open file descriptors | +| redis_process_open_fds | Shard number of open file descriptors | +| redis_process_resident_memory_bytes | Shard resident memory size in bytes | +| redis_process_start_time_seconds | Shard start time of the process since the Unix epoch in seconds | +| redis_process_virtual_memory_bytes | Shard virtual memory in bytes | +| redis_rdb_bgsave_in_progress | Indicates whether a bgsave is currently in progress | +| 
redis_rdb_last_cow_size | Last bgsave (or SYNC fork) used CopyOnWrite memory | +| redis_rdb_saves | Total count of bgsaves since process was restarted (including replica fullsync and persistence) | +| redis_repl_touch_bytes | Number of bytes sent to replicas as TOUCH commands by the shard as a result of a READ command that was processed; calculate the throughput for a time period by comparing the value at different times | +| redis_total_commands_processed | Number of commands processed by the shard; calculate the number of commands for a time period by comparing the value at different times | +| redis_total_connections_received | Number of connections received by the shard; calculate the number of connections for a time period by comparing the value at different times | +| redis_total_net_input_bytes | Number of bytes received by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_total_net_output_bytes | Number of bytes sent by the shard; calculate the throughput for a time period by comparing the value at different times | +| redis_up | Shard is up and running | +| redis_used_memory | Memory used by shard (in BigRedis this includes flash) (bytes) | diff --git a/content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md b/content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md new file mode 100644 index 0000000000..6ced231dec --- /dev/null +++ b/content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md @@ -0,0 +1,172 @@ +--- +LinkTitle: Prometheus and Grafana +Title: Prometheus and Grafana with Redis Enterprise Software +alwaysopen: false +categories: +- docs +- integrate +- rs +description: Use Prometheus and Grafana to collect and visualize Redis Enterprise Software metrics. +group: observability +summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise + Software metrics. 
+type: integration +weight: 5 +--- + +You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. + +Metrics are exposed at the cluster, node, database, shard, and proxy levels. + + +- [Prometheus](https://prometheus.io/) is an open source systems monitoring and alerting toolkit that aggregates metrics from different sources. +- [Grafana](https://grafana.com/) is an open source metrics visualization tool that processes Prometheus data. + +You can use Prometheus and Grafana to: +- Collect and display metrics not available in the [admin console]({{< relref "/operate/rs/references/metrics" >}}) + +- Set up automatic alerts for node or cluster events + +- Display Redis Enterprise Software metrics alongside data from other systems + +{{Graphic showing how Prometheus and Grafana collect and display data from a Redis Enterprise Cluster. Prometheus collects metrics from the Redis Enterprise cluster, and Grafana queries those metrics for visualization.}} + +In each cluster, the `metrics_exporter` process exposes Prometheus metrics on port 8070. +Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://<cluster_name>:8070/v2`. + +## Quick start + +To get started with Prometheus and Grafana: + +1. Create a directory called `prometheus` on your local machine. + +1. Within that directory, create a configuration file called `prometheus.yml`. +1. Add the following contents to the configuration file and replace `<cluster_name>` with your Redis Enterprise cluster's FQDN: + + {{< note >}} + +We recommend running Prometheus in Docker only for development and testing. + + {{< /note >}} + + ```yml + global: + scrape_interval: 15s + evaluation_interval: 15s + + # Attach these labels to any time series or alerts when communicating with + # external systems (federation, remote storage, Alertmanager). 
+ external_labels: + monitor: "prometheus-stack-monitor" + + # Load and evaluate rules in this file every 'evaluation_interval' seconds. + #rule_files: + # - "first.rules" + # - "second.rules" + + scrape_configs: + # scrape Prometheus itself + - job_name: prometheus + scrape_interval: 10s + scrape_timeout: 5s + static_configs: + - targets: ["localhost:9090"] + + # scrape Redis Enterprise + - job_name: redis-enterprise + scrape_interval: 30s + scrape_timeout: 30s + metrics_path: / + scheme: https + tls_config: + insecure_skip_verify: true + static_configs: + - targets: ["<cluster_name>:8070"] # For v2 metrics, set metrics_path: /v2 instead + ``` + +1. Set up your Prometheus and Grafana servers. + To set up Prometheus and Grafana on Docker: + 1. Create a _docker-compose.yml_ file: + + ```yml + version: '3' + services: + prometheus-server: + image: prom/prometheus + ports: + - 9090:9090 + volumes: + - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml + + grafana-ui: + image: grafana/grafana + ports: + - 3000:3000 + environment: + - GF_SECURITY_ADMIN_PASSWORD=secret + links: + - prometheus-server:prometheus + ``` + + 1. To start the containers, run: + + ```sh + $ docker compose up -d + ``` + + 1. To check that all of the containers are up, run: `docker ps` + 1. In your browser, open Prometheus at http://localhost:9090 to make sure the server is running. + 1. Select **Status** and then **Targets** to check that Prometheus is collecting data from your Redis Enterprise cluster. + + {{The Redis Enterprise target showing that Prometheus is connected to the Redis Enterprise Cluster.}} + + If Prometheus is connected to the cluster, you can type **node_up** in the Expression field on the Prometheus home page to see the cluster metrics. + +1. Configure the Grafana datasource: + 1. Sign in to Grafana. If you installed Grafana locally, go to http://localhost:3000 and sign in with: + + - Username: admin + - Password: secret + + 1. In the Grafana configuration menu, select **Data Sources**. + + 1. 
Select **Add data source**. + + 1. Select **Prometheus** from the list of data source types. + + {{The Prometheus data source in the list of data sources on Grafana.}} + + 1. Enter the Prometheus configuration information: + + - Name: `redis-enterprise` + - URL: `http://<prometheus-address>:9090` + + {{The Prometheus connection form in Grafana.}} + + {{< note >}} + +- If the network port is not accessible to the Grafana server, select the **Browser** option from the Access menu. +- In a testing environment, you can select **Skip TLS verification**. + + {{< /note >}} + +1. Add dashboards for cluster, database, node, and shard metrics. + To add preconfigured dashboards: + 1. In the Grafana dashboards menu, select **Manage**. + 1. Click **Import**. + 1. Upload one or more [Grafana dashboards](#grafana-dashboards-for-redis-enterprise). + +### Grafana dashboards for Redis Enterprise + +Redis publishes five preconfigured dashboards for Redis Enterprise and Grafana: + +* The [cluster status dashboard](https://grafana.com/grafana/dashboards/18405-cluster-status-dashboard/) provides an overview of your Redis Enterprise clusters. +* The [database status dashboard](https://grafana.com/grafana/dashboards/18408-database-status-dashboard/) displays specific database metrics, including latency, memory usage, ops/second, and key count. +* The [node metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-node-dashboard_v9-11.json) provides metrics for each of the nodes hosting your cluster. 
+* The [shard metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-shard-dashboard_v9-11.json) displays metrics for the individual Redis processes running on your cluster nodes. +* The [Active-Active dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-active-active-dashboard_v9-11.json) displays metrics specific to [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}}). + +These dashboards are open source. For additional dashboard options, or to file an issue, see the [Redis Enterprise observability GitHub repository](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana). + +For more information about configuring Grafana dashboards, see the [Grafana documentation](https://grafana.com/docs/). + From c60979f641d71a0cd1dd8e82d3344fcf430b0e03 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 28 Feb 2025 16:51:39 -0500 Subject: [PATCH 2/9] DOC-4800 Additional monitoring reorg --- .../rs-observability.md} | 26 +- .../rs-prometheus-grafana-quickstart.md} | 16 - .../rs-prometheus-metrics-v1.md} | 18 - .../_index.md | 157 +---- .../observability.md | 619 +----------------- .../prometheus-metrics-v1.md | 265 +------- content/operate/rs/monitoring/_index.md | 77 +-- .../db-availability.md | 7 +- .../rs/monitoring/metrics_stream_engine.md | 22 + .../metrics_stream_engine/_index.md | 15 - .../prometheus_and_grafana.md | 173 ----- .../operate/rs/monitoring/observability.md | 17 + .../rs/monitoring/prometheus_and_grafana.md | 18 + .../operate/rs/monitoring/v1_monitoring.md | 87 +++ .../rs/monitoring/v1_monitoring/_index.md | 15 - .../operate/rs/references/metrics/_index.md | 21 +- .../metrics}/prometheus-metrics-v1-to-v2.md | 4 +- .../metrics/prometheus-metrics-v1.md | 21 + 
.../metrics}/prometheus-metrics-v2.md | 4 +- .../rs-7-8-releases/rs-7-8-2-34.md | 2 +- 20 files changed, 214 insertions(+), 1370 deletions(-) rename content/{operate/rs/monitoring/v1_monitoring/observability.md => embeds/rs-observability.md} (97%) rename content/{operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md => embeds/rs-prometheus-grafana-quickstart.md} (94%) rename content/{operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md => embeds/rs-prometheus-metrics-v1.md} (96%) rename content/operate/rs/{databases/durability-ha => monitoring}/db-availability.md (94%) create mode 100644 content/operate/rs/monitoring/metrics_stream_engine.md delete mode 100644 content/operate/rs/monitoring/metrics_stream_engine/_index.md delete mode 100644 content/operate/rs/monitoring/metrics_stream_engine/prometheus_and_grafana.md create mode 100644 content/operate/rs/monitoring/observability.md create mode 100644 content/operate/rs/monitoring/prometheus_and_grafana.md create mode 100644 content/operate/rs/monitoring/v1_monitoring.md delete mode 100644 content/operate/rs/monitoring/v1_monitoring/_index.md rename content/operate/rs/{monitoring/metrics_stream_engine => references/metrics}/prometheus-metrics-v1-to-v2.md (81%) create mode 100644 content/operate/rs/references/metrics/prometheus-metrics-v1.md rename content/operate/rs/{monitoring/metrics_stream_engine => references/metrics}/prometheus-metrics-v2.md (79%) diff --git a/content/operate/rs/monitoring/v1_monitoring/observability.md b/content/embeds/rs-observability.md similarity index 97% rename from content/operate/rs/monitoring/v1_monitoring/observability.md rename to content/embeds/rs-observability.md index 91b55460bb..965b92859d 100644 --- a/content/operate/rs/monitoring/v1_monitoring/observability.md +++ b/content/embeds/rs-observability.md @@ -1,17 +1,3 @@ ---- -Title: Redis Enterprise Software observability and monitoring guidance -alwaysopen: false -categories: -- docs -- integrate -- rs 
-description: Using monitoring and observability with Redis Enterprise -group: observability -linkTitle: Observability and monitoring -summary: Observe Redis Enterprise resources and database perfomance indicators. -type: integration -weight: 45 ---- ## Introduction @@ -47,7 +33,7 @@ In addition to manually monitoring these resources and indicators, it is best pr ## Core cluster resource monitoring -Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at https://:8070/v2. This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. +Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://<cluster_name>:8070/v2`. This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. See [Monitoring with metrics and alerts]({{}}) for more details. @@ -135,7 +121,7 @@ An acceptable rate of key evictions depends on the total number of keys in the d and the measure of application-level latency. If application latency is high, check to see that key evictions have not increased. 
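Both indicators can be watched with short PromQL queries over the shard metrics listed in this guide. The following is a sketch that assumes the v1 shard metric names `redis_evicted_keys`, `redis_keyspace_read_hits`, and `redis_keyspace_read_misses`; adjust the five-minute range window to suit your scrape interval:

```
# Keys evicted per second, per shard, averaged over the last 5 minutes
rate(redis_evicted_keys[5m])

# Cache hit ratio per shard over the last 5 minutes
rate(redis_keyspace_read_hits[5m])
  / (rate(redis_keyspace_read_hits[5m]) + rate(redis_keyspace_read_misses[5m]))
```

A falling hit ratio together with a rising eviction rate is the usual signal that the working set no longer fits in the configured memory limit.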
-### Eviction Policies +### Eviction policies | Name | Description | | ------ | :------ | @@ -376,11 +362,11 @@ See [eviction policy]({{< relref "/operate/rs/databases/memory-performance/evict Dashboard displaying object evictions - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json) {{< image filename="/images/playbook_eviction-expiration.png" alt="Dashboard displaying object evictions">}} -## Proxy Performance +## Proxy performance -Redis Enterprise Software (RS) provides high-performance data access through a proxy process that manages and optimizes access to shards within the RS cluster. Each node contains a single proxy process. Each proxy can be active and take incoming traffic or it can be passive and wait for failovers. +Redis Enterprise Software provides high-performance data access through a proxy process that manages and optimizes access to shards within the cluster. Each node contains a single proxy process. Each proxy can be active and take incoming traffic or it can be passive and wait for failovers. -### Proxy Policies +### Proxy policies | Policy | Description | @@ -535,7 +521,7 @@ To use these alerts, install [Prometheus Alertmanager](https://prometheus.io/doc For a comprehensive guide to alerting with Prometheus and Grafana, see the [Grafana blog post on the subject](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/). -## Configuring Prometheus +## Configure Prometheus To configure Prometheus for alerting, open the `prometheus.yml` configuration file. 
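For example, a minimal rules file loaded through the `rule_files` section of `prometheus.yml` might look like the following sketch. It uses the `node_up` metric mentioned earlier in this guide; the alert name, five-minute hold period, and severity label are illustrative assumptions rather than required values:

```yml
# alert.rules: reference this file from prometheus.yml, for example:
#   rule_files:
#     - "alert.rules"
groups:
  - name: redis-enterprise-alerts
    rules:
      - alert: RedisEnterpriseNodeDown
        # Fire if a node reports down for five consecutive minutes
        expr: node_up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Redis Enterprise node has been down for 5 minutes"
```

After editing the rules file, reload or restart Prometheus so the new rules take effect.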
diff --git a/content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md b/content/embeds/rs-prometheus-grafana-quickstart.md similarity index 94% rename from content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md rename to content/embeds/rs-prometheus-grafana-quickstart.md index 6ced231dec..d7918dc5c9 100644 --- a/content/operate/rs/monitoring/v1_monitoring/prometheus_and_grafana.md +++ b/content/embeds/rs-prometheus-grafana-quickstart.md @@ -1,18 +1,3 @@ ---- -LinkTitle: Prometheus and Grafana -Title: Prometheus and Grafana with Redis Enterprise Software -alwaysopen: false -categories: -- docs -- integrate -- rs -description: Use Prometheus and Grafana to collect and visualize Redis Enterprise Software metrics. -group: observability -summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise - Software metrics. -type: integration -weight: 5 ---- You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. @@ -169,4 +154,3 @@ Redis publishes four preconfigured dashboards for Redis Enterprise and Grafana: These dashboards are open source. For additional dashboard options, or to file an issue, see the [Redis Enterprise observability Github repository](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana). For more information about configuring Grafana dashboards, see the [Grafana documentation](https://grafana.com/docs/). 
- diff --git a/content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md b/content/embeds/rs-prometheus-metrics-v1.md similarity index 96% rename from content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md rename to content/embeds/rs-prometheus-metrics-v1.md index 424fbf3844..9baabd4c34 100644 --- a/content/operate/rs/monitoring/v1_monitoring/prometheus-metrics-v1.md +++ b/content/embeds/rs-prometheus-metrics-v1.md @@ -1,21 +1,3 @@ ---- -Title: Prometheus metrics v1 -alwaysopen: false -categories: -- docs -- integrate -- rs -description: V1 metrics available to Prometheus. -group: observability -linkTitle: Prometheus metrics v1 -summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. -type: integration -weight: 48 ---- - -You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. - -As of Redis Enterprise Software version 7.8.2, v1 metrics are deprecated but still available. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). The following tables include the v1 metrics available to Prometheus. diff --git a/content/integrate/prometheus-with-redis-enterprise/_index.md b/content/integrate/prometheus-with-redis-enterprise/_index.md index 48fa630884..f0594d8d62 100644 --- a/content/integrate/prometheus-with-redis-enterprise/_index.md +++ b/content/integrate/prometheus-with-redis-enterprise/_index.md @@ -15,159 +15,4 @@ weight: 5 tocEmbedHeaders: true --- -You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. - -Metrics are exposed at the cluster, node, database, shard, and proxy levels. - - -- [Prometheus](https://prometheus.io/) is an open source systems monitoring and alerting toolkit that aggregates metrics from different sources. 
-- [Grafana](https://grafana.com/) is an open source metrics visualization tool that processes Prometheus data. - -You can use Prometheus and Grafana to: -- Collect and display metrics not available in the [admin console]({{< relref "/operate/rs/references/metrics" >}}) - -- Set up automatic alerts for node or cluster events - -- Display Redis Enterprise Software metrics alongside data from other systems - -{{Graphic showing how Prometheus and Grafana collect and display data from a Redis Enterprise Cluster. Prometheus collects metrics from the Redis Enterprise cluster, and Grafana queries those metrics for visualization.}} - -In each cluster, the metrics_exporter process exposes Prometheus metrics on port 8070. -Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. - -## Quick start - -To get started with Prometheus and Grafana: - -1. Create a directory called 'prometheus' on your local machine. - -1. Within that directory, create a configuration file called `prometheus.yml`. -1. Add the following contents to the configuration file and replace `` with your Redis Enterprise cluster's FQDN: - - {{< note >}} - -We recommend running Prometheus in Docker only for development and testing. - - {{< /note >}} - - ```yml - global: - scrape_interval: 15s - evaluation_interval: 15s - - # Attach these labels to any time series or alerts when communicating with - # external systems (federation, remote storage, Alertmanager). - external_labels: - monitor: "prometheus-stack-monitor" - - # Load and evaluate rules in this file every 'evaluation_interval' seconds. 
- #rule_files: - # - "first.rules" - # - "second.rules" - - scrape_configs: - # scrape Prometheus itself - - job_name: prometheus - scrape_interval: 10s - scrape_timeout: 5s - static_configs: - - targets: ["localhost:9090"] - - # scrape Redis Enterprise - - job_name: redis-enterprise - scrape_interval: 30s - scrape_timeout: 30s - metrics_path: / - scheme: https - tls_config: - insecure_skip_verify: true - static_configs: - - targets: [":8070"] # For v2, use [":8070/v2"] - ``` - -1. Set up your Prometheus and Grafana servers. - To set up Prometheus and Grafana on Docker: - 1. Create a _docker-compose.yml_ file: - - ```yml - version: '3' - services: - prometheus-server: - image: prom/prometheus - ports: - - 9090:9090 - volumes: - - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml - - grafana-ui: - image: grafana/grafana - ports: - - 3000:3000 - environment: - - GF_SECURITY_ADMIN_PASSWORD=secret - links: - - prometheus-server:prometheus - ``` - - 1. To start the containers, run: - - ```sh - $ docker compose up -d - ``` - - 1. To check that all of the containers are up, run: `docker ps` - 1. In your browser, sign in to Prometheus at http://localhost:9090 to make sure the server is running. - 1. Select **Status** and then **Targets** to check that Prometheus is collecting data from your Redis Enterprise cluster. - - {{The Redis Enterprise target showing that Prometheus is connected to the Redis Enterprise Cluster.}} - - If Prometheus is connected to the cluster, you can type **node_up** in the Expression field on the Prometheus home page to see the cluster metrics. - -1. Configure the Grafana datasource: - 1. Sign in to Grafana. If you installed Grafana locally, go to http://localhost:3000 and sign in with: - - - Username: admin - - Password: secret - - 1. In the Grafana configuration menu, select **Data Sources**. - - 1. Select **Add data source**. - - 1. Select **Prometheus** from the list of data source types. 
- - {{The Prometheus data source in the list of data sources on Grafana.}} - - 1. Enter the Prometheus configuration information: - - - Name: `redis-enterprise` - - URL: `http://:9090` - - {{The Prometheus connection form in Grafana.}} - - {{< note >}} - -- If the network port is not accessible to the Grafana server, select the **Browser** option from the Access menu. -- In a testing environment, you can select **Skip TLS verification**. - - {{< /note >}} - -1. Add dashboards for cluster, database, node, and shard metrics. - To add preconfigured dashboards: - 1. In the Grafana dashboards menu, select **Manage**. - 1. Click **Import**. - 1. Upload one or more [Grafana dashboards](#grafana-dashboards-for-redis-enterprise). - -### Grafana dashboards for Redis Enterprise - -Redis publishes four preconfigured dashboards for Redis Enterprise and Grafana: - -* The [cluster status dashboard](https://grafana.com/grafana/dashboards/18405-cluster-status-dashboard/) provides an overview of your Redis Enterprise clusters. -* The [database status dashboard](https://grafana.com/grafana/dashboards/18408-database-status-dashboard/) displays specific database metrics, including latency, memory usage, ops/second, and key count. -* The [node metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-node-dashboard_v9-11.json) provides metrics for each of the nodes hosting your cluster. 
-* The [shard metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-shard-dashboard_v9-11.json) displays metrics for the individual Redis processes running on your cluster nodes -* The [Active-Active dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-active-active-dashboard_v9-11.json) displays metrics specific to [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}}). - -These dashboards are open source. For additional dashboard options, or to file an issue, see the [Redis Enterprise observability Github repository](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana). - -For more information about configuring Grafana dashboards, see the [Grafana documentation](https://grafana.com/docs/). - +{{}} diff --git a/content/integrate/prometheus-with-redis-enterprise/observability.md b/content/integrate/prometheus-with-redis-enterprise/observability.md index 91b55460bb..6f7e0ba281 100644 --- a/content/integrate/prometheus-with-redis-enterprise/observability.md +++ b/content/integrate/prometheus-with-redis-enterprise/observability.md @@ -11,622 +11,7 @@ linkTitle: Observability and monitoring summary: Observe Redis Enterprise resources and database perfomance indicators. type: integration weight: 45 +tocEmbedHeaders: true --- -## Introduction - -This document provides observability and monitoring guidance for developers running applications -that connect to Redis Enterprise. In particular, this guide focuses on the systems -and resources that are most likely to impact the performance of your application. 
- -The screenshot below shows a dashboard with relevant statistics for a node: -{{< image filename="/images/node_summary.png" alt="Dashboard showing relevant statistics for a Node" >}} - -To effectively monitor a Redis Enterprise cluster you need to observe -core cluster resources and key database performance indicators as described in the following sections for this guide. - -Core cluster resources include: - -* Memory utilization -* CPU utilization -* Database connections -* Network traffic -* Synchronization - -Key database performance indicators include: - -* Latency -* Cache hit rate -* Key eviction rate -* Proxy Performance - -Dashboard showing an overview of cluster metrics: -{{< image filename="/images/cluster_overview.png" alt="Dashboard showing an overview of cluster metrics" >}} - -In addition to manually monitoring these resources and indicators, it is best practice to set up alerts. - -## Core cluster resource monitoring - -Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at https://:8070/v2. This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. - -The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. See [Monitoring with metrics and alerts]({{}}) for more details. - -If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. You can scrape both existing and new endpoints simultaneously, which lets you create advanced dashboards and transition smoothly. - -### Memory - -Every Redis Enterprise database has a maximum configured memory limit to ensure isolation -in a multi-database cluster. 
- -| Metric name | Definition | Unit | -| ------ | ------ | :------ | -| Memory usage percentage metric | Percentage of used memory relative to the configured memory limit for a given database | Percentage | - -Dashboard displaying high-level cluster metrics - [Cluster Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/cluster_dashboard_v9-11.json) -{{< image filename="/images/playbook_used-memory.png" alt="Dashboard displaying high-level cluster metrics" >}} - -### Thresholds - -The appropriate memory threshold depends on how the application is using Redis. - -* Caching workloads, which permit Redis to evict keys, can safely use 100% of available memory. -* Non-caching workloads do not permit key eviction and should be closely monitored as soon as memory usage reaches 80%. - -### Caching workloads - -For applications using Redis solely as a cache, you can safely let the memory usage -reach 100% as long as you have an [eviction policy](https://redis.io/blog/cache-eviction-strategies/) in place. This will ensure -that Redis can evict keys while continuing to accept new writes. - -**Note:** Eviction will increase write command latency as Redis has to cleanup the memory/objects before accepting a new write to prevent OOM when memory usage is at 100%. - -While your Redis database is using 100% of available memory in a caching context, -it's still important to monitor performance. The key performance indicators include: - -* Latency -* Cache hit ratio -* Evicted keys - -### Read latency - -**Latency** has two important definitions, depending on context: - -* In the context of Redis itself, latency is **the time it takes for Redis -to respond to a request**. The [Latency](#latency) section below provides a broader discussion of this metric. - - - -* In the context of your application, Latency is **the time it takes for the application -to process a request**. 
This will include the time it takes to execute both reads and writes -to Redis, as well as calls to other databases and services. Note that its possible for -Redis to report low latency while the application is experiencing high latency. -This may indicate a low cache hit ratio, ultimately caused by insufficient memory. - -You need to monitor both application-level and Redis-level latency to diagnose -caching performance issues in production. - -### Cache hit ratio and eviction - -**Cache hit ratio** is the percentage of read requests that Redis serves successfully. -**Eviction rate** is the rate at which Redis evicts keys from the cache. These metrics -are sometimes inversely correlated: a high eviction rate may cause a low cache hit ratio if too many frequently-used keys are being evicted. - -If the Redis server is empty, the hit ratio will be 0%. As the application runs and the fills the cache, -the hit ratio will increase. - -**When the entire cached working set fits in memory**, the cache hit ratio will reach close to 100% -while the percent of used memory will remain below 100%. - -**When the working set cannot fit in memory**, the eviction policy will start to evict keys. -It is important to choose a policy that generally evicts rarely-used keys to keep the cache hit ratio as high as possible. - -In both cases, keys will may be manually invalidated by the application or evicted through -the uses of TTLs (time-to-live) and an eviction policy. - -The ideal cache hit ratio depends on the application, but generally, the ratio should be greater than 50%. -Low hit ratios coupled with high numbers of object evictions may indicate that your cache is too small. -This can cause thrashing on the application side, a scenario where the cache is constantly being invalidated. 
This means that when your Redis database is using 100% of available memory, you need to measure the rate of [key evictions]({{< relref "/operate/rs/references/metrics/database-operations#evicted-objectssec" >}}).

An acceptable rate of key evictions depends on the total number of keys in the database and the measure of application-level latency. If application latency is high, check to see that key evictions have not increased.

### Eviction policies

| Name | Description |
| ------ | :------ |
|noeviction | New values aren't saved when memory limit is reached. When a database uses replication, this applies to the primary database |
|allkeys-lru | Keeps most recently used keys; removes least recently used (LRU) keys |
|allkeys-lfu | Keeps frequently used keys; removes least frequently used (LFU) keys |
|volatile-lru | Removes least recently used keys with the expire field set to true |
|volatile-lfu | Removes least frequently used keys with the expire field set to true |
|allkeys-random | Randomly removes keys to make space for the new data added |
|volatile-random | Randomly removes keys with the expire field set to true |
|volatile-ttl | Removes keys with the expire field set to true and the shortest remaining time-to-live (TTL) value |

### Eviction policy guidelines

* Use the allkeys-lru policy when you expect a power-law distribution in the popularity of your requests. That is, you expect a subset of elements will be accessed far more often than the rest. This is a good policy to choose if you are unsure.

* Use the allkeys-random policy if you have a cyclic access pattern where all the keys are scanned continuously, or when you expect the distribution to be uniform.

* Use the volatile-ttl policy if you want to provide hints to Redis about which keys are good candidates for expiration, by using different TTL values when you create your cache objects.
The volatile-lru and volatile-random policies are mainly useful when you want to use a single instance for both caching and for a set of persistent keys. However, it is usually a better idea to run two Redis instances to solve such a problem.

**Note:** Setting an expire value on a key costs memory, so using a policy like allkeys-lru is more memory efficient because keys don't need an expire configuration to be evicted under memory pressure.

### Non-caching workloads

If no eviction policy is enabled, then Redis will stop accepting writes when memory usage reaches 100%. Therefore, for non-caching workloads, it is best practice to configure an alert at 80% memory usage. After your database reaches this 80% threshold, you should closely review the rate of memory usage growth.

### Troubleshooting

|Issue | Possible causes | Remediation |
| ------ | ------ | :------ |
|Redis memory usage has reached 100% |This may indicate an insufficient Redis memory limit for your application's workload | For non-caching workloads (where eviction is unacceptable), immediately increase the memory limit for the database. You can accomplish this through the Redis Enterprise console or its API. Alternatively, you can contact Redis support to assist. For caching workloads, you need to monitor performance closely. Confirm that you have an [eviction policy]({{< relref "/operate/rs/databases/memory-performance/eviction-policy" >}}) in place. If your application's performance starts to degrade, you may need to increase the memory limit, as described above. |
|Redis has stopped accepting writes | Memory is at 100% and no eviction policy is in place | Increase the database's total amount of memory. If this is for a caching workload, consider enabling an [eviction policy]({{< relref "/operate/rs/databases/memory-performance/eviction-policy" >}}).
In addition, you may want to determine whether the application can set a reasonable TTL (time-to-live) on some or all of the data being written to Redis. |
|Cache hit ratio is steadily decreasing | The application's working set size may be steadily increasing. Alternatively, the application may be misconfigured (for example, generating more than one unique cache key per cached item). | If the working set size is increasing, consider increasing the memory limit for the database. If the application is misconfigured, review the application's cache key generation logic. |

## CPU

Redis Enterprise provides several CPU metrics:

| Metric name | Definition | Unit |
| ------ | ------ | :------ |
| Shard CPU | Portion of CPU time spent by database shards, as a percentage | up to 100% per shard |
| Proxy CPU | Portion of CPU time spent by the cluster's proxies, as a percentage | 100% per proxy thread |
| Node CPU (User and System) | Portion of CPU time spent by all user-space and kernel-level processes, as a percentage | 100% per node CPU |

To understand CPU metrics, it's worth recalling how a Redis Enterprise cluster is organized. A cluster consists of one or more nodes. Each node is a VM (or cloud compute instance) or a bare-metal server.

A database is a set of processes, known as shards, deployed across the nodes of a cluster.

In the dashboard, shard CPU is the CPU utilization of the processes that make up the database. When diagnosing performance issues, start by looking at shard CPU.

Dashboard displaying CPU usage - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_database-cpu-shard.png" alt="Dashboard displaying CPU usage" >}}

### Thresholds

In general, we define high CPU as any CPU utilization above 80% of total capacity.

Shard CPU should remain below 80%.
Shards are single-threaded, so a shard CPU of 100% means that the shard is fully utilized.

Display showing Proxy CPU usage - [Proxy Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/proxy_dashboard_v9-11.json)
{{< image filename="/images/playbook_proxy-cpu-usage.png" alt="Display showing Proxy CPU usage" >}}

Proxy CPU should remain below 80% of total capacity. The proxy is a multi-threaded process that handles client connections and forwards requests to the appropriate shard. Because the total number of proxy threads is configurable, the proxy CPU may exceed 100%. A proxy configured with 6 threads can reach 600% CPU utilization, so in this case, keeping utilization below 80% means keeping the total proxy CPU usage below 480%.

Dashboard displaying an ensemble of Node CPU usage data - [Node Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/node_dashboard_v9-11.json)
{{< image filename="/images/node_cpu.png" alt="Dashboard displaying an ensemble of Node CPU usage data" >}}

Node CPU should also remain below 80% of total capacity. As with the proxy, the node CPU is variable depending on the CPU capacity of the node. You will need to calibrate your alerting based on the number of cores in your nodes.

### Troubleshooting

High CPU utilization has multiple possible causes. Common causes include an under-provisioned cluster, an excess of inefficient Redis operations, and hot master shards.

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |
|High CPU utilization across all shards of a database | This usually indicates that the database is under-provisioned in terms of number of shards. A secondary cause may be that the application is running too many inefficient Redis operations.
| You can detect slow Redis operations by enabling the slow log in the Redis Enterprise UI. First, rule out inefficient Redis operations as the cause of the high CPU utilization. The Latency section below includes a broader discussion of this metric in the context of your application. If inefficient Redis operations are not the cause, then increase the number of shards in the database. |
|High CPU utilization on a single shard, with the remaining shards having low CPU utilization | This usually indicates a master shard with at least one hot key. Hot keys are keys that are accessed extremely frequently (for example, more than 1000 times per second). | Hot key issues generally cannot be resolved by increasing the number of shards. To resolve this issue, see the section on Hot keys below. |
| High Proxy CPU | There are several possible causes of high proxy CPU. First, review the behavior of connections to the database. Frequent cycling of connections, especially when TLS is enabled, can cause high proxy CPU utilization. This is especially true when you see more than 100 connections per second per thread. Such behavior is almost always a sign of a misbehaving application. Review the total number of operations per second against the cluster. If you see more than 50k operations per second per thread, you may need to increase the number of proxy threads. | In the case of high connection cycling, review the application's connection behavior. In the case of high operations per second, [increase the number of proxy threads]({{< relref "/operate/rs/references/cli-utilities/rladmin/tune#tune-proxy" >}}). |
|High Node CPU | You will typically detect high shard or proxy CPU utilization before you detect high node CPU utilization. Use the remediation steps above to address high shard and proxy CPU utilization. If you still see high node CPU utilization after that, you may need to increase the number of nodes in the cluster.
| Consider increasing the number of nodes in the cluster and then rebalancing the shards across the new nodes. This is a complex operation and you should do it with the help of Redis support. |
|High System CPU | Most of the issues above will reflect user-space CPU utilization. However, if you see high system CPU utilization, this may indicate a problem at the network or storage level. | Review network bytes in and network bytes out to rule out any unexpected spikes in network traffic. You may need to perform some deeper network diagnostics to identify the cause of the high system CPU utilization. For example, with high rates of packet loss, you may need to review network configurations or even the network hardware. |

## Connections

The Redis Enterprise database dashboard indicates the total number of connections to the database.

You should monitor this connection count metric with both a minimum and maximum number of connections in mind. Based on the number of application instances connecting to Redis (and whether your application uses [connection pooling]({{< relref "/develop/clients/pools-and-muxing" >}})), you should have a rough idea of the minimum and maximum number of connections you expect to see for any given database. This number should remain relatively constant over time.

### Troubleshooting

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |
|Fewer connections to Redis than expected |The application may not be connecting to the correct Redis database. There may be a network partition between the application and the Redis database. | Confirm that the application can successfully connect to Redis. This may require consulting the application logs or the application's connection configuration. |
|Connection count continues to grow over time | Your application may not be releasing connections.
The most common cause of such a connection leak is a manually implemented connection pool or a connection pool that is not properly configured. | Review the application's connection configuration. |
|Erratic connection counts (for example, spikes and drops) | Application misbehavior ([thundering herds](https://en.wikipedia.org/wiki/Thundering_herd_problem), connection cycling, or networking issues) | Review the application logs and network traffic to determine the cause of the erratic connection counts. |

Dashboard displaying connections - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_database-used-connections.png" alt="Dashboard displaying connections" >}}

### Network ingress/egress

The network ingress/egress panel shows the amount of data being sent to and received from the database. Large spikes in network traffic can indicate that the cluster is under-provisioned or that the application is reading and/or writing unusually [large keys](#large-keys). A correlation between high network traffic and high CPU utilization may indicate a large key scenario.

#### Unbalanced database endpoint

One possible cause of network traffic spikes is that the database endpoint is not located on the same node as the master shards. In addition to added network latency, if data plane internode encryption is enabled, CPU consumption can increase as well.

One solution is to use the optimal shard placement and proxy policy to ensure endpoints are collocated on nodes hosting master shards. If you need to restore balance (for example, after node failure), you can manually fail over shards with the `rladmin` CLI tool.

Extreme network traffic utilization may approach the limits of the underlying network infrastructure.
In this case, the only remediation is to add more nodes to the cluster and scale the database's shards across them.

## Synchronization

In Redis Enterprise, geographically distributed synchronization is based on conflict-free replicated data type (CRDT) technology. The Redis Enterprise implementation of CRDT is called an Active-Active database (formerly known as CRDB). With Active-Active databases, applications can read and write to the same data set from different geographical locations seamlessly and with low latency, without changing the way the application connects to the database.

An Active-Active architecture is a data resiliency architecture that distributes the database information over multiple data centers using independent and geographically distributed clusters and nodes. It is a network of separate processing nodes, each having access to a common replicated database, such that all nodes can participate in a common application. This ensures local low latency, with each region able to run in isolation.

To achieve consistency between participating clusters, Redis Active-Active synchronization uses a process called the syncer.

The syncer keeps a replication backlog, which stores changes to the dataset that the syncer sends to other participating clusters. The syncer uses partial syncs to keep replicas up to date with changes, or a full sync in the event a replica or primary is lost.
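The partial-versus-full sync decision above hinges on whether a replica's position is still covered by the retained backlog. This toy model illustrates the idea only; it is not the actual syncer protocol, and `ReplicationBacklog` is a hypothetical name:

```python
from collections import deque

class ReplicationBacklog:
    """Toy model of a bounded log of recent changes (illustrative only)."""

    def __init__(self, capacity: int):
        self.changes = deque(maxlen=capacity)  # oldest entries fall off
        self.next_offset = 0                   # offset of the next change appended

    def append(self, change):
        self.changes.append(change)
        self.next_offset += 1

    def sync_from(self, replica_offset: int):
        """Partial sync if the replica is within the window, else full sync."""
        oldest_retained = self.next_offset - len(self.changes)
        if replica_offset < oldest_retained:
            return ("full", None)  # backlog no longer covers the replica's gap
        missed = list(self.changes)[replica_offset - oldest_retained:]
        return ("partial", missed)

backlog = ReplicationBacklog(capacity=3)
for change in ["set a", "set b", "set c", "set d"]:
    backlog.append(change)

assert backlog.sync_from(2) == ("partial", ["set c", "set d"])
assert backlog.sync_from(0) == ("full", None)  # "set a" was already discarded
```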
Dashboard displaying connection metrics between zones - [Synchronization Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/synchronization_dashboard_v9-11.json)
{{< image filename="/images/playbook_network-connectivity.png" alt="Dashboard displaying connection metrics between zones" >}}

CRDT provides three fundamental benefits over other geo-distributed solutions:

* It offers local latency on read and write operations, regardless of the number of geo-replicated regions and their distance from each other.
* It enables seamless conflict resolution (“conflict-free”) for simple and complex data types like those of Redis core.
* Even if most of the geo-replicated regions in a CRDT database (for example, 3 out of 5) are down, the remaining geo-replicated regions are uninterrupted and can continue to handle read and write operations, ensuring business continuity.

## Database performance indicators

There are several key performance indicators that report your database's performance against your application's workload:

* Latency
* Cache hit rate
* Key eviction rate

### Latency

Latency is **the time it takes for Redis to respond to a request**. Redis Enterprise measures latency from the first byte received by the proxy to the last byte sent in the command's response.

An adequately provisioned Redis database running efficient Redis operations will report an average latency below 1 millisecond. In fact, it's common to measure latency in terms of microseconds. Businesses regularly achieve, and sometimes require, average latencies of 400-600 microseconds.
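The Prometheus metrics express this average as accumulated latency divided by completed operations. The same arithmetic in plain Python (a sketch; the counter deltas below are made up for illustration, and the metric is only reported when there is traffic):

```python
def avg_latency_seconds(acc_latency_delta_us: float, responses_delta: float) -> float:
    """Average operation latency over an interval, in seconds.

    Mirrors the PromQL pattern
    sum(irate(endpoint_acc_latency[1m])) / sum(irate(endpoint_total_started_res[1m])) / 1e6:
    accumulated latency (microseconds) divided by operation count, converted to seconds.
    """
    if responses_delta == 0:
        return 0.0  # no traffic: the metric is not reported
    return acc_latency_delta_us / responses_delta / 1_000_000

# 500,000 us of accumulated latency across 1,000 responses is a 500 us average,
# inside the 400-600 microsecond range mentioned above.
assert avg_latency_seconds(500_000, 1_000) == 0.0005
```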
Dashboard display of latency metrics - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_database-cluster-latency.png" alt="Dashboard display of latency metrics" >}}

The metrics distinguish between read and write latency. Understanding whether high latency is due to reads or writes can help you to isolate the underlying issue.

Note that these latency metrics do not include network round trip time or application-level serialization, which is why it's essential to measure request latency at the application, as well.

Display showing a noticeable spike in latency
{{< image filename="/images/latency_spike.png" alt="Display showing a noticeable spike in latency" >}}

### Troubleshooting

Here are some possible causes of high database latency. Note that high database latency is just one of the reasons why application latency might be high. Application latency can be caused by a variety of factors, including a low cache hit rate.

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |
|Slow database operations | Confirm that there are no excessive slow operations in the Redis slow log. | If possible, reduce the number of slow operations being sent to the database.
If this is not possible, consider increasing the number of shards in the database. |
|Increased traffic to the database | Review the network traffic and the database operations per second chart to determine if increased traffic is causing the latency. | If the database is underprovisioned due to increased traffic, consider increasing the number of shards in the database. |
|Insufficient CPU | Check to see if the CPU utilization is increasing. | Confirm that slow operations are not causing the high CPU utilization. If the high CPU utilization is due to increased load, consider adding shards to the database. |

## Cache hit rate

**Cache hit rate** is the percentage of all read operations that successfully find the requested key. **Note:** Cache hit rate is a composite statistic that is computed by dividing the number of read hits by the total number of read operations. When an application tries to read a key that exists, this is known as a **cache hit**. Alternatively, when an application tries to read a key that does not exist, this is known as a **cache miss**.

For caching workloads, the cache hit rate should generally be above 50%, although the exact ideal cache hit rate can vary greatly depending on the application and on whether the cache is already populated.

Dashboard showing the cache hit ratio along with read/write misses - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_cache-hit.png" alt="Dashboard showing the cache hit ratio along with read/write misses" >}}

**Note:** Redis Enterprise actually reports four different cache hit / miss metrics.
These are defined as follows:

| Metric name | Definition |
| ------ | :------ |
| bdb_read_hits | The number of successful read operations |
| bdb_read_misses | The number of read operations returning null |
| bdb_write_hits | The number of write operations against existing keys |
| bdb_write_misses | The number of write operations that create new keys |

### Troubleshooting

Cache hit rate is usually only relevant for caching workloads. Eviction will begin after the database approaches its maximum memory capacity.

A high or increasing rate of evictions will negatively affect database latency, especially if the rate of necessary key evictions exceeds the rate of new key insertions.

See the [Cache hit ratio and eviction](#cache-hit-ratio-and-eviction) section for tips on troubleshooting cache hit rate.

## Key eviction rate

The **key eviction rate** is the rate at which keys are evicted from the database. See [eviction policy]({{< relref "/operate/rs/databases/memory-performance/eviction-policy" >}}) for a discussion of key eviction and its relationship with memory usage.

Dashboard displaying object evictions - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_eviction-expiration.png" alt="Dashboard displaying object evictions">}}

## Proxy Performance

Redis Enterprise Software (RS) provides high-performance data access through a proxy process that manages and optimizes access to shards within the RS cluster. Each node contains a single proxy process. Each proxy can be active and take incoming traffic or it can be passive and wait for failovers.

### Proxy Policies

| Policy | Description |
| ------ | :------ |
|Single | There is only a single proxy that is bound to the database.
This is the default database configuration and preferable in most use cases. |
|All Master Shards | There are multiple proxies that are bound to the database, one on each node that hosts a database master shard. This mode fits most use cases that require multiple proxies. |
|All Nodes | There are multiple proxies that are bound to the database, one on each node in the cluster, regardless of whether or not there is a shard from this database on the node. This mode should be used only in special cases, such as using a load balancer. |

Dashboard displaying proxy thread activity - [Proxy Thread Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/cloud/basic/redis-cloud-proxy-dashboard_v9-11.json)
{{< image filename="/images/proxy-thread-dashboard.png" alt="Dashboard displaying proxy thread activity" >}}

If you need to, you can tune the number of proxy threads using the [`rladmin tune proxy`]({{< relref "/operate/rs/references/cli-utilities/rladmin/tune#tune-proxy" >}}) command to make the proxy use more CPU cores. Cores used by the proxy won't be available for Redis; therefore, you need to take into account the number of Redis nodes on the host and the total number of available cores.

The command has a few parameters you can use to set a new number of proxy cores:

* `id|all` - you can either tune a specific proxy by its ID, or all proxies.

* `mode` - determines whether or not the proxy can automatically adjust the number of threads depending on load.

* `threads` and `max_threads` - determine the initial number of threads created on startup, and the maximum number of threads allowed.

* `scale_threshold` - determines the CPU utilization threshold that triggers spawning new threads. This CPU utilization level needs to be maintained for at least `scale_duration` seconds before automatic scaling is performed.
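To make the sizing guidance easy to apply, the proxy thread recommendations from the sizing table in this document can be encoded as a lookup. This is a sketch only; `recommended_proxy_threads` is a hypothetical helper, not part of `rladmin`:

```python
# (total cores) -> (threads for Redis, threads for Redis on Flash),
# values taken from the proxy thread sizing table in this document.
PROXY_THREADS = {
    1: (1, 1),
    4: (3, 3),
    8: (5, 3),
    12: (8, 4),
    16: (10, 5),
    32: (24, 10),
    64: (32, 20),
    96: (32, 20),
    128: (32, 32),
}

def recommended_proxy_threads(total_cores: int, on_flash: bool = False) -> int:
    """Return the recommended proxy thread count for a given node size."""
    ror, rof = PROXY_THREADS[total_cores]
    return rof if on_flash else ror

assert recommended_proxy_threads(16) == 10
assert recommended_proxy_threads(16, on_flash=True) == 5
```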
The following table indicates ideal proxy thread counts for the specified environments.

| Total Cores | Redis (ROR) | Redis on Flash (ROF) |
| ------ | ------ | :------ |
|1|1|1 |
|4|3|3 |
|8|5|3 |
|12|8|4 |
|16|10|5 |
|32|24|10 |
|64/96|32|20 |
|128|32|32 |

## Data access anti-patterns

There are three data access patterns that can limit the performance of your Redis database:

* Slow operations
* Hot keys
* Large keys

This section defines each of these patterns and describes how to diagnose and mitigate them.

## Slow operations

**Slow operations** are operations that take longer than a few milliseconds to complete.

Not all Redis operations are equally efficient. The most efficient Redis operations are O(1) operations; that is, they have a constant time complexity. Examples of such operations include [GET]({{< relref "/commands/get" >}}), [SET]({{< relref "/commands/set" >}}), [SADD]({{< relref "/commands/sadd" >}}), and [HSET]({{< relref "/commands/hset" >}}).

These constant time operations are unlikely to cause high CPU utilization. **Note:** Even so, it's still possible for a high rate of constant time operations to overwhelm an underprovisioned database.

Other Redis operations exhibit greater levels of time complexity. O(n) (linear time) operations are more likely to cause high CPU utilization. Examples include [HGETALL]({{< relref "/commands/hgetall" >}}), [SMEMBERS]({{< relref "/commands/smembers" >}}), and [LREM]({{< relref "/commands/lrem" >}}). These operations are not necessarily problematic, but they can be if executed against data structures holding a large number of elements (for example, a list with 1 million elements).

However, the [KEYS]({{< relref "/commands/keys" >}}) command should almost never be run against a production system, since returning a list of all keys in a large Redis database can cause significant slowdowns and block other operations.
If you need to scan the keyspace, especially in a production cluster, always use the [SCAN]({{< relref "/commands/scan" >}}) command instead.

### Troubleshooting

The best way to discover slow operations is to view the slow log. The slow log is available in the Redis Enterprise and Redis Cloud consoles:
* [Redis Enterprise slow log docs]({{< relref "/operate/rs/clusters/logging/redis-slow-log" >}})
* [Redis Cloud slow log docs]({{< relref "/operate/rc/databases/view-edit-database#other-actions-and-info" >}})

Redis Cloud dashboard showing slow database operations
{{< image filename="/images/slow_log.png" alt="Redis Cloud dashboard showing slow database operations" >}}

| Issue | Remediation |
| ------ | :------ |
|The KEYS command shows up in the slow log |Find the application that issues the KEYS command and replace it with a SCAN command. In an emergency situation, you can [alter the ACLs for the database user]({{< relref "/operate/rs/security/access-control/redis-acl-overview" >}}) so that Redis will reject the KEYS command altogether. |
|The slow log shows a significant number of slow, O(n) operations | If these operations are being issued against large data structures, then the application may need to be refactored to use more efficient Redis commands. |
|The slow log contains only O(1) commands, and these commands are taking several milliseconds or more to complete |This likely indicates that the database is underprovisioned. Consider increasing the number of shards and/or nodes. |

## Hot keys

A **hot key** is a key that is accessed extremely frequently (for example, thousands of times a second or more).

Each key in Redis belongs to one, and only one, shard. For this reason, a hot key can cause high CPU utilization on that one shard, which can increase latency for all other operations.

### Troubleshooting

You may suspect that you have a hot key if you see high CPU utilization on a single shard.
There are two main ways to identify hot keys: using the Redis CLI and sampling the operations against Redis.

To use the Redis CLI to identify hot keys:

1. First confirm that you have enough available memory to enable an eviction policy.
2. Next, enable the LFU (least-frequently used) eviction policy on the database.
3. Finally, run `redis-cli --hotkeys`.

You may also identify hot keys by sampling the operations against Redis. You can do this by running the [MONITOR]({{< relref "/commands/monitor" >}}) command against the high CPU shard. Because this is a potentially high-impact operation, you should only use this technique as a secondary option. For mission-critical databases, consider contacting Redis support for assistance.

### Remediation

After you discover a hot key, you need to find a way to reduce the number of operations against it. This means getting an understanding of the application's access pattern and the reasons for such frequent access.

If the hot key operations are read-only, consider implementing an application-local cache so that fewer read requests are sent to Redis. For example, even a local cache that expires every 5 seconds can entirely eliminate a hot key issue.

## Large keys

**Large keys** are keys that are hundreds of kilobytes or larger. Large keys can cause high network traffic and high CPU utilization.

### Troubleshooting

To identify large keys, you can sample the keyspace using the Redis CLI.

Run `redis-cli --memkeys` against your database to sample the keyspace in real time and potentially identify the largest keys in your database.

### Remediation

Addressing a large key issue requires understanding why the application is creating large keys in the first place. As such, it's difficult to provide general advice for solving this issue. Resolution often requires a change to the application's data model or the way it interacts with Redis.
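The application-local cache suggested in the hot keys section above can be sketched in a few lines. This is a hypothetical `LocalCache` helper (not a Redis library API); the 5-second TTL follows the example above, and the clock is injectable so the behavior can be tested without waiting:

```python
import time

class LocalCache:
    """Tiny read-through cache that re-fetches a value after ttl seconds."""

    def __init__(self, fetch, ttl: float = 5.0, clock=time.monotonic):
        self.fetch = fetch        # function that actually reads the key from Redis
        self.ttl = ttl
        self.clock = clock
        self._store = {}          # key -> (value, fetched_at)

    def get(self, key):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]         # served locally: no request reaches Redis
        value = self.fetch(key)
        self._store[key] = (value, now)
        return value

# A key read 1,000 times within the TTL costs Redis a single read:
calls = []
fake_now = [0.0]
cache = LocalCache(lambda k: calls.append(k) or f"value:{k}",
                   ttl=5.0, clock=lambda: fake_now[0])
for _ in range(1000):
    cache.get("hot")
assert len(calls) == 1            # only the first read hit the backend
fake_now[0] = 6.0                 # TTL elapsed
cache.get("hot")
assert len(calls) == 2
```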
## Alerting

The Redis Enterprise observability package includes [a suite of alerts and their associated tests for use with Prometheus](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana#alerts). **Note:** Not all the alerts are appropriate for all environments; for example, installations that do not use persistence have no need for storage alerts.

The alerts are packaged with [a series of tests](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/tests) that validate the individual triggers. You can use these tests to validate your modifications to these alerts for specific environments and use cases.

To use these alerts, install [Prometheus Alertmanager](https://prometheus.io/docs/alerting/latest/configuration/). For a comprehensive guide to alerting with Prometheus and Grafana, see the [Grafana blog post on the subject](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/).

## Configuring Prometheus

To configure Prometheus for alerting, open the `prometheus.yml` configuration file.

Uncomment the `Alertmanager` section of the file. The following configuration starts Alertmanager and instructs it to listen on its default port of 9093.

```
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
```

The rule files section of the config file instructs Prometheus to read specific rules files. If you pasted the `alerts.yml` file into `/etc/prometheus`, then the following configuration would be required.

```
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "error_rules.yml"
  - "alerts.yml"
```

After you've done this, restart Prometheus.

The built-in configuration, `error_rules.yml`, has a single alert: Critical Connection Exception.
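For reference, an entry in a rules file has the following shape. This is a hedged sketch modeled on the average-latency trigger from the alert list in this document; the group name, alert name, `for` duration, and labels here are illustrative, not copied from `alerts.yml`:

```
groups:
  - name: redis-enterprise
    rules:
      - alert: DatabaseAvgLatencyWarning
        expr: round(bdb_avg_latency * 1000) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average latency has reached a warning level"
```

The `for` clause reflects the point noted below that most of these alerts fire only if the trigger persists for a given duration.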
If you open the Prometheus console, by default located at port 9090, and select the Alerts tab, you will see this alert, as well as the alerts in any other file you have included as a rules file.

{{< image filename="/images/playbook_prometheus-alerts.png" alt="prometheus alerts image" >}}

The following is a list of alerts contained in the `alerts.yml` file. There are several points to consider:

- Not all Redis Enterprise deployments export all metrics
- Most metrics only alert if the specified trigger persists for a given duration

## List of alerts

| Description | Trigger |
| ------ | :------ |
|Average latency has reached a warning level | round(bdb_avg_latency * 1000) > 1 |
|Average latency has reached a critical level indicating system degradation | round(bdb_avg_latency * 1000) > 4 |
|Absence of any connection indicates improper configuration or firewall issue | bdb_conns < 1 |
|A flood of connections has occurred that will impact normal operations | bdb_conns > 64000 |
|Absence of any requests indicates improperly configured clients | bdb_total_req < 1 |
|Excessive number of client requests indicates configuration and/or programmatic issues | bdb_total_req > 1000000 |
|The database in question will soon be unable to accept new data | round((bdb_used_memory/bdb_memory_limit) * 100) > 98 |
|The database in question will be unable to accept new data in two hours | round((bdb_used_memory/bdb_memory_limit) * 100) < 98 and (predict_linear(bdb_used_memory[15m], 2 * 3600) / bdb_memory_limit) > 0.3 and round(predict_linear(bdb_used_memory[15m], 2 * 3600)/bdb_memory_limit) > 0.98 |
|Database read operations are failing to find entries more than 50% of the time | (100 * bdb_read_hits)/(bdb_read_hits + bdb_read_misses) < 50 |
|In situations where TTL values are not set this indicates a problem | bdb_evicted_objects > 1 |
|Replication between nodes is not in a satisfactory state | bdb_replicaof_syncer_status > 0 |
|Record synchronization
between nodes is not in a satisfactory state | bdb_crdt_syncer_status > 0 | -|The amount by which replication lags behind events is worrisome | bdb_replicaof_syncer_local_ingress_lag_time > 500 | -|The amount by which object replication lags behind events is worrisome | bdb_crdt_syncer_local_ingress_lag_time > 500 | -|The number of active nodes is less than expected | count(node_up) != 3 | -|Persistent storage will soon be exhausted | round((node_persistent_storage_free/node_persistent_storage_avail) * 100) <= 5 | -|Ephemeral storage will soon be exhausted | round((node_ephemeral_storage_free/node_ephemeral_storage_avail) * 100) <= 5 | -|The node in question is close to running out of memory | round((node_available_memory/node_free_memory) * 100) <= 15 | -|The node in question has exceeded expected levels of CPU usage | round((1 - node_cpu_idle) * 100) >= 80 | -|The shard in question is not reachable | redis_up == 0 | -|The master shard is not reachable | floor(redis_master_link_status{role="slave"}) < 1 | -|The shard in question has exceeded expected levels of CPU usage | redis_process_cpu_usage_percent >= 80 | -|The master shard has exceeded expected levels of CPU usage | redis_process_cpu_usage_percent{role="master"} > 0.75 and redis_process_cpu_usage_percent{role="master"} > on (bdb) group_left() (avg by (bdb)(redis_process_cpu_usage_percent{role="master"}) + on(bdb) 1.2 * stddev by (bdb) (redis_process_cpu_usage_percent{role="master"})) | -|The shard in question has an unhealthily high level of connections | redis_connected_clients > 500 | - -## Appendix A: Grafana Dashboards - -Grafana dashboards are available for Redis Enterprise Software and Redis Cloud deployments. - -These dashboards come in three styles, which may be used together to provide -a full picture of your deployment. - -1. Classic dashboards provide detailed information about the cluster, nodes, and individual databases. -2. 
Basic dashboards provide high-level overviews of the various cluster components. -3. Extended dashboards require a third-party library to perform REST calls. - -There are also two workflow dashboards for Redis Enterprise Software that provide drill-down functionality. - -### Software -- [Basic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/basic) -- [Extended](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/extended) -- [Classic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/classic) - -### Workflow -- [Database](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/workflow/databases) -- [Node](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/workflow/nodes) - -### Cloud -- [Basic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/cloud/basic) -- [Extended](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/cloud/extended) - -**Note:** - The 'workflow' dashboards are intended to be used as a package. Therefore, install all of them; they contain links to the other dashboards in the group, permitting rapid navigation between the overview and the drill-down views. 
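Each trigger in the alerts list above is a plain PromQL expression, so it maps directly onto a Prometheus alerting rule. The following sketch shows how the average-latency warning could be written as a rule file entry; the `for` duration, labels, and annotation text are illustrative choices, not taken from the packaged `alerts.yml`:

```yaml
groups:
  - name: redis-enterprise-examples
    rules:
      - alert: DatabaseAvgLatencyWarning
        # Trigger from the alerts table: average database latency above 1 ms
        expr: round(bdb_avg_latency * 1000) > 1
        # Only fire if the condition persists (illustrative duration)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Average database latency has reached a warning level
```

You can validate a modified rule file with `promtool check rules <file>` before restarting Prometheus.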
+{{}} diff --git a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1.md b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1.md index 424fbf3844..d5c51d3970 100644 --- a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1.md +++ b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-v1.md @@ -11,272 +11,11 @@ linkTitle: Prometheus metrics v1 summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. type: integration weight: 48 +tocEmbedHeaders: true --- You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. As of Redis Enterprise Software version 7.8.2, v1 metrics are deprecated but still available. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). -The following tables include the v1 metrics available to Prometheus. - -## Database metrics - -| Metric | Description | -| ------ | :------ | -| bdb_avg_latency | Average latency of operations on the database (seconds); returned only when there is traffic | -| bdb_avg_latency_max | Highest value of average latency of operations on the database (seconds); returned only when there is traffic | -| bdb_avg_read_latency | Average latency of read operations (seconds); returned only when there is traffic | -| bdb_avg_read_latency_max | Highest value of average latency of read operations (seconds); returned only when there is traffic | -| bdb_avg_write_latency | Average latency of write operations (seconds); returned only when there is traffic | -| bdb_avg_write_latency_max | Highest value of average latency of write operations (seconds); returned only when there is traffic | -| bdb_bigstore_shard_count | Shard count by database and by storage engine (driver - rocksdb / speedb); Only for databases with Auto Tiering enabled | -| bdb_conns | Number of client 
connections to the database | -| bdb_egress_bytes | Rate of outgoing network traffic from the database (bytes/sec) | -| bdb_egress_bytes_max | Highest value of the rate of outgoing network traffic from the database (bytes/sec) | -| bdb_evicted_objects | Rate of key evictions from database (evictions/sec) | -| bdb_evicted_objects_max | Highest value of the rate of key evictions from database (evictions/sec) | -| bdb_expired_objects | Rate of keys expired in database (expirations/sec) | -| bdb_expired_objects_max | Highest value of the rate of keys expired in database (expirations/sec) | -| bdb_fork_cpu_system | % cores utilization in system mode for all Redis shard fork child processes of this database | -| bdb_fork_cpu_system_max | Highest value of % cores utilization in system mode for all Redis shard fork child processes of this database | -| bdb_fork_cpu_user | % cores utilization in user mode for all Redis shard fork child processes of this database | -| bdb_fork_cpu_user_max | Highest value of % cores utilization in user mode for all Redis shard fork child processes of this database | -| bdb_ingress_bytes | Rate of incoming network traffic to the database (bytes/sec) | -| bdb_ingress_bytes_max | Highest value of the rate of incoming network traffic to the database (bytes/sec) | -| bdb_instantaneous_ops_per_sec | Request rate handled by all shards of database (ops/sec) | -| bdb_main_thread_cpu_system | % cores utilization in system mode for all Redis shard main threads of this database | -| bdb_main_thread_cpu_system_max | Highest value of % cores utilization in system mode for all Redis shard main threads of this database | -| bdb_main_thread_cpu_user | % cores utilization in user mode for all Redis shard main threads of this database | -| bdb_main_thread_cpu_user_max | Highest value of % cores utilization in user mode for all Redis shard main threads of this database | -| bdb_mem_frag_ratio | RAM fragmentation ratio (RSS / allocated RAM) | -| bdb_mem_size_lua | 
Redis Lua scripting heap size (bytes) | -| bdb_memory_limit | Configured RAM limit for the database | -| bdb_monitor_sessions_count | Number of clients connected in monitor mode to the database | -| bdb_no_of_keys | Number of keys in database | -| bdb_other_req | Rate of other (non read/write) requests on the database (ops/sec) | -| bdb_other_req_max | Highest value of the rate of other (non read/write) requests on the database (ops/sec) | -| bdb_other_res | Rate of other (non read/write) responses on the database (ops/sec) | -| bdb_other_res_max | Highest value of the rate of other (non read/write) responses on the database (ops/sec) | -| bdb_pubsub_channels | Number of pub/sub channels with subscribed clients | -| bdb_pubsub_channels_max | Highest value of the number of pub/sub channels with subscribed clients | -| bdb_pubsub_patterns | Number of pub/sub patterns with subscribed clients | -| bdb_pubsub_patterns_max | Highest value of the number of pub/sub patterns with subscribed clients | -| bdb_read_hits | Rate of read operations accessing an existing key (ops/sec) | -| bdb_read_hits_max | Highest value of the rate of read operations accessing an existing key (ops/sec) | -| bdb_read_misses | Rate of read operations accessing a non-existing key (ops/sec) | -| bdb_read_misses_max | Highest value of the rate of read operations accessing a non-existing key (ops/sec) | -| bdb_read_req | Rate of read requests on the database (ops/sec) | -| bdb_read_req_max | Highest value of the rate of read requests on the database (ops/sec) | -| bdb_read_res | Rate of read responses on the database (ops/sec) | -| bdb_read_res_max | Highest value of the rate of read responses on the database (ops/sec) | -| bdb_shard_cpu_system | % cores utilization in system mode for all Redis shard processes of this database | -| bdb_shard_cpu_system_max | Highest value of % cores utilization in system mode for all Redis shard processes of this database | -| bdb_shard_cpu_user | % cores utilization in user 
mode for the redis shard process | -| bdb_shard_cpu_user_max | Highest value of % cores utilization in user mode for the Redis shard process | -| bdb_shards_used | Used shard count by database and by shard type (ram / flash) | -| bdb_total_connections_received | Rate of new client connections to the database (connections/sec) | -| bdb_total_connections_received_max | Highest value of the rate of new client connections to the database (connections/sec) | -| bdb_total_req | Rate of all requests on the database (ops/sec) | -| bdb_total_req_max | Highest value of the rate of all requests on the database (ops/sec) | -| bdb_total_res | Rate of all responses on the database (ops/sec) | -| bdb_total_res_max | Highest value of the rate of all responses on the database (ops/sec) | -| bdb_up | Database is up and running | -| bdb_used_memory | Memory used by the database (in BigRedis this includes flash) (bytes) | -| bdb_write_hits | Rate of write operations accessing an existing key (ops/sec) | -| bdb_write_hits_max | Highest value of the rate of write operations accessing an existing key (ops/sec) | -| bdb_write_misses | Rate of write operations accessing a non-existing key (ops/sec) | -| bdb_write_misses_max | Highest value of the rate of write operations accessing a non-existing key (ops/sec) | -| bdb_write_req | Rate of write requests on the database (ops/sec) | -| bdb_write_req_max | Highest value of the rate of write requests on the database (ops/sec) | -| bdb_write_res | Rate of write responses on the database (ops/sec) | -| bdb_write_res_max | Highest value of the rate of write responses on the database (ops/sec) | -| no_of_expires | Current number of volatile keys in the database | - -## Node metrics - -| Metric | Description | -| ------ | :------ | -| node_available_flash | Available flash in the node (bytes) | -| node_available_flash_no_overbooking | Available flash in the node (bytes), without taking into account overbooking | -| node_available_memory | Amount of 
free memory in the node (bytes) that is available for database provisioning | -| node_available_memory_no_overbooking | Available ram in the node (bytes) without taking into account overbooking | -| node_avg_latency | Average latency of requests handled by endpoints on the node in milliseconds; returned only when there is traffic | -| node_bigstore_free | Sum of free space of back-end flash (used by flash database's BigRedis) on all cluster nodes (bytes); returned only when BigRedis is enabled | -| node_bigstore_iops | Rate of i/o operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled | -| node_bigstore_kv_ops | Rate of value read/write operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (ops/sec); returned only when BigRedis is enabled | -| node_bigstore_throughput | Throughput i/o operations against back-end flash for all shards which are part of a flash-based database (BigRedis) in the cluster (bytes/sec); returned only when BigRedis is enabled | -| node_cert_expiration_seconds | Certificate expiration (in seconds) per given node; read more about [certificates in Redis Enterprise]({{< relref "/operate/rs/security/certificates" >}}) and [monitoring certificates]({{< relref "/operate/rs/security/certificates/monitor-certificates" >}}) | -| node_conns | Number of clients connected to endpoints on node | -| node_cpu_idle | CPU idle time portion (0-1, multiply by 100 to get percent) | -| node_cpu_idle_max | Highest value of CPU idle time portion (0-1, multiply by 100 to get percent) | -| node_cpu_idle_median | Average value of CPU idle time portion (0-1, multiply by 100 to get percent) | -| node_cpu_idle_min | Lowest value of CPU idle time portion (0-1, multiply by 100 to get percent) | -| node_cpu_system | CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | -| 
node_cpu_system_max | Highest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | -| node_cpu_system_median | Average value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | -| node_cpu_system_min | Lowest value of CPU time portion spent in the kernel (0-1, multiply by 100 to get percent) | -| node_cpu_user | CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | -| node_cpu_user_max | Highest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | -| node_cpu_user_median | Average value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | -| node_cpu_user_min | Lowest value of CPU time portion spent by user-space processes (0-1, multiply by 100 to get percent) | -| node_cur_aof_rewrites | Number of AOF rewrites that are currently performed by shards on this node | -| node_egress_bytes | Rate of outgoing network traffic to node (bytes/sec) | -| node_egress_bytes_max | Highest value of the rate of outgoing network traffic to node (bytes/sec) | -| node_egress_bytes_median | Average value of the rate of outgoing network traffic to node (bytes/sec) | -| node_egress_bytes_min | Lowest value of the rate of outgoing network traffic to node (bytes/sec) | -| node_ephemeral_storage_avail | Disk space available to RLEC processes on configured ephemeral disk (bytes) | -| node_ephemeral_storage_free | Free disk space on configured ephemeral disk (bytes) | -| node_free_memory | Free memory in the node (bytes) | -| node_ingress_bytes | Rate of incoming network traffic to node (bytes/sec) | -| node_ingress_bytes_max | Highest value of the rate of incoming network traffic to node (bytes/sec) | -| node_ingress_bytes_median | Average value of the rate of incoming network traffic to node (bytes/sec) | -| node_ingress_bytes_min | Lowest value of the rate of incoming network traffic to node (bytes/sec) | -| 
node_persistent_storage_avail | Disk space available to RLEC processes on configured persistent disk (bytes) | -| node_persistent_storage_free | Free disk space on configured persistent disk (bytes) | -| node_provisional_flash | Amount of flash available for new shards on this node, taking into account overbooking, max Redis servers, reserved flash and provision and migration thresholds (bytes) | -| node_provisional_flash_no_overbooking | Amount of flash available for new shards on this node, without taking into account overbooking, max Redis servers, reserved flash and provision and migration thresholds (bytes) | -| node_provisional_memory | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases | -| node_provisional_memory_no_overbooking | Amount of RAM that is available for provisioning to databases out of the total RAM allocated for databases, without taking into account overbooking | -| node_total_req | Request rate handled by endpoints on node (ops/sec) | -| node_up | Node is part of the cluster and is connected | - -## Cluster metrics - -| Metric | Description | -| ------ | :------ | -| cluster_shards_limit | Total shard limit by the license by shard type (ram / flash) | - - -## Proxy metrics - -| Metric | Description | -| ------ | :------ | -| listener_acc_latency | Accumulative latency (sum of the latencies) of all types of commands on the database. For the average latency, divide this value by listener_total_res | -| listener_acc_latency_max | Highest value of accumulative latency of all types of commands on the database | -| listener_acc_other_latency | Accumulative latency (sum of the latencies) of commands that are the type "other" on the database. 
For the average latency, divide this value by listener_other_res | -| listener_acc_other_latency_max | Highest value of accumulative latency of commands that are the type "other" on the database | -| listener_acc_read_latency | Accumulative latency (sum of the latencies) of commands that are the type "read" on the database. For the average latency, divide this value by listener_read_res | -| listener_acc_read_latency_max | Highest value of accumulative latency of commands that are the type "read" on the database | -| listener_acc_write_latency | Accumulative latency (sum of the latencies) of commands that are the type "write" on the database. For the average latency, divide this value by listener_write_res | -| listener_acc_write_latency_max | Highest value of accumulative latency of commands that are the type "write" on the database | -| listener_auth_cmds | Number of memcached AUTH commands sent to the database | -| listener_auth_cmds_max | Highest value of the number of memcached AUTH commands sent to the database | -| listener_auth_errors | Number of error responses to memcached AUTH commands | -| listener_auth_errors_max | Highest value of the number of error responses to memcached AUTH commands | -| listener_cmd_flush | Number of memcached FLUSH_ALL commands sent to the database | -| listener_cmd_flush_max | Highest value of the number of memcached FLUSH_ALL commands sent to the database | -| listener_cmd_get | Number of memcached GET commands sent to the database | -| listener_cmd_get_max | Highest value of the number of memcached GET commands sent to the database | -| listener_cmd_set | Number of memcached SET commands sent to the database | -| listener_cmd_set_max | Highest value of the number of memcached SET commands sent to the database | -| listener_cmd_touch | Number of memcached TOUCH commands sent to the database | -| listener_cmd_touch_max | Highest value of the number of memcached TOUCH commands sent to the database | -| listener_conns | Number of 
clients connected to the endpoint | -| listener_egress_bytes | Rate of outgoing network traffic to the endpoint (bytes/sec) | -| listener_egress_bytes_max | Highest value of the rate of outgoing network traffic to the endpoint (bytes/sec) | -| listener_ingress_bytes | Rate of incoming network traffic to the endpoint (bytes/sec) | -| listener_ingress_bytes_max | Highest value of the rate of incoming network traffic to the endpoint (bytes/sec) | -| listener_last_req_time | Time of last command sent to the database | -| listener_last_res_time | Time of last response sent from the database | -| listener_max_connections_exceeded | Number of times the number of clients connected to the database at the same time has exceeded the maximum limit | -| listener_max_connections_exceeded_max | Highest value of the number of times the number of clients connected to the database at the same time has exceeded the maximum limit | -| listener_monitor_sessions_count | Number of clients connected in monitor mode to the endpoint | -| listener_other_req | Rate of other (non read/write) requests on the endpoint (ops/sec) | -| listener_other_req_max | Highest value of the rate of other (non read/write) requests on the endpoint (ops/sec) | -| listener_other_res | Rate of other (non read/write) responses on the endpoint (ops/sec) | -| listener_other_res_max | Highest value of the rate of other (non read/write) responses on the endpoint (ops/sec) | -| listener_other_started_res | Number of responses sent from the database of type "other" | -| listener_other_started_res_max | Highest value of the number of responses sent from the database of type "other" | -| listener_read_req | Rate of read requests on the endpoint (ops/sec) | -| listener_read_req_max | Highest value of the rate of read requests on the endpoint (ops/sec) | -| listener_read_res | Rate of read responses on the endpoint (ops/sec) | -| listener_read_res_max | Highest value of the rate of read responses on the endpoint (ops/sec) | -| 
listener_read_started_res | Number of responses sent from the database of type "read" | -| listener_read_started_res_max | Highest value of the number of responses sent from the database of type "read" | -| listener_total_connections_received | Rate of new client connections to the endpoint (connections/sec) | -| listener_total_connections_received_max | Highest value of the rate of new client connections to the endpoint (connections/sec) | -| listener_total_req | Request rate handled by the endpoint (ops/sec) | -| listener_total_req_max | Highest value of the rate of all requests on the endpoint (ops/sec) | -| listener_total_res | Rate of all responses on the endpoint (ops/sec) | -| listener_total_res_max | Highest value of the rate of all responses on the endpoint (ops/sec) | -| listener_total_started_res | Number of responses sent from the database of all types | -| listener_total_started_res_max | Highest value of the number of responses sent from the database of all types | -| listener_write_req | Rate of write requests on the endpoint (ops/sec) | -| listener_write_req_max | Highest value of the rate of write requests on the endpoint (ops/sec) | -| listener_write_res | Rate of write responses on the endpoint (ops/sec) | -| listener_write_res_max | Highest value of the rate of write responses on the endpoint (ops/sec) | -| listener_write_started_res | Number of responses sent from the database of type "write" | -| listener_write_started_res_max | Highest value of the number of responses sent from the database of type "write" | - -## Replication metrics - -| Metric | Description | -| ------ | :------ | -| bdb_replicaof_syncer_ingress_bytes | Rate of compressed incoming network traffic to a Replica Of database (bytes/sec) | -| bdb_replicaof_syncer_ingress_bytes_decompressed | Rate of decompressed incoming network traffic to a Replica Of database (bytes/sec) | -| bdb_replicaof_syncer_local_ingress_lag_time | Lag time between the source and the destination for 
Replica Of traffic (ms) | -| bdb_replicaof_syncer_status | Syncer status for Replica Of traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | -| bdb_crdt_syncer_ingress_bytes | Rate of compressed incoming network traffic to CRDB (bytes/sec) | -| bdb_crdt_syncer_ingress_bytes_decompressed | Rate of decompressed incoming network traffic to CRDB (bytes/sec) | -| bdb_crdt_syncer_local_ingress_lag_time | Lag time between the source and the destination (ms) for CRDB traffic | -| bdb_crdt_syncer_status | Syncer status for CRDB traffic; 0 = in-sync, 1 = syncing, 2 = out of sync | - -## Shard metrics - -| Metric | Description | -| ------ | :------ | -| redis_active_defrag_running | Automatic memory defragmentation current aggressiveness (% cpu) | -| redis_allocator_active | Total used memory, including external fragmentation | -| redis_allocator_allocated | Total allocated memory | -| redis_allocator_resident | Total resident memory (RSS) | -| redis_aof_last_cow_size | Last AOFR, CopyOnWrite memory | -| redis_aof_rewrite_in_progress | The number of simultaneous AOF rewrites that are in progress | -| redis_aof_rewrites | Number of AOF rewrites this process executed | -| redis_aof_delayed_fsync | Number of times an AOF fsync caused delays in the Redis main thread (inducing latency); this can indicate that the disk is slow or overloaded | -| redis_blocked_clients | Count the clients waiting on a blocking call | -| redis_connected_clients | Number of client connections to the specific shard | -| redis_connected_slaves | Number of connected replicas | -| redis_db0_avg_ttl | Average TTL of all volatile keys | -| redis_db0_expires | Total count of volatile keys | -| redis_db0_keys | Total key count | -| redis_evicted_keys | Keys evicted so far (since restart) | -| redis_expire_cycle_cpu_milliseconds | The cumulative amount of time spent on active expiry cycles | -| redis_expired_keys | Keys expired so far (since restart) | -| redis_forwarding_state | Shard forwarding state (on or 
off) | -| redis_keys_trimmed | The number of keys that were trimmed in the current or last resharding process | -| redis_keyspace_read_hits | Number of read operations accessing an existing keyspace | -| redis_keyspace_read_misses | Number of read operations accessing a non-existing keyspace | -| redis_keyspace_write_hits | Number of write operations accessing an existing keyspace | -| redis_keyspace_write_misses | Number of write operations accessing a non-existing keyspace | -| redis_master_link_status | Indicates if the replica is connected to its master | -| redis_master_repl_offset | Number of bytes sent to replicas by the shard; calculate the throughput for a time period by comparing the value at different times | -| redis_master_sync_in_progress | The primary shard is synchronizing (1 true; 0 false) | -| redis_max_process_mem | Current memory limit configured by redis_mgr according to node free memory | -| redis_maxmemory | Current memory limit configured by redis_mgr according to database memory limits | -| redis_mem_aof_buffer | Current size of AOF buffer | -| redis_mem_clients_normal | Current memory used for input and output buffers of non-replica clients | -| redis_mem_clients_slaves | Current memory used for input and output buffers of replica clients | -| redis_mem_fragmentation_ratio | Memory fragmentation ratio (1.3 means 30% overhead) | -| redis_mem_not_counted_for_evict | Portion of used_memory (in bytes) that's not counted for eviction and OOM error | -| redis_mem_replication_backlog | Size of replication backlog | -| redis_module_fork_in_progress | A binary value that indicates if there is an active fork spawned by a module (1) or not (0) | -| redis_process_cpu_system_seconds_total | Shard process system CPU time spent in seconds | -| redis_process_cpu_usage_percent | Shard process CPU usage percentage | -| redis_process_cpu_user_seconds_total | Shard user CPU time spent in seconds | -| redis_process_main_thread_cpu_system_seconds_total | 
Shard main thread system CPU time spent in seconds | -| redis_process_main_thread_cpu_user_seconds_total | Shard main thread user CPU time spent in seconds | -| redis_process_max_fds | Shard maximum number of open file descriptors | -| redis_process_open_fds | Shard number of open file descriptors | -| redis_process_resident_memory_bytes | Shard resident memory size in bytes | -| redis_process_start_time_seconds | Shard start time of the process since unix epoch in seconds | -| redis_process_virtual_memory_bytes | Shard virtual memory in bytes | -| redis_rdb_bgsave_in_progress | Indication if bgsave is currently in progress | -| redis_rdb_last_cow_size | Last bgsave (or SYNC fork) used CopyOnWrite memory | -| redis_rdb_saves | Total count of bgsaves since process was restarted (including replica fullsync and persistence) | -| redis_repl_touch_bytes | Number of bytes sent to replicas as TOUCH commands by the shard as a result of a READ command that was processed; calculate the throughput for a time period by comparing the value at different times | -| redis_total_commands_processed | Number of commands processed by the shard; calculate the number of commands for a time period by comparing the value at different times | -| redis_total_connections_received | Number of connections received by the shard; calculate the number of connections for a time period by comparing the value at different times | -| redis_total_net_input_bytes | Number of bytes received by the shard; calculate the throughput for a time period by comparing the value at different times | -| redis_total_net_output_bytes | Number of bytes sent by the shard; calculate the throughput for a time period by comparing the value at different times | -| redis_up | Shard is up and running | -| redis_used_memory | Memory used by shard (in BigRedis this includes flash) (bytes) | +{{}} diff --git a/content/operate/rs/monitoring/_index.md b/content/operate/rs/monitoring/_index.md index 0a9cc7320d..27789758ab 100644 
--- a/content/operate/rs/monitoring/_index.md +++ b/content/operate/rs/monitoring/_index.md @@ -15,80 +15,27 @@ aliases: /operate/rs/clusters/monitoring/ You can use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to monitor the performance of your databases. + +## View metrics and configure alerts + In the Redis Enterprise Cluster Manager UI, you can view metrics, configure alerts, and send notifications based on alert parameters. You can also access metrics and configure alerts through the REST API. +## Metrics stream engine preview + Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://<IP>:8070/v2`. This new engine exports all time-series metrics to external monitoring tools such as Grafana, Datadog, New Relic, and Dynatrace using Prometheus. The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shard failovers and scaling operations. -If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. - -To integrate Redis Enterprise metrics into your monitoring environment, see the integration guides for [Prometheus and Grafana]({{< relref "/integrate/prometheus-with-redis-enterprise/" >}}) or [Uptrace]({{< relref "/integrate/uptrace-with-redis-enterprise/" >}}). - -Make sure you read the [definition of each metric]({{< relref "/operate/rs/references/metrics/" >}}) -so that you understand exactly what it represents. 
- -## Cluster manager metrics - -You can see the metrics of the cluster in: - -- **Cluster > Metrics** -- **Node > Metrics** for each node -- **Database > Metrics** for each database, including the shards for that database - -The scale selector at the top of the page allows you to set the X-axis (time) scale of the graph. - -To choose which metrics to display in the two large graphs at the top of the page: - -1. Hover over the graph you want to show in a large graph. -1. Click on the right or left arrow to choose which side to show the graph. - -We recommend that you show two similar metrics in the top graphs so you can compare them side-by-side. +If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. -## Cluster alerts +## Integrate with external monitoring tools -In **Cluster > Alert Settings**, you can enable alerts for node or cluster events, such as high memory usage or throughput. +To integrate Redis Enterprise metrics into your monitoring environment, see the integration guides for [Prometheus and Grafana]({{< relref "/operate/rs/monitoring/prometheus_and_grafana" >}}). -Configured alerts are shown: +Filter [Libraries and tools]({{}}) by "observability" for additional tools and guides. -- As a notification on the status icon ( {{< image filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" class="inline" >}} ) for the node and cluster -- In the **log** -- In email notifications, if you configure [email alerts](#send-alerts-by-email) +## Metrics reference -{{< note >}} -If you enable alerts for "Node joined" or "Node removed" actions, -you must also enable "Receive email alerts" so that the notifications are sent. -{{< /note >}} - -To enable alerts for a cluster: - -1. In **Cluster > Alert Settings**, click **Edit**. 
-1. Select the alerts that you want to show for the cluster and click **Save**. - -## Database alerts - -For each database, you can enable alerts for database events, such as high memory usage or throughput. - -Configured alerts are shown: - -- As a notification on the status icon ( {{< image filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" class="inline" >}} ) for the database -- In the **log** -- In emails, if you configure [email alerts](#send-alerts-by-email) - -To enable alerts for a database: - -1. In **Configuration** for the database, click **Edit**. -1. Select the **Alerts** section to open it. -1. Select the alerts that you want to show for the database and click **Save**. - -## Send alerts by email - -To send cluster and database alerts by email: - -1. In **Cluster > Alert Settings**, click **Edit**. -1. Select **Set an email** to configure the [email server settings]({{< relref "/operate/rs/clusters/configure/cluster-settings#configuring-email-server-settings" >}}). -1. In **Configuration** for the database, click **Edit**. -1. Select the **Alerts** section to open it. -1. Select **Receive email alerts** and click **Save**. -1. In **Access Control**, select the [database and cluster alerts]({{< relref "/operate/rs/security/access-control/manage-users" >}}) that you want each user to receive. +Make sure you read the [definition of each metric]({{< relref "/operate/rs/references/metrics/" >}}) +so that you understand exactly what it represents. 
diff --git a/content/operate/rs/databases/durability-ha/db-availability.md b/content/operate/rs/monitoring/db-availability.md similarity index 94% rename from content/operate/rs/databases/durability-ha/db-availability.md rename to content/operate/rs/monitoring/db-availability.md index 870179667b..a3bb259f17 100644 --- a/content/operate/rs/databases/durability-ha/db-availability.md +++ b/content/operate/rs/monitoring/db-availability.md @@ -6,10 +6,11 @@ categories: - rs db_type: database description: Verify if a Redis Software database is available to perform read and write operations and can respond to queries from client applications. -linkTitle: Database availability -title: Check database availability +linkTitle: Check database availability +title: Check database availability for monitoring and load balancers toc: 'true' -weight: 30 +weight: 80 +aliases: /operate/rs/databases/durability-ha/db-availability/ --- You can use the [database availability API]({{}}) to verify whether a Redis Software database is available to perform read and write operations and can respond to queries from client applications. Load balancers and automated monitoring tools can use this API to monitor database availability. diff --git a/content/operate/rs/monitoring/metrics_stream_engine.md b/content/operate/rs/monitoring/metrics_stream_engine.md new file mode 100644 index 0000000000..fa4d5c9f11 --- /dev/null +++ b/content/operate/rs/monitoring/metrics_stream_engine.md @@ -0,0 +1,22 @@ +--- +Title: Metrics stream engine preview for monitoring v2 +alwaysopen: false +categories: +- docs +- operate +- rs +- kubernetes +description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases. 
+hideListLinks: true +linkTitle: Metrics stream engine preview for monitoring v2 +weight: 60 +--- + +The latest approach to monitoring Redis Enterprise Software clusters, nodes, databases, and shards no longer includes the internal monitoring systems like the stats API and Cluster Manager metrics and alerts. Instead, you can use the v2 Prometheus scraping endpoint to integrate with external monitoring tools such as Prometheus and Grafana. + +Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. +This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. + +The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. + +If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. diff --git a/content/operate/rs/monitoring/metrics_stream_engine/_index.md b/content/operate/rs/monitoring/metrics_stream_engine/_index.md deleted file mode 100644 index 7ea1e2cf41..0000000000 --- a/content/operate/rs/monitoring/metrics_stream_engine/_index.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -Title: Metrics stream engine preview -alwaysopen: false -categories: -- docs -- operate -- rs -- kubernetes -description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases.
-hideListLinks: true -linkTitle: Metrics stream engine - v2 monitoring preview -weight: 60 ---- - -TBA diff --git a/content/operate/rs/monitoring/metrics_stream_engine/prometheus_and_grafana.md b/content/operate/rs/monitoring/metrics_stream_engine/prometheus_and_grafana.md deleted file mode 100644 index a5f4e8059c..0000000000 --- a/content/operate/rs/monitoring/metrics_stream_engine/prometheus_and_grafana.md +++ /dev/null @@ -1,173 +0,0 @@ ---- -LinkTitle: Prometheus and Grafana -Title: Prometheus and Grafana with Redis Enterprise Software -alwaysopen: false -categories: -- docs -- integrate -- rs -description: Use Prometheus and Grafana to collect and visualize Redis Enterprise Software metrics. -group: observability -summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise - Software metrics. -type: integration -weight: 5 -tocEmbedHeaders: true ---- - -You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. - -Metrics are exposed at the cluster, node, database, shard, and proxy levels. - - -- [Prometheus](https://prometheus.io/) is an open source systems monitoring and alerting toolkit that aggregates metrics from different sources. -- [Grafana](https://grafana.com/) is an open source metrics visualization tool that processes Prometheus data. - -You can use Prometheus and Grafana to: -- Collect and display metrics not available in the [admin console]({{< relref "/operate/rs/references/metrics" >}}) - -- Set up automatic alerts for node or cluster events - -- Display Redis Enterprise Software metrics alongside data from other systems - -{{Graphic showing how Prometheus and Grafana collect and display data from a Redis Enterprise Cluster. Prometheus collects metrics from the Redis Enterprise cluster, and Grafana queries those metrics for visualization.}} - -In each cluster, the metrics_exporter process exposes Prometheus metrics on port 8070. 
-Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. - -## Quick start - -To get started with Prometheus and Grafana: - -1. Create a directory called 'prometheus' on your local machine. - -1. Within that directory, create a configuration file called `prometheus.yml`. -1. Add the following contents to the configuration file and replace `` with your Redis Enterprise cluster's FQDN: - - {{< note >}} - -We recommend running Prometheus in Docker only for development and testing. - - {{< /note >}} - - ```yml - global: - scrape_interval: 15s - evaluation_interval: 15s - - # Attach these labels to any time series or alerts when communicating with - # external systems (federation, remote storage, Alertmanager). - external_labels: - monitor: "prometheus-stack-monitor" - - # Load and evaluate rules in this file every 'evaluation_interval' seconds. - #rule_files: - # - "first.rules" - # - "second.rules" - - scrape_configs: - # scrape Prometheus itself - - job_name: prometheus - scrape_interval: 10s - scrape_timeout: 5s - static_configs: - - targets: ["localhost:9090"] - - # scrape Redis Enterprise - - job_name: redis-enterprise - scrape_interval: 30s - scrape_timeout: 30s - metrics_path: / - scheme: https - tls_config: - insecure_skip_verify: true - static_configs: - - targets: [":8070"] # For v2, use [":8070/v2"] - ``` - -1. Set up your Prometheus and Grafana servers. - To set up Prometheus and Grafana on Docker: - 1. Create a _docker-compose.yml_ file: - - ```yml - version: '3' - services: - prometheus-server: - image: prom/prometheus - ports: - - 9090:9090 - volumes: - - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml - - grafana-ui: - image: grafana/grafana - ports: - - 3000:3000 - environment: - - GF_SECURITY_ADMIN_PASSWORD=secret - links: - - prometheus-server:prometheus - ``` - - 1. 
To start the containers, run: - - ```sh - $ docker compose up -d - ``` - - 1. To check that all of the containers are up, run: `docker ps` - 1. In your browser, sign in to Prometheus at http://localhost:9090 to make sure the server is running. - 1. Select **Status** and then **Targets** to check that Prometheus is collecting data from your Redis Enterprise cluster. - - {{The Redis Enterprise target showing that Prometheus is connected to the Redis Enterprise Cluster.}} - - If Prometheus is connected to the cluster, you can type **node_up** in the Expression field on the Prometheus home page to see the cluster metrics. - -1. Configure the Grafana datasource: - 1. Sign in to Grafana. If you installed Grafana locally, go to http://localhost:3000 and sign in with: - - - Username: admin - - Password: secret - - 1. In the Grafana configuration menu, select **Data Sources**. - - 1. Select **Add data source**. - - 1. Select **Prometheus** from the list of data source types. - - {{The Prometheus data source in the list of data sources on Grafana.}} - - 1. Enter the Prometheus configuration information: - - - Name: `redis-enterprise` - - URL: `http://:9090` - - {{The Prometheus connection form in Grafana.}} - - {{< note >}} - -- If the network port is not accessible to the Grafana server, select the **Browser** option from the Access menu. -- In a testing environment, you can select **Skip TLS verification**. - - {{< /note >}} - -1. Add dashboards for cluster, database, node, and shard metrics. - To add preconfigured dashboards: - 1. In the Grafana dashboards menu, select **Manage**. - 1. Click **Import**. - 1. Upload one or more [Grafana dashboards](#grafana-dashboards-for-redis-enterprise). 
- -### Grafana dashboards for Redis Enterprise - -Redis publishes four preconfigured dashboards for Redis Enterprise and Grafana: - -* The [cluster status dashboard](https://grafana.com/grafana/dashboards/18405-cluster-status-dashboard/) provides an overview of your Redis Enterprise clusters. -* The [database status dashboard](https://grafana.com/grafana/dashboards/18408-database-status-dashboard/) displays specific database metrics, including latency, memory usage, ops/second, and key count. -* The [node metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-node-dashboard_v9-11.json) provides metrics for each of the nodes hosting your cluster. -* The [shard metrics dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-shard-dashboard_v9-11.json) displays metrics for the individual Redis processes running on your cluster nodes -* The [Active-Active dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/basic/redis-software-active-active-dashboard_v9-11.json) displays metrics specific to [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}}). - -These dashboards are open source. For additional dashboard options, or to file an issue, see the [Redis Enterprise observability Github repository](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana). - -For more information about configuring Grafana dashboards, see the [Grafana documentation](https://grafana.com/docs/). 
- diff --git a/content/operate/rs/monitoring/observability.md new file mode 100644 index 0000000000..6f7e0ba281 --- /dev/null +++ b/content/operate/rs/monitoring/observability.md @@ -0,0 +1,17 @@ +--- +Title: Redis Enterprise Software observability and monitoring guidance +alwaysopen: false +categories: +- docs +- integrate +- rs +description: Using monitoring and observability with Redis Enterprise +group: observability +linkTitle: Observability and monitoring +summary: Observe Redis Enterprise resources and database performance indicators. +type: integration +weight: 45 +tocEmbedHeaders: true +--- + +{{}} diff --git a/content/operate/rs/monitoring/prometheus_and_grafana.md new file mode 100644 index 0000000000..f991def39f --- /dev/null +++ b/content/operate/rs/monitoring/prometheus_and_grafana.md @@ -0,0 +1,18 @@ +--- +LinkTitle: Prometheus and Grafana quick start +Title: Prometheus and Grafana with Redis Enterprise Software quick start +alwaysopen: false +categories: +- docs +- integrate +- rs +description: Use Prometheus and Grafana to collect and visualize Redis Enterprise Software metrics. +group: observability +summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise + Software metrics. +type: integration +weight: 5 +tocEmbedHeaders: true +--- + +{{}} diff --git a/content/operate/rs/monitoring/v1_monitoring.md new file mode 100644 index 0000000000..5f29d3009d --- /dev/null +++ b/content/operate/rs/monitoring/v1_monitoring.md @@ -0,0 +1,87 @@ +--- +Title: Metrics and alerts for monitoring v1 +alwaysopen: false +categories: +- docs +- operate +- rs +- kubernetes +description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases.
+hideListLinks: true +linkTitle: Monitoring v1 +weight: 50 +--- + +The current approach to monitoring Redis Enterprise Software clusters, nodes, databases, and shards includes: + +- Internal monitoring systems: + + - The stats API + + - Cluster Manager metrics and alerts + +- The v1 Prometheus scraping endpoint to integrate with external monitoring tools such as Prometheus and Grafana. + +## Cluster Manager metrics + +You can see the metrics of the cluster in: + +- **Cluster > Metrics** +- **Node > Metrics** for each node +- **Database > Metrics** for each database, including the shards for that database + +The scale selector at the top of the page allows you to set the X-axis (time) scale of the graph. + +To choose which metrics to display in the two large graphs at the top of the page: + +1. Hover over the graph you want to show in a large graph. +1. Click the right or left arrow to choose which side to show the graph on. + +We recommend that you show two similar metrics in the top graphs so you can compare them side-by-side. + +## Cluster alerts + +In **Cluster > Alert Settings**, you can enable alerts for node or cluster events, such as high memory usage or throughput. + +Configured alerts are shown: + +- As a notification on the status icon ( {{< image filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" class="inline" >}} ) for the node and cluster +- In the **log** +- In email notifications, if you configure [email alerts](#send-alerts-by-email) + +{{< note >}} +If you enable alerts for "Node joined" or "Node removed" actions, +you must also enable "Receive email alerts" so that the notifications are sent. +{{< /note >}} + +To enable alerts for a cluster: + +1. In **Cluster > Alert Settings**, click **Edit**. +1. Select the alerts that you want to show for the cluster and click **Save**. + +## Database alerts + +For each database, you can enable alerts for database events, such as high memory usage or throughput.
+ +Configured alerts are shown: + +- As a notification on the status icon ( {{< image filename="/images/rs/icons/icon_warning.png#no-click" alt="Warning" width="18px" class="inline" >}} ) for the database +- In the **log** +- In emails, if you configure [email alerts](#send-alerts-by-email) + +To enable alerts for a database: + +1. In **Configuration** for the database, click **Edit**. +1. Select the **Alerts** section to open it. +1. Select the alerts that you want to show for the database and click **Save**. + +## Send alerts by email + +To send cluster and database alerts by email: + +1. In **Cluster > Alert Settings**, click **Edit**. +1. Select **Set an email** to configure the [email server settings]({{< relref "/operate/rs/clusters/configure/cluster-settings#configuring-email-server-settings" >}}). +1. In **Configuration** for the database, click **Edit**. +1. Select the **Alerts** section to open it. +1. Select **Receive email alerts** and click **Save**. +1. In **Access Control**, select the [database and cluster alerts]({{< relref "/operate/rs/security/access-control/manage-users" >}}) that you want each user to receive. diff --git a/content/operate/rs/monitoring/v1_monitoring/_index.md b/content/operate/rs/monitoring/v1_monitoring/_index.md deleted file mode 100644 index d650e53b87..0000000000 --- a/content/operate/rs/monitoring/v1_monitoring/_index.md +++ /dev/null @@ -1,15 +0,0 @@ ---- -Title: Monitoring with metrics and alerts -alwaysopen: false -categories: -- docs -- operate -- rs -- kubernetes -description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases. 
-hideListLinks: true -linkTitle: V1 monitoring -weight: 50 ---- - -TBA diff --git a/content/operate/rs/references/metrics/_index.md b/content/operate/rs/references/metrics/_index.md index 498b0ae9eb..e8fe0d5dfa 100644 --- a/content/operate/rs/references/metrics/_index.md +++ b/content/operate/rs/references/metrics/_index.md @@ -22,17 +22,30 @@ See the following topics for metrics definitions: ## Prometheus metrics To collect and display metrics data from your databases and other cluster components, -you can connect your [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/) server to your Redis Enterprise Software cluster. See [Metrics in Prometheus]({{< relref "/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions" >}}) for a list of available metrics. +you can connect your [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/) server to your Redis Enterprise Software cluster. We recommend you use Prometheus and Grafana to view metrics history and trends. -We recommend you use Prometheus and Grafana to view metrics history and trends. +See [Prometheus integration]({{< relref "/operate/rs/monitoring/prometheus_and_grafana" >}}) to learn how to connect Prometheus and Grafana to your Redis Enterprise database. -See [Prometheus integration]({{< relref "/integrate/prometheus-with-redis-enterprise/" >}}) to learn how to connect Prometheus and Grafana to your Redis Enterprise database. +Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. +This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. + +The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. 
+ +If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. + +For a list of available metrics, see the following references: + +- [Prometheus metrics v1]({{}}) + +- [Prometheus metrics v2 preview]({{}}) + ## Limitations ### Shard limit -Metrics information is not shown for clusters with more than 128 shards. For large clusters, we recommend you use [Prometheus and Grafana]({{< relref "/integrate/prometheus-with-redis-enterprise/" >}}) to view metrics. +Metrics information is not shown for clusters with more than 128 shards. For large clusters, we recommend you use [Prometheus and Grafana]({{< relref "/operate/rs/monitoring/prometheus_and_grafana" >}}) to view metrics. ### Metrics not shown during shard migration diff --git a/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md b/content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md similarity index 81% rename from content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md rename to content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md index 3c50929277..a7222c6e0f 100644 --- a/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v1-to-v2.md +++ b/content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md @@ -14,8 +14,8 @@ weight: 49 tocEmbedHeaders: true --- -You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics.
+You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. -As of Redis Enterprise Software version 7.8.2, [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/) metrics are available. V1 metrics are deprecated but still available. You can use the following tables to transition from v1 metrics to equivalent v2 PromQL. For a list of all available v2 PromQL metrics, see [Prometheus metrics v2]({{}}). +As of Redis Enterprise Software version 7.8.2, [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/) metrics are available. V1 metrics are deprecated but still available. You can use the following tables to transition from v1 metrics to equivalent v2 PromQL. For a list of all available v2 PromQL metrics, see [Prometheus metrics v2]({{}}). {{}} diff --git a/content/operate/rs/references/metrics/prometheus-metrics-v1.md b/content/operate/rs/references/metrics/prometheus-metrics-v1.md new file mode 100644 index 0000000000..d70e3c6c69 --- /dev/null +++ b/content/operate/rs/references/metrics/prometheus-metrics-v1.md @@ -0,0 +1,21 @@ +--- +Title: Prometheus metrics v1 +alwaysopen: false +categories: +- docs +- integrate +- rs +description: V1 metrics available to Prometheus. +group: observability +linkTitle: Prometheus metrics v1 +summary: You can use Prometheus and Grafana to collect and visualize your Redis Enterprise Software metrics. +type: integration +weight: 48 +tocEmbedHeaders: true +--- + +You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. + +As of Redis Enterprise Software version 7.8.2, v1 metrics are deprecated but still available. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). 
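+ +For example, the v1 `bdb_avg_latency` gauge (average latency of database operations, in seconds) maps to a derived expression over v2 counters rather than to a single v2 metric. A sketch of the equivalent PromQL, following the transition tables:

```promql
# v1 bdb_avg_latency: average operation latency per database, in seconds.
# Derived from v2 counters: accumulated latency divided by completed
# operations, then converted from microseconds to seconds.
sum by (db) (irate(endpoint_acc_latency[1m]))
  / sum by (db) (irate(endpoint_total_started_res[1m]))
  / 1000000
```

As with the v1 metric, this expression returns a value only when the database has traffic.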
+ +{{}} diff --git a/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md b/content/operate/rs/references/metrics/prometheus-metrics-v2.md similarity index 79% rename from content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md rename to content/operate/rs/references/metrics/prometheus-metrics-v2.md index 91d02ba868..be60b615b3 100644 --- a/content/operate/rs/monitoring/metrics_stream_engine/prometheus-metrics-v2.md +++ b/content/operate/rs/references/metrics/prometheus-metrics-v2.md @@ -18,8 +18,8 @@ tocEmbedHeaders: true While the metrics stream engine is in preview, this document provides only a partial list of v2 metrics. More metrics will be added. {{}} -You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. +You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. -The v2 metrics in the following tables are available as of Redis Enterprise Software version 7.8.0. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). +The v2 metrics in the following tables are available as of Redis Enterprise Software version 7.8.0. For help transitioning from v1 metrics to v2 PromQL, see [Prometheus v1 metrics and equivalent v2 PromQL]({{}}). {{}} diff --git a/content/operate/rs/release-notes/rs-7-8-releases/rs-7-8-2-34.md b/content/operate/rs/release-notes/rs-7-8-releases/rs-7-8-2-34.md index 094c9c0203..5c28e62df1 100644 --- a/content/operate/rs/release-notes/rs-7-8-releases/rs-7-8-2-34.md +++ b/content/operate/rs/release-notes/rs-7-8-releases/rs-7-8-2-34.md @@ -69,7 +69,7 @@ This version offers: - Load balancers and automated monitoring tools can use this API to monitor database availability. - - See [Check database availability]({{}}) and the [REST API reference]({{}}) for details. 
+ - See [Check database availability]({{}}) and the [REST API reference]({{}}) for details. - Metrics stream engine preview: From 000a4cfb9bb6886e18c3f0a0b5c175a3b05bd79d Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Tue, 4 Mar 2025 09:51:54 -0500 Subject: [PATCH 3/9] DOC-4800 Fixed v2 metrics_path comment in Prometheus quick start --- content/embeds/rs-prometheus-grafana-quickstart.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/embeds/rs-prometheus-grafana-quickstart.md b/content/embeds/rs-prometheus-grafana-quickstart.md index d7918dc5c9..5a51dbe22b 100644 --- a/content/embeds/rs-prometheus-grafana-quickstart.md +++ b/content/embeds/rs-prometheus-grafana-quickstart.md @@ -61,12 +61,12 @@ We recommend running Prometheus in Docker only for development and testing. - job_name: redis-enterprise scrape_interval: 30s scrape_timeout: 30s - metrics_path: / + metrics_path: / # For v2, use /v2 scheme: https tls_config: insecure_skip_verify: true static_configs: - - targets: [":8070"] # For v2, use [":8070/v2"] + - targets: [":8070"] ``` 1. Set up your Prometheus and Grafana servers. 
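The corrected comment reflects how Prometheus separates these settings: `static_configs` targets accept only `host:port`, while the URL path belongs in `metrics_path`. A minimal sketch of a configuration that scrapes the v1 and v2 endpoints side by side, which supports preparing v2 dashboards before retiring v1 ones (the `<cluster-fqdn>` placeholder and job names are illustrative):

```yml
scrape_configs:
  # v1 endpoint: default metrics path on port 8070.
  - job_name: redis-enterprise-v1
    scheme: https
    metrics_path: /
    tls_config:
      insecure_skip_verify: true   # testing environments only
    static_configs:
      - targets: ["<cluster-fqdn>:8070"]

  # v2 preview endpoint: same target, path set via metrics_path.
  - job_name: redis-enterprise-v2
    scheme: https
    metrics_path: /v2
    tls_config:
      insecure_skip_verify: true   # testing environments only
    static_configs:
      - targets: ["<cluster-fqdn>:8070"]
```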
From 5b0413dc47e5adf5d733e9423f8b3b1febfdc454 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 7 Mar 2025 15:21:14 -0500 Subject: [PATCH 4/9] DOC-4800 Fix RS monitoring relrefs --- content/embeds/rs-observability.md | 2 +- content/embeds/rs-prometheus-grafana-quickstart.md | 2 +- .../7.4.6/re-clusters/connect-prometheus-operator.md | 2 +- .../kubernetes/re-clusters/connect-prometheus-operator.md | 2 +- content/operate/rs/clusters/configure/cluster-settings.md | 2 +- content/operate/rs/clusters/remove-node.md | 2 +- content/operate/rs/databases/configure/_index.md | 4 ++-- content/operate/rs/databases/memory-performance/_index.md | 2 +- .../operate/rs/databases/memory-performance/memory-limit.md | 2 +- .../operate/rs/references/compatibility/commands/server.md | 2 +- content/operate/rs/references/metrics/_index.md | 2 +- content/operate/rs/security/certificates/_index.md | 2 +- 12 files changed, 13 insertions(+), 13 deletions(-) diff --git a/content/embeds/rs-observability.md b/content/embeds/rs-observability.md index 965b92859d..b11eec90d8 100644 --- a/content/embeds/rs-observability.md +++ b/content/embeds/rs-observability.md @@ -35,7 +35,7 @@ In addition to manually monitoring these resources and indicators, it is best pr Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. -The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. See [Monitoring with metrics and alerts]({{}}) for more details. 
+The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. See [Monitoring with metrics and alerts]({{}}) for more details. If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. You can scrape both existing and new endpoints simultaneously, which lets you create advanced dashboards and transition smoothly. diff --git a/content/embeds/rs-prometheus-grafana-quickstart.md b/content/embeds/rs-prometheus-grafana-quickstart.md index 5a51dbe22b..f8472750da 100644 --- a/content/embeds/rs-prometheus-grafana-quickstart.md +++ b/content/embeds/rs-prometheus-grafana-quickstart.md @@ -8,7 +8,7 @@ Metrics are exposed at the cluster, node, database, shard, and proxy levels. - [Grafana](https://grafana.com/) is an open source metrics visualization tool that processes Prometheus data. 
You can use Prometheus and Grafana to: -- Collect and display metrics not available in the [admin console]({{< relref "/operate/rs/references/metrics" >}}) +- Collect and display metrics not available in the admin console - Set up automatic alerts for node or cluster events diff --git a/content/operate/kubernetes/7.4.6/re-clusters/connect-prometheus-operator.md b/content/operate/kubernetes/7.4.6/re-clusters/connect-prometheus-operator.md index 16f28048aa..3866f5d677 100644 --- a/content/operate/kubernetes/7.4.6/re-clusters/connect-prometheus-operator.md +++ b/content/operate/kubernetes/7.4.6/re-clusters/connect-prometheus-operator.md @@ -69,4 +69,4 @@ For more info about configuring the `ServiceMonitor` resource, see the [`Service - [Troubleshooting ServiceMonitor changes](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md) - redis.io/docs - [Metrics in Prometheus]({{< relref "/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions" >}}) - - [Monitoring and metrics]({{< relref "/operate/rs/clusters/monitoring/" >}}) + - [Monitoring and metrics]({{< relref "/operate/rs/monitoring/" >}}) diff --git a/content/operate/kubernetes/re-clusters/connect-prometheus-operator.md b/content/operate/kubernetes/re-clusters/connect-prometheus-operator.md index 4a83151ff0..4306968123 100644 --- a/content/operate/kubernetes/re-clusters/connect-prometheus-operator.md +++ b/content/operate/kubernetes/re-clusters/connect-prometheus-operator.md @@ -68,4 +68,4 @@ For more info about configuring the `ServiceMonitor` resource, see the [`Service - [Troubleshooting ServiceMonitor changes](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md) - redis.io/docs - [Metrics in Prometheus]({{< relref "/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions" >}}) - - [Monitoring and metrics]({{< relref "/operate/rs/clusters/monitoring/" >}}) + - [Monitoring and 
metrics]({{< relref "/operate/rs/monitoring/" >}}) diff --git a/content/operate/rs/clusters/configure/cluster-settings.md b/content/operate/rs/clusters/configure/cluster-settings.md index 8f400ce721..f659d091e6 100644 --- a/content/operate/rs/clusters/configure/cluster-settings.md +++ b/content/operate/rs/clusters/configure/cluster-settings.md @@ -43,7 +43,7 @@ You can change the **Time zone** field to ensure the date, time fields, and log The **Alert Settings** tab lets you configure alerts that are relevant to the entire cluster, such as alerts for cluster utilization, nodes, node utilization, security, and database utilization. -You can also configure email server settings and [send alerts by email]({{< relref "/operate/rs/clusters/monitoring#send-alerts-by-email" >}}) to relevant users. +You can also configure email server settings and [send alerts by email]({{< relref "/operate/rs/monitoring/v1_monitoring#send-alerts-by-email" >}}) to relevant users. ### Configure email server settings diff --git a/content/operate/rs/clusters/remove-node.md b/content/operate/rs/clusters/remove-node.md index 3748e78331..2a42f650eb 100644 --- a/content/operate/rs/clusters/remove-node.md +++ b/content/operate/rs/clusters/remove-node.md @@ -15,7 +15,7 @@ You might want to remove a node from a Redis Enterprise cluster for one of the f - To [replace a faulty node](#replace-a-faulty-node) with a healthy node. - To [replace a healthy node](#replace-a-healthy-node) with a different node. -You can configure [email alerts from the cluster]({{< relref "/operate/rs/clusters/monitoring#cluster-alerts" >}}) to notify you of cluster changes, including when a node is removed. +You can configure [email alerts from the cluster]({{< relref "/operate/rs/monitoring/v1_monitoring#cluster-alerts" >}}) to notify you of cluster changes, including when a node is removed. 
{{}} Read through these explanations thoroughly before taking diff --git a/content/operate/rs/databases/configure/_index.md b/content/operate/rs/databases/configure/_index.md index 914b7f2013..da0c5d4bdc 100644 --- a/content/operate/rs/databases/configure/_index.md +++ b/content/operate/rs/databases/configure/_index.md @@ -179,9 +179,9 @@ You can require [**TLS**]({{< relref "/operate/rs/security/encryption/tls/" >}}) ### Alerts -Select [alerts]({{}}) to show in the database status and configure their thresholds. +Select [alerts]({{}}) to show in the database status and configure their thresholds. -You can also choose to [send alerts by email]({{}}) to relevant users. +You can also choose to [send alerts by email]({{}}) to relevant users. ### Replica Of diff --git a/content/operate/rs/databases/memory-performance/_index.md b/content/operate/rs/databases/memory-performance/_index.md index 8edcd54e1f..c819fc8ed5 100644 --- a/content/operate/rs/databases/memory-performance/_index.md +++ b/content/operate/rs/databases/memory-performance/_index.md @@ -67,7 +67,7 @@ From the Redis Enterprise Software Cluster Manager UI, you can monitor the perfo With the Redis Enterprise Software API, you can also integrate Redis Enterprise metrics into other monitoring environments, such as Prometheus. -For more info about monitoring with Redis Enterprise Software, see [Monitoring with metrics and alerts]({{< relref "/operate/rs/clusters/monitoring/_index.md" >}}), and [Memory statistics]({{< relref "/operate/rs/databases/memory-performance/memory-limit#memory-metrics" >}}). +For more info about monitoring with Redis Enterprise Software, see [Monitoring with metrics and alerts]({{< relref "/operate/rs/monitoring" >}}), and [Memory statistics]({{< relref "/operate/rs/databases/memory-performance/memory-limit#memory-metrics" >}}). 
## Scaling databases diff --git a/content/operate/rs/databases/memory-performance/memory-limit.md b/content/operate/rs/databases/memory-performance/memory-limit.md index 34321425fd..2d539d29e2 100644 --- a/content/operate/rs/databases/memory-performance/memory-limit.md +++ b/content/operate/rs/databases/memory-performance/memory-limit.md @@ -64,7 +64,7 @@ out of memory (OOM) messages. 4. If shards can't free memory, Redis Enterprise relies on the OS processes to stop replicas, but tries to avoid stopping primary shards. -We recommend that you have a [monitoring platform]({{< relref "/operate/rs/clusters/monitoring/" >}}) that alerts you before a system gets low on RAM. +We recommend that you have a [monitoring platform]({{< relref "/operate/rs/monitoring/" >}}) that alerts you before a system gets low on RAM. You must maintain sufficient free memory to make sure that you have a healthy Redis Enterprise installation. ## Memory metrics diff --git a/content/operate/rs/references/compatibility/commands/server.md b/content/operate/rs/references/compatibility/commands/server.md index fa9bada4a6..b47faec196 100644 --- a/content/operate/rs/references/compatibility/commands/server.md +++ b/content/operate/rs/references/compatibility/commands/server.md @@ -83,7 +83,7 @@ Redis Cloud manages modules for you and lets you [enable modules]({{< relref "/o ## Monitoring commands -Although Redis Enterprise does not support certain monitoring commands, you can use the Cluster Manager UI to view Redis Enterprise Software [metrics]({{< relref "/operate/rs/clusters/monitoring" >}}) and [logs]({{< relref "/operate/rs/clusters/logging" >}}) or the Redis Cloud console to view Redis Cloud [metrics]({{< relref "/operate/rc/databases/monitor-performance" >}}) and [logs]({{< relref "/operate/rc/logs-reports/system-logs" >}}). 
+Although Redis Enterprise does not support certain monitoring commands, you can use the Cluster Manager UI to view Redis Enterprise Software [metrics]({{< relref "/operate/rs/monitoring" >}}) and [logs]({{< relref "/operate/rs/clusters/logging" >}}) or the Redis Cloud console to view Redis Cloud [metrics]({{< relref "/operate/rc/databases/monitor-performance" >}}) and [logs]({{< relref "/operate/rc/logs-reports/system-logs" >}}). | Command | Redis
Enterprise | Redis
Cloud | Notes | |:--------|:----------------------|:-----------------|:------| diff --git a/content/operate/rs/references/metrics/_index.md b/content/operate/rs/references/metrics/_index.md index e8fe0d5dfa..e873fdcead 100644 --- a/content/operate/rs/references/metrics/_index.md +++ b/content/operate/rs/references/metrics/_index.md @@ -12,7 +12,7 @@ linkTitle: Metrics weight: $weight --- -In the Redis Enterprise Cluster Manager UI, you can see real-time performance metrics for clusters, nodes, databases, and shards, and configure alerts that send notifications based on alert parameters. Select the **Metrics** tab to view the metrics for each component. For more information, see [Monitoring with metrics and alerts]({{< relref "/operate/rs/clusters/monitoring" >}}). +In the Redis Enterprise Cluster Manager UI, you can see real-time performance metrics for clusters, nodes, databases, and shards, and configure alerts that send notifications based on alert parameters. Select the **Metrics** tab to view the metrics for each component. For more information, see [Monitoring with metrics and alerts]({{< relref "/operate/rs/monitoring" >}}). See the following topics for metrics definitions: - [Database operations]({{< relref "/operate/rs/references/metrics/database-operations" >}}) for database metrics diff --git a/content/operate/rs/security/certificates/_index.md b/content/operate/rs/security/certificates/_index.md index 31b361d34e..f1deb1436a 100644 --- a/content/operate/rs/security/certificates/_index.md +++ b/content/operate/rs/security/certificates/_index.md @@ -20,7 +20,7 @@ Here's the list of self-signed certificates that create secure, encrypted connec | `api` | Encrypts [REST API]({{< relref "/operate/rs/references/rest-api/" >}}) requests and responses. | | `cm` | Secures connections to the Redis Enterprise Cluster Manager UI. | | `ldap_client` | Secures connections between LDAP clients and LDAP servers. 
| -| `metrics_exporter` | Sends Redis Enterprise metrics to external [monitoring tools]({{< relref "/operate/rs/clusters/monitoring/" >}}) over a secure connection. | +| `metrics_exporter` | Sends Redis Enterprise metrics to external [monitoring tools]({{< relref "/operate/rs/monitoring/" >}}) over a secure connection. | | `proxy` | Creates secure, encrypted connections between clients and databases. | | `syncer` | For [Active-Active]({{< relref "/operate/rs/databases/active-active/" >}}) or [Replica Of]({{< relref "/operate/rs/databases/import-export/replica-of/" >}}) databases, encrypts data during the synchronization of participating clusters. | From f8dd88dcb260cf379e25502fc35f82fbdc0447a8 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 7 Mar 2025 16:12:16 -0500 Subject: [PATCH 5/9] DOC-4800 Copy edits and adding/fixing links --- content/embeds/rs-observability.md | 2 +- content/operate/rs/monitoring/_index.md | 9 ++++---- .../rs/monitoring/metrics_stream_engine.md | 23 +++++++++++++++---- .../operate/rs/references/metrics/_index.md | 4 ++-- .../metrics/prometheus-metrics-v1-to-v2.md | 2 +- 5 files changed, 27 insertions(+), 13 deletions(-) diff --git a/content/embeds/rs-observability.md b/content/embeds/rs-observability.md index b11eec90d8..d4b765cdd0 100644 --- a/content/embeds/rs-observability.md +++ b/content/embeds/rs-observability.md @@ -37,7 +37,7 @@ Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream en The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. See [Monitoring with metrics and alerts]({{}}) for more details. -If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. 
You can scrape both existing and new endpoints simultaneously, which lets you create advanced dashboards and transition smoothly. +If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. You can scrape both existing and new endpoints simultaneously, which lets you create advanced dashboards and transition smoothly. ### Memory diff --git a/content/operate/rs/monitoring/_index.md b/content/operate/rs/monitoring/_index.md index 27789758ab..42952af901 100644 --- a/content/operate/rs/monitoring/_index.md +++ b/content/operate/rs/monitoring/_index.md @@ -20,14 +20,13 @@ to monitor the performance of your databases. In the Redis Enterprise Cluster Manager UI, you can view metrics, configure alerts, and send notifications based on alert parameters. You can also access metrics and configure alerts through the REST API. -## Metrics stream engine preview +See [Metrics and alerts for monitoring v1]({{}}) for more information. -Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. -This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. +## Metrics stream engine preview -The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. +A preview of the new metrics stream engine is available as of [Redis Enterprise Software version 7.8.2]({{}}). This new engine exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`, exports all time-series metrics to external monitoring tools, and enables real-time monitoring. -If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. 
It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. +See [Metrics stream engine preview for monitoring v2]({{}}) for more information. ## Integrate with external monitoring tools diff --git a/content/operate/rs/monitoring/metrics_stream_engine.md b/content/operate/rs/monitoring/metrics_stream_engine.md index fa4d5c9f11..8305752ed9 100644 --- a/content/operate/rs/monitoring/metrics_stream_engine.md +++ b/content/operate/rs/monitoring/metrics_stream_engine.md @@ -12,11 +12,26 @@ linkTitle: Metrics stream engine preview for monitoring v2 weight: 60 --- -The latest approach to monitoring Redis Enterprise Software clusters, nodes, databases, and shards no longer includes the internal monitoring systems like the stats API and Cluster Manager metrics and alerts. Instead, you can use the v2 Prometheus scraping endpoint to integrate external monitoring tools such as Prometheus and Grafana +A preview of the new metrics stream engine is available as of [Redis Enterprise Software version 7.8.2]({{}}). -Redis Enterprise version 7.8.2 introduces a preview of the new metrics stream engine that exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. -This new engine exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. +The new metrics stream engine: -The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. +- Exposes the v2 Prometheus scraping endpoint at `https://:8070/v2`. + +- Exports all time-series metrics to external monitoring tools such as Grafana, DataDog, NewRelic, and Dynatrace using Prometheus. 
+ +- Enables real-time monitoring, including full monitoring during maintenance operations, which provides full visibility into performance during events such as shards' failovers and scaling operations. + +## Integrate with external monitoring tools + +To integrate Redis Enterprise metrics into your monitoring environment, see the integration guides for [Prometheus and Grafana]({{< relref "/operate/rs/monitoring/prometheus_and_grafana" >}}). + +Filter [Libraries and tools]({{}}) by "observability" for additional tools and guides. + +## Prometheus metrics v2 + +For a list of all available v2 metrics, see [Prometheus metrics v2]({{}}). + +## Transition from Prometheus v1 to Prometheus v2 If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. diff --git a/content/operate/rs/references/metrics/_index.md b/content/operate/rs/references/metrics/_index.md index e873fdcead..be4cdcabbf 100644 --- a/content/operate/rs/references/metrics/_index.md +++ b/content/operate/rs/references/metrics/_index.md @@ -12,6 +12,8 @@ linkTitle: Metrics weight: $weight --- +## Cluster Manager metrics + In the Redis Enterprise Cluster Manager UI, you can see real-time performance metrics for clusters, nodes, databases, and shards, and configure alerts that send notifications based on alert parameters. Select the **Metrics** tab to view the metrics for each component. For more information, see [Monitoring with metrics and alerts]({{< relref "/operate/rs/monitoring" >}}). 
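The v1-to-v2 PromQL translations in the tables (for example, `sum by (db) (irate(endpoint_acc_latency[1m])) / sum by (db) (irate(endpoint_total_started_res[1m])) / 1000000` for `bdb_avg_latency`) reduce to a ratio of counter deltas in which the scrape interval cancels out. A rough Python sketch of the same arithmetic, assuming `endpoint_acc_latency` accumulates microseconds as the `/ 1000000` in the table implies:

```python
def v1_avg_latency_seconds(acc_latency, started_res,
                           prev_acc_latency, prev_started_res):
    """Approximate bdb_avg_latency (seconds) from two scrapes of the v2
    counters endpoint_acc_latency (assumed microseconds) and
    endpoint_total_started_res.

    The PromQL divides two irates over the same window, so the interval
    cancels and a ratio of raw deltas gives the same value.
    """
    dlat = acc_latency - prev_acc_latency      # accumulated latency delta (us)
    dres = started_res - prev_started_res      # completed-response delta
    if dres <= 0:
        return None                            # no traffic: metric undefined
    return (dlat / dres) / 1_000_000           # microseconds -> seconds
```

As with the PromQL, the result is only defined while there is traffic; with no completed responses the sketch returns `None` rather than a value.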
See the following topics for metrics definitions: @@ -31,8 +33,6 @@ This new engine exports all time-series metrics to external monitoring tools suc The new engine enables real-time monitoring, including full monitoring during maintenance operations, providing full visibility into performance during events such as shards' failovers and scaling operations. -If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. - For a list of available metrics, see the following references: - [Prometheus metrics v1]({{}}) diff --git a/content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md b/content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md index a7222c6e0f..0dc8acfcb1 100644 --- a/content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md +++ b/content/operate/rs/references/metrics/prometheus-metrics-v1-to-v2.md @@ -16,6 +16,6 @@ tocEmbedHeaders: true You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}) to create dashboards for important metrics. -As of Redis Enterprise Software version 7.8.2, [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/) metrics are available. V1 metrics are deprecated but still available. You can use the following tables to transition from v1 metrics to equivalent v2 PromQL. For a list of all available v2 PromQL metrics, see [Prometheus metrics v2]({{}}). +As of Redis Enterprise Software version 7.8.2, [PromQL (Prometheus Query Language)](https://prometheus.io/docs/prometheus/latest/querying/basics/) metrics are available. V1 metrics are deprecated but still available. You can use the following tables to transition from v1 metrics to equivalent v2 PromQL. 
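Both scraping endpoints return the Prometheus text exposition format, so any client-side tooling built during the transition can parse v1 and v2 scrapes the same way. A simplified sketch (it skips `# HELP`/`# TYPE` comments and does not handle escaped or comma-containing label values):

```python
import re

# metric_name{label="value",...} sample_value
SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')

def parse_exposition(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):   # skip comments and blanks
            continue
        m = SAMPLE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            for pair in m.group('labels').split(','):
                key, val = pair.split('=', 1)
                labels[key.strip()] = val.strip().strip('"')
        samples.append((m.group('name'), labels, float(m.group('value'))))
    return samples
```

In practice an off-the-shelf parser (such as the one in the official Prometheus client libraries) is preferable; this sketch only illustrates the shape of the data both endpoints emit.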
For a list of all available v2 metrics, see [Prometheus metrics v2]({{}}). {{}} From b04d0b863f549c597738bc70733336032df56324 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 7 Mar 2025 16:37:40 -0500 Subject: [PATCH 6/9] DOC-4800 More copy edits --- .../rs/monitoring/metrics_stream_engine.md | 2 +- content/operate/rs/monitoring/v1_monitoring.md | 15 ++++++++++----- content/operate/rs/references/metrics/_index.md | 2 +- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/content/operate/rs/monitoring/metrics_stream_engine.md b/content/operate/rs/monitoring/metrics_stream_engine.md index 8305752ed9..e2e179a44e 100644 --- a/content/operate/rs/monitoring/metrics_stream_engine.md +++ b/content/operate/rs/monitoring/metrics_stream_engine.md @@ -6,7 +6,7 @@ categories: - operate - rs - kubernetes -description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases. +description: Preview the new metrics stream engine for monitoring Redis Enterprise Software. hideListLinks: true linkTitle: Metrics stream engine preview for monitoring v2 weight: 60 diff --git a/content/operate/rs/monitoring/v1_monitoring.md b/content/operate/rs/monitoring/v1_monitoring.md index 5f29d3009d..0ab01ab5c2 100644 --- a/content/operate/rs/monitoring/v1_monitoring.md +++ b/content/operate/rs/monitoring/v1_monitoring.md @@ -6,21 +6,21 @@ categories: - operate - rs - kubernetes -description: Use the metrics that measure the performance of your Redis Enterprise Software clusters, nodes, databases, and shards to track the performance of your databases. +description: Monitor Redis Enterprise Software clusters and databases using internal monitoring systems and external monitoring tools. 
hideListLinks: true linkTitle: Monitoring v1 weight: 50 --- -The current approach to monitoring Redis Enterprise Software clusters, nodes, databases, and shards includes: +The current approach to monitoring Redis Enterprise Software includes: - Internal monitoring systems: - - All stats-api + - [Statistics APIs]({{}}), which collect various statistics at regular time intervals for clusters, nodes, databases, shards, and endpoints. - - Cluster Manager metrics and alerts + - Cluster manager metrics and alerts. -- The v1 Prometheus scraping endpoint to integrate with external monitoring tools such as Prometheus and Grafana. +- The v1 Prometheus scraping endpoint to integrate with external monitoring tools such as [Prometheus and Grafana]({{}}). ## Cluster manager metrics @@ -39,6 +39,11 @@ To choose which metrics to display in the two large graphs at the top of the pag We recommend that you show two similar metrics in the top graphs so you can compare them side-by-side. +See the following topics for metrics definitions: +- [Database operations]({{< relref "/operate/rs/references/metrics/database-operations" >}}) for database metrics +- [Resource usage]({{< relref "/operate/rs/references/metrics/resource-usage" >}}) for resource and database usage metrics +- [Auto Tiering]({{< relref "/operate/rs/references/metrics/auto-tiering" >}}) for additional metrics for [Auto Tiering ]({{< relref "/operate/rs/databases/auto-tiering" >}}) databases + ## Cluster alerts In **Cluster > Alert Settings**, you can enable alerts for node or cluster events, such as high memory usage or throughput. 
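For deployments that alert through Prometheus rather than (or in addition to) the Cluster Manager UI, a rule file along these lines mirrors the high-memory-usage alert described above. A sketch only: `node_available_memory` is assumed to be the v1 gauge in bytes, and the 2 GiB threshold and 5-minute hold are illustrative.

```yaml
groups:
  - name: redis-enterprise-memory
    rules:
      - alert: RedisEnterpriseNodeLowMemory
        # assumed v1 gauge (bytes); the 2 GiB threshold is illustrative only
        expr: node_available_memory < 2 * 1024 ^ 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis Enterprise node is low on available memory"
```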
diff --git a/content/operate/rs/references/metrics/_index.md b/content/operate/rs/references/metrics/_index.md index be4cdcabbf..1ccd1ba692 100644 --- a/content/operate/rs/references/metrics/_index.md +++ b/content/operate/rs/references/metrics/_index.md @@ -12,7 +12,7 @@ linkTitle: Metrics weight: $weight --- -## Cluster Manager metrics +## Cluster manager metrics In the Redis Enterprise Cluster Manager UI, you can see real-time performance metrics for clusters, nodes, databases, and shards, and configure alerts that send notifications based on alert parameters. Select the **Metrics** tab to view the metrics for each component. For more information, see [Monitoring with metrics and alerts]({{< relref "/operate/rs/monitoring" >}}). From 14ce781b24c5e6372f76c409686f832b102dfac5 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Mon, 10 Mar 2025 12:34:10 -0400 Subject: [PATCH 7/9] DOC-4800 Feedback updates for monitoring quick start title and shard migration metrics limitation --- content/operate/rs/monitoring/prometheus_and_grafana.md | 4 ++-- content/operate/rs/references/metrics/_index.md | 6 ++++-- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/content/operate/rs/monitoring/prometheus_and_grafana.md b/content/operate/rs/monitoring/prometheus_and_grafana.md index f991def39f..428656fc7b 100644 --- a/content/operate/rs/monitoring/prometheus_and_grafana.md +++ b/content/operate/rs/monitoring/prometheus_and_grafana.md @@ -1,6 +1,6 @@ --- -LinkTitle: Prometheus and Grafana quick start -Title: Prometheus and Grafana with Redis Enterprise Software quick start +LinkTitle: Get started +Title: Get started with monitoring Redis Enterprise Software alwaysopen: false categories: - docs diff --git a/content/operate/rs/references/metrics/_index.md b/content/operate/rs/references/metrics/_index.md index 1ccd1ba692..9b5c0f7f6e 100644 --- a/content/operate/rs/references/metrics/_index.md +++ b/content/operate/rs/references/metrics/_index.md @@ -49,7 +49,7 @@ Metrics 
information is not shown for clusters with more than 128 shards. For lar ### Metrics not shown during shard migration -The following metrics are not measured during [shard migration]({{< relref "/operate/rs/databases/configure/replica-ha" >}}). If you view these metrics while resharding, the graph will be blank. +The following metrics are not measured during [shard migration]({{< relref "/operate/rs/databases/configure/replica-ha" >}}) when using the [internal monitoring systems]({{}}). If you view these metrics while resharding, the graph will be blank. - [Evicted objects/sec]({{< relref "/operate/rs/references/metrics/database-operations#evicted-objectssec" >}}) - [Expired objects/sec]({{< relref "/operate/rs/references/metrics/database-operations#expired-objectssec" >}}) @@ -58,4 +58,6 @@ The following metrics are not measured during [shard migration]({{< relref "/ope - [Total keys]({{< relref "/operate/rs/references/metrics/database-operations#total-keys" >}}) - [Incoming traffic]({{< relref "/operate/rs/references/metrics/resource-usage#incoming-traffic" >}}) - [Outgoing traffic]({{< relref "/operate/rs/references/metrics/resource-usage#outgoing-traffic" >}}) -- [Used memory]({{< relref "/operate/rs/references/metrics/resource-usage#used-memory" >}}) \ No newline at end of file +- [Used memory]({{< relref "/operate/rs/references/metrics/resource-usage#used-memory" >}}) + +This limitation does not apply to the new [metrics stream engine]({{}}). 
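Related to the blank-graph caveat above: counters scraped around shard migration or failover can reset, so any hand-rolled rate computation should be reset-aware, in the same way Prometheus's `rate()`/`irate()` treat a decrease as a counter reset. A sketch:

```python
def counter_delta(prev, cur):
    """Counter delta that tolerates resets (e.g., a shard restarting after
    migration or failover).

    On a reset the counter restarts from zero, so the post-reset value is
    the best available estimate of the delta -- the same convention
    Prometheus rate()/irate() use.
    """
    if prev is None:
        return None            # first sample: no delta yet
    if cur < prev:
        return cur             # decrease => counter reset detected
    return cur - prev
```

Applying this per scrape pair keeps rate panels from going negative across a migration window, though gaps in the v1 metrics themselves (as listed above) still appear as missing data.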
From 13c31f0329f3a611e2ad37fcec6dc24cd7faff7d Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Wed, 19 Mar 2025 11:07:52 -0500 Subject: [PATCH 8/9] DOC-4800 Mentioned node-exporter metrics for v2 scraping endpoint and added link --- .../prometheus-metrics-definitions.md | 2 ++ content/operate/rs/monitoring/metrics_stream_engine.md | 2 ++ content/operate/rs/references/metrics/prometheus-metrics-v2.md | 2 ++ 3 files changed, 6 insertions(+) diff --git a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md index 91d02ba868..7673d7dd36 100644 --- a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md +++ b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md @@ -22,4 +22,6 @@ You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}). +The v2 scraping endpoint also exposes metrics for `node-exporter` version 1.8.1. For more information, see [Common metrics of node-exporter](https://docs.byteplus.com/en/docs/vmp/Common-metrics-of-node-exporter). + {{}} diff --git a/content/operate/rs/monitoring/metrics_stream_engine.md b/content/operate/rs/monitoring/metrics_stream_engine.md index e2e179a44e..a6d5d71062 100644 --- a/content/operate/rs/monitoring/metrics_stream_engine.md +++ b/content/operate/rs/monitoring/metrics_stream_engine.md @@ -32,6 +32,8 @@ Filter [Libraries and tools]({{}}) by "observability" for a For a list of all available v2 metrics, see [Prometheus metrics v2]({{}}). +The v2 scraping endpoint also exposes metrics for `node-exporter` version 1.8.1. For more information, see [Common metrics of node-exporter](https://docs.byteplus.com/en/docs/vmp/Common-metrics-of-node-exporter). 
+ ## Transition from Prometheus v1 to Prometheus v2 If you are already using the existing scraping endpoint for integration, follow [this guide]({{}}) to transition and try the new engine. It is possible to scrape both existing and new endpoints simultaneously, allowing advanced dashboard preparation and a smooth transition. diff --git a/content/operate/rs/references/metrics/prometheus-metrics-v2.md b/content/operate/rs/references/metrics/prometheus-metrics-v2.md index be60b615b3..d85f883ffc 100644 --- a/content/operate/rs/references/metrics/prometheus-metrics-v2.md +++ b/content/operate/rs/references/metrics/prometheus-metrics-v2.md @@ -22,4 +22,6 @@ You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}). +The v2 scraping endpoint also exposes metrics for `node-exporter` version 1.8.1. For more information, see [Common metrics of node-exporter](https://docs.byteplus.com/en/docs/vmp/Common-metrics-of-node-exporter). + {{}} From e9fb0c047907db7594171ce9f3810618f457f783 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Wed, 19 Mar 2025 13:19:54 -0500 Subject: [PATCH 9/9] DOC-4800 Fixed links for node_exporter metrics --- .../prometheus-metrics-definitions.md | 2 +- content/operate/rs/monitoring/metrics_stream_engine.md | 2 +- content/operate/rs/references/metrics/prometheus-metrics-v2.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md index 7673d7dd36..1cf3fb5bb4 100644 --- a/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md +++ b/content/integrate/prometheus-with-redis-enterprise/prometheus-metrics-definitions.md @@ -22,6 +22,6 @@ You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}). -The v2 scraping endpoint also exposes metrics for `node-exporter` version 1.8.1. 
For more information, see [Common metrics of node-exporter](https://docs.byteplus.com/en/docs/vmp/Common-metrics-of-node-exporter). +The v2 scraping endpoint also exposes metrics for `node_exporter` version 1.8.1. For more information, see the [Prometheus node_exporter GitHub repository](https://github.com/prometheus/node_exporter). {{}} diff --git a/content/operate/rs/monitoring/metrics_stream_engine.md b/content/operate/rs/monitoring/metrics_stream_engine.md index a6d5d71062..6a799e229d 100644 --- a/content/operate/rs/monitoring/metrics_stream_engine.md +++ b/content/operate/rs/monitoring/metrics_stream_engine.md @@ -32,7 +32,7 @@ Filter [Libraries and tools]({{}}) by "observability" for a For a list of all available v2 metrics, see [Prometheus metrics v2]({{}}). -The v2 scraping endpoint also exposes metrics for `node-exporter` version 1.8.1. For more information, see [Common metrics of node-exporter](https://docs.byteplus.com/en/docs/vmp/Common-metrics-of-node-exporter). +The v2 scraping endpoint also exposes metrics for `node_exporter` version 1.8.1. For more information, see the [Prometheus node_exporter GitHub repository](https://github.com/prometheus/node_exporter). ## Transition from Prometheus v1 to Prometheus v2 diff --git a/content/operate/rs/references/metrics/prometheus-metrics-v2.md b/content/operate/rs/references/metrics/prometheus-metrics-v2.md index d85f883ffc..5d7c064bec 100644 --- a/content/operate/rs/references/metrics/prometheus-metrics-v2.md +++ b/content/operate/rs/references/metrics/prometheus-metrics-v2.md @@ -22,6 +22,6 @@ You can [integrate Redis Enterprise Software with Prometheus and Grafana]({{}}). -The v2 scraping endpoint also exposes metrics for `node-exporter` version 1.8.1. For more information, see [Common metrics of node-exporter](https://docs.byteplus.com/en/docs/vmp/Common-metrics-of-node-exporter). +The v2 scraping endpoint also exposes metrics for `node_exporter` version 1.8.1. 
For more information, see the [Prometheus node_exporter GitHub repository](https://github.com/prometheus/node_exporter). {{}}
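Since the v2 scraping endpoint mixes Redis Enterprise metrics with `node_exporter` metrics, a rough way for tooling to separate the two is the conventional `node_` prefix that `node_exporter` families carry. A heuristic sketch only — it assumes no Redis Enterprise v2 metric family shares that prefix:

```python
def split_families(metric_names):
    """Partition metric families from a v2 scrape into node_exporter
    metrics (node_* by convention) and everything else (assumed to be
    Redis Enterprise metrics). Heuristic only."""
    node_exporter, redis_enterprise = set(), set()
    for name in metric_names:
        if name.startswith('node_'):
            node_exporter.add(name)
        else:
            redis_enterprise.add(name)
    return node_exporter, redis_enterprise
```

This is handy when building separate dashboards for host-level and database-level metrics from a single scrape job.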