The appropriate memory threshold depends on how the application is using Redis.
* Caching workloads, which permit Redis to evict keys, can safely use 100% of available memory.
* Non-caching workloads do not permit key eviction and should be closely monitored as soon as memory usage reaches 80%.

### Caching workloads

For applications using Redis solely as a cache, you can safely let the memory usage
reach 100% as long as you have an [eviction policy](https://redis.io/blog/cache-eviction-strategies/) in place. This will ensure that Redis can evict keys as needed once memory fills up, rather than failing writes.

Even so, it's still important to monitor performance. The key performance indicators include:
* Cache hit ratio
* Evicted keys

### Read latency

**Latency** has two important definitions, depending on context:

This may indicate a low cache hit ratio, ultimately caused by insufficient memory.

You need to monitor both application-level and Redis-level latency to diagnose
caching performance issues in production.
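
If you want a quick, client-side view of Redis round-trip latency, the Redis CLI has a built-in latency mode. The following is a minimal sketch rather than a procedure from this guide; the hostname and port are placeholders for your own database endpoint, and you would add authentication flags if your database requires them.

```bash
# Continuously sample round-trip latency (in milliseconds) to the database endpoint.
redis-cli -h redis-12000.example.com -p 12000 --latency

# Report min/avg/max latency over successive 15-second windows instead of a running sample.
redis-cli -h redis-12000.example.com -p 12000 --latency-history
```

Because these commands run from a client machine, the numbers include network round-trip time, which is closer to what your application experiences than shard-side metrics alone.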

### Cache hit ratio and eviction

**Cache hit ratio** is the percentage of read requests that Redis serves successfully.
**Eviction rate** is the rate at which Redis evicts keys from the cache. These metrics are closely related.

An acceptable rate of key evictions depends on the total number of keys in the database
and on the measured application-level latency. If application latency is high,
check to see that key evictions have not increased.

### Eviction Policies

| Name | Description |
| ------ | :------ |
| volatile-ttl | Removes keys with an expire field set to true and the shortest remaining time-to-live (TTL) value. |

### Eviction policy guidelines

* Use the allkeys-lru policy when you expect a power-law distribution in the popularity of your requests. That is, you expect a subset of elements will be accessed far more often than the rest. This is a good pick if you are unsure.

The volatile-lru and volatile-random policies are mainly useful when you want to use a single Redis instance for both caching and a set of persistent keys.

**Note:** Setting an expire value on a key costs memory, so a policy like allkeys-lru is more memory efficient, since keys don't need an expire configuration in order to be evicted under memory pressure.
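
As a concrete illustration only (not a configuration taken from this guide): on open source Redis you can set the memory limit and eviction policy at runtime with `CONFIG SET`, while on Redis Enterprise the eviction policy is normally configured per database through the admin console or REST API.

```bash
# Open source Redis: cap memory at 2 GB and evict least-recently-used keys
# across the whole keyspace once that limit is reached.
redis-cli CONFIG SET maxmemory 2gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# Confirm the active policy.
redis-cli CONFIG GET maxmemory-policy
```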

### Non-caching workloads

If no eviction policy is enabled, then Redis will stop accepting writes once memory reaches 100%.
Therefore, for non-caching workloads, we recommend that you configure an alert at 80% memory usage.
Once your database reaches this 80% threshold, you should closely review the rate of memory usage growth.
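
One way to keep an eye on that 80% threshold is to query Prometheus directly. This sketch assumes the Redis Enterprise v1 metric names `bdb_used_memory` and `bdb_memory_limit` and a Prometheus server at `localhost:9090`; adjust both for your deployment.

```bash
# List databases currently using more than 80% of their configured memory limit.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=bdb_used_memory / bdb_memory_limit > 0.8'
```

The same expression can back a Grafana panel or a Prometheus alerting rule.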

### Troubleshooting

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |

## CPU

Redis Enterprise provides several CPU metrics:

When diagnosing performance issues, start by looking at shard CPU.

Dashboard displaying CPU usage - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_database-cpu-shard.png" alt="Dashboard displaying CPU usage" >}}

### Thresholds

In general, we define high CPU as any CPU utilization above 80% of total capacity.

Dashboard displaying an ensemble of Node CPU usage data - Node Dashboard

Node CPU should also remain below 80% of total capacity. As with the proxy, the node CPU is variable depending
on the CPU capacity of the node. You will need to calibrate your alerting based on the number of cores in your nodes.

### Troubleshooting

High CPU utilization has multiple possible causes. Common causes include an under-provisioned cluster,
an excess of inefficient Redis operations, and hot master shards.

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |
| High node CPU | You will typically detect high shard or proxy CPU utilization before you detect high node CPU utilization. | Use the remediation steps above to address high shard and proxy CPU utilization. If node CPU utilization is still high, you may need to increase the number of nodes in the cluster and rebalance the shards across the new nodes. This is a complex operation and should be done with the help of Redis support. |
| High system CPU | Most of the issues above will reflect user-space CPU utilization. However, if you see high system CPU utilization, this may indicate a problem at the network or storage level. | Review network bytes in and network bytes out to rule out any unexpected spikes in network traffic. You may need to perform deeper network diagnostics to identify the cause of the high system CPU utilization. For example, with high rates of packet loss, you may need to review network configurations or even the network hardware. |

## Connections

The Redis Enterprise database dashboard indicates the total number of connections to the database.

Based on the number of application instances connecting to Redis (and whether your application uses connection pooling),
you should have a rough idea of the minimum and maximum number of connections you expect to see for any given database.
This number should remain relatively constant over time.
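
To check the current connection count against the range you expect, you can query the exported connection metric. This is a sketch; it assumes the v1 metric `bdb_conns` and a local Prometheus, and the threshold shown is purely illustrative.

```bash
# Current number of client connections per database.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=bdb_conns'

# Flag databases whose connection count exceeds an illustrative ceiling.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=bdb_conns > 5000'
```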

### Troubleshooting

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |

The network ingress/egress panels show the amount of data being sent to and received from the database.
Large spikes in network traffic can indicate that the cluster is under-provisioned or that
the application is reading and/or writing unusually large keys. A correlation between high network traffic
and high CPU utilization may indicate a large key scenario.

#### Unbalanced database endpoint

One possible cause is that the database endpoint is not located on the same node as master shards. In addition to added network latency, if data plane internode encryption is enabled, CPU consumption can increase as well.

One solution is to use the optimal shard placement and proxy policy to ensure that database endpoints are located on the same nodes as their master shards.

Extreme network traffic utilization may approach the limits of the underlying network infrastructure.
In this case, the only remediation is to add additional nodes to the cluster and scale the database's shards across them.

## Synchronization

In Redis Enterprise, geographically-distributed synchronization is based on CRDT technology.
The Redis Enterprise implementation of CRDT is called an Active-Active database (formerly known as CRDB).

Because Redis-side metrics don't capture the full request path, it's essential to measure request latency at the application as well.

Display showing a noticeable spike in latency
{{< image filename="/images/latency_spike.png" alt="Display showing a noticeable spike in latency" >}}

### Troubleshooting

Here are some possible causes of high database latency. Note that high database latency is just one possible
cause of high application latency. Application latency can also be caused by a variety of factors outside of Redis itself.

Confirm that [slow operations](#slow-operations) are not causing the high CPU utilization.
If the high CPU utilization is due to increased load, consider adding shards to the database.

## Cache hit rate

**Cache hit rate** is the percentage of all read operations that successfully return a value. It's a composite statistic, computed by dividing the number of read hits by the total number of read operations.
When an application tries to read a key that exists, this is known as a **cache hit**.

* `bdb_read_misses` - The number of read operations returning null
* `bdb_write_hits` - The number of write operations against existing keys
* `bdb_write_misses` - The number of write operations that create new keys
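
Putting these metrics together, the cache hit ratio is read hits divided by total reads. The PromQL sketch below assumes these v1 metrics (including `bdb_read_hits`, the counterpart to `bdb_read_misses`) are exported as per-second rates, so no additional `rate()` wrapping is applied; verify the metric names against your own deployment.

```bash
# Cache hit ratio per database: read hits / (read hits + read misses).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=bdb_read_hits / (bdb_read_hits + bdb_read_misses)'
```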

### Troubleshooting

Cache hit rate is usually only relevant for caching workloads. Eviction begins once the database approaches its maximum memory capacity.

The cache hit rate can drop if the rate of necessary key evictions exceeds the rate of new key insertions.

See [Cache hit ratio and eviction](#cache-hit-ratio-and-eviction)
for tips on troubleshooting cache hit rate.

## Key eviction rate

The **key eviction rate** is the rate at which objects are evicted from the database.
See [eviction policy](https://redis.io/docs/latest/operate/rs/databases/memory-performance/eviction-policy/) for a discussion of key eviction and its relationship with memory usage.
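
To watch the eviction rate alongside memory usage, you can query the exported eviction metric. This is a sketch under the assumption that the v1 metric `bdb_evicted_objects` (evicted keys per second) is available in your deployment.

```bash
# Current key eviction rate per database (keys evicted per second).
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=bdb_evicted_objects'
```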

There are three data access patterns that can limit the performance of your Redis database.
This section defines each of these patterns and describes how to diagnose and mitigate them.

## Slow operations

**Slow operations** are operations that take longer than a few milliseconds to complete.
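
One built-in way to surface slow operations is the Redis slow log. The commands below are a sketch rather than this guide's procedure: the 10-millisecond threshold is an assumption you should tune to your own latency budget, and on Redis Enterprise you would typically run them against the database endpoint.

```bash
# Log any command that takes longer than 10 ms (10,000 microseconds) to execute.
redis-cli CONFIG SET slowlog-log-slower-than 10000

# Show the ten most recent slow log entries (id, timestamp, duration, command).
redis-cli SLOWLOG GET 10

# Clear the slow log after you have reviewed it.
redis-cli SLOWLOG RESET
```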

Consistently slow operations likely indicate that the database is underprovisioned. Consider increasing the number of shards and/or nodes.

## Hot keys

A **hot key** is a key that is accessed extremely frequently (for example, thousands of times a second or more).
Each key in Redis belongs to one, and only one, shard.
For this reason, a hot key can cause high CPU utilization on that one shard,
which can increase latency for all other operations.

### Troubleshooting

You may suspect that you have a hot key if you see high CPU utilization on a single shard.
There are two main ways to identify hot keys: using the Redis CLI and sampling the operations against Redis.

The sampling technique runs against the shard with high CPU utilization. Since this is a potentially high-impact operation, you should
use this technique only as a secondary approach. For mission-critical databases, consider
contacting Redis support for assistance.
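
For the Redis CLI approach, one option is the CLI's built-in hot-key scan. This is a sketch with assumptions: it relies on `OBJECT FREQ`, so it only works when the database uses an LFU eviction policy (for example `allkeys-lfu`), and the host and port are placeholders.

```bash
# Scan the keyspace and report the most frequently accessed keys.
# Requires an LFU maxmemory policy so that access frequencies are tracked.
redis-cli -h redis-12000.example.com -p 12000 --hotkeys
```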

### Remediation

Once you discover a hot key, you need to find a way to reduce the number of operations against it.
This means understanding the application's access pattern and the reasons for such frequent access.

If the hot key operations are read-only, then consider implementing an application-level cache so
that fewer read requests are sent to Redis. For example, even a local cache that expires every 5 seconds
can entirely eliminate a hot key issue.

## Large keys

**Large keys** are keys that are hundreds of kilobytes or larger.
High network traffic and high CPU utilization can be caused by large keys.

### Troubleshooting

To identify large keys, you can sample the keyspace using the Redis CLI.

Run `redis-cli --memkeys` against your database to sample the keyspace in real time
and potentially identify the largest keys in your database.
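
A minimal sketch of that workflow is shown below; the connection details and sample counts are placeholders, and `--bigkeys` is included as a lighter-weight alternative that reports the largest key of each data type rather than estimated memory usage.

```bash
# Estimate per-key memory usage by sampling the keyspace with MEMORY USAGE.
redis-cli -h redis-12000.example.com -p 12000 --memkeys

# Sample more elements per key for a more accurate estimate.
redis-cli -h redis-12000.example.com -p 12000 --memkeys-samples 100

# Alternative: report the biggest key of each type using type-specific size commands.
redis-cli -h redis-12000.example.com -p 12000 --bigkeys
```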

### Remediation

Addressing a large key issue requires understanding why the application is creating large keys in the first place.
As such, it's difficult to provide general advice for solving this issue. Resolution often requires a change on the application side.

To use these alerts, install Prometheus Alertmanager.
For a comprehensive guide to alerting with Prometheus and Grafana,
see the [Grafana blog post on the subject](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/).

## Configuring Prometheus

To configure Prometheus for alerting, open the `prometheus.yml` configuration file.

The following is a list of alerts contained in the `alerts.yml` file. There are two caveats:
- Not all Redis Enterprise deployments export all metrics
- Most metrics only alert if the specified trigger persists for a given duration
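
Before relying on these alerts, it's worth validating the rules file and reloading Prometheus. This is a hedged sketch: the file paths are assumptions, and the reload endpoint only works when Prometheus is started with `--web.enable-lifecycle`.

```bash
# Check the alert rules file for syntax errors before loading it.
promtool check rules alerts.yml

# Validate the main configuration, including its rule_files references.
promtool check config prometheus.yml

# Ask a running Prometheus server to reload its configuration.
curl -X POST http://localhost:9090/-/reload
```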

## List of alerts

| Description | Trigger |
| ------ | :------ |
There are two additional sets of dashboards for Redis Enterprise software that provide drill-down functionality: the workflow dashboards.
0 commit comments