Commit 9ff50d6 — Update observability.md (parent: 42c8c2f)

1 file changed, +33 −33: content/integrate/prometheus-with-redis-enterprise/observability.md

## Core cluster resource monitoring

## Memory

Every Redis Enterprise database has a maximum configured memory limit to ensure isolation
in a multi-database cluster.

Memory usage percentage metric - Percentage of used memory relative to the configured memory limit.

Dashboard displaying high-level cluster metrics - [Cluster Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/cluster_dashboard_v9-11.json)
{{< image filename="/images/playbook_used-memory.png" alt="Dashboard displaying high-level cluster metrics" >}}

### Thresholds

The appropriate memory threshold depends on how the application is using Redis.

* Caching workloads, which permit Redis to evict keys, can safely use 100% of available memory.
* Non-caching workloads do not permit key eviction and should be closely monitored as soon as memory usage reaches 80%.

### Caching workloads

For applications using Redis solely as a cache, you can safely let the memory usage
reach 100% as long as you have an [eviction policy](https://redis.io/blog/cache-eviction-strategies/) in place. This will ensure
that Redis can evict keys without running out of memory.

It's still important to monitor performance. The key performance indicators include:

* Cache hit ratio
* Evicted keys

### Read latency

**Latency** has two important definitions, depending on context:

This may indicate a low cache hit ratio, ultimately caused by insufficient memory.
You need to monitor both application-level and Redis-level latency to diagnose
caching performance issues in production.

### Cache hit ratio and eviction

**Cache hit ratio** is the percentage of read requests that Redis serves successfully.
**Eviction rate** is the rate at which Redis evicts keys from the cache. These metrics
An acceptable rate of key evictions depends on the total number of keys in the database
and the measure of application-level latency. If application latency is high,
check to see that key evictions have not increased.

### Eviction Policies

| Name | Description |
| ------ | :------ |
| volatile-ttl | Removes keys with the expire field set to true and the shortest remaining time-to-live (TTL) value. |

### Eviction policy guidelines

* Use the allkeys-lru policy when you expect a power-law distribution in the popularity of your requests. That is, you expect that a subset of elements will be accessed far more often than the rest. This is a good pick if you are unsure.

The volatile-lru and volatile-random policies are mainly useful when you want to use a single Redis instance for both caching and a set of persistent keys.

**NB** Setting an expire value on a key costs memory, so using a policy like allkeys-lru is more memory efficient, since keys don't need an expire configuration in order to be evicted under memory pressure.

### Non-caching workloads

If no eviction policy is enabled, then Redis will stop accepting writes once memory usage reaches 100%.
Therefore, for non-caching workloads, we recommend that you configure an alert at 80% memory usage.
Once your database reaches this 80% threshold, you should closely review the rate of memory usage growth.

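The 80% rule above can be encoded directly as a Prometheus alert. The following is a minimal sketch, assuming your deployment exports per-database `bdb_used_memory` and `bdb_memory_limit` gauges with a `bdb` label (verify these names against the metrics your cluster actually exposes before relying on the rule):

```yaml
# Sketch of an alerts.yml rule: warn when a database has used more than
# 80% of its configured memory limit for 10 minutes straight.
# bdb_used_memory, bdb_memory_limit, and the bdb label are assumed names;
# check them against your cluster's exported metrics.
groups:
  - name: redis_enterprise_memory
    rules:
      - alert: DatabaseMemoryAbove80Pct
        expr: 100 * (bdb_used_memory / bdb_memory_limit) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Database {{ $labels.bdb }} is above 80% of its memory limit"
```

The `for: 10m` clause keeps short-lived spikes from firing the alert; only sustained growth triggers it.
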
### Troubleshooting

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |

## CPU

Redis Enterprise provides several CPU metrics:

When diagnosing performance issues, start by looking at shard CPU.

Dashboard displaying CPU usage - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_database-cpu-shard.png" alt="Dashboard displaying CPU usage" >}}

### Thresholds

In general, we define high CPU as any CPU utilization above 80% of total capacity.

Dashboard displaying an ensemble of Node CPU usage data - [Node Dashboard](https

Node CPU should also remain below 80% of total capacity. As with the proxy, node CPU varies depending
on the CPU capacity of the node. You will need to calibrate your alerting based on the number of cores in your nodes.

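As an illustration, a node-level CPU rule might look like the sketch below. It assumes a `node_cpu_idle` gauge reporting idle CPU as a fraction of total capacity (an assumed name; substitute whatever node CPU metric your deployment exports):

```yaml
# Hypothetical rule: fire when a node's CPU utilization (1 - idle)
# stays above 80% for 15 minutes. node_cpu_idle and the node label
# are assumed names; adjust to the metrics you actually scrape.
groups:
  - name: redis_enterprise_node_cpu
    rules:
      - alert: NodeCpuAbove80Pct
        expr: (1 - node_cpu_idle) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} CPU utilization above 80%"
```
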
### Troubleshooting

High CPU utilization has multiple possible causes. Common causes include an under-provisioned cluster,
excess inefficient Redis operations, and hot master shards.
| High Node CPU | You will typically detect high shard or proxy CPU utilization before you detect high node CPU utilization. | Use the remediation steps above to address high shard and proxy CPU utilization. If node CPU utilization remains high, consider increasing the number of nodes in the cluster and rebalancing the shards across the new nodes. This is a complex operation and should be done with the help of Redis support. |
| High System CPU | Most of the issues above will reflect user-space CPU utilization. However, if you see high system CPU utilization, this may indicate a problem at the network or storage level. | Review network bytes in and network bytes out to rule out any unexpected spikes in network traffic. You may need to perform some deeper network diagnostics to identify the cause of the high system CPU utilization. For example, with high rates of packet loss, you may need to review network configurations or even the network hardware. |

## Connections

The Redis Enterprise database dashboard indicates the total number of connections to the database.

Based on the number of application instances connecting to Redis (and whether you use connection pooling),
you should have a rough idea of the minimum and maximum number of connections you expect to see for any given database.
This number should remain relatively constant over time.
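Because the expected connection count is roughly constant, a simple band alert can catch both connection leaks and connection storms. A sketch, assuming a per-database `bdb_conns` gauge (hypothetical name) and illustrative bounds of 50 and 5000:

```yaml
# Hypothetical rule: alert when the connection count leaves the band
# expected for this database. bdb_conns and the 50/5000 bounds are
# placeholders; set them from your own application's baseline.
groups:
  - name: redis_enterprise_connections
    rules:
      - alert: DatabaseConnectionsOutOfRange
        expr: bdb_conns < 50 or bdb_conns > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database {{ $labels.bdb }} connection count is outside the expected range"
```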

### Troubleshooting

| Issue | Possible causes | Remediation |
| ------ | ------ | :------ |
Dashboard displaying connections - [Database Dashboard](https://github.com/redis-field-engineering/redis-enterprise-observability/blob/main/grafana/dashboards/grafana_v9-11/software/classic/database_dashboard_v9-11.json)
{{< image filename="/images/playbook_database-used-connections.png" alt="Dashboard displaying connections" >}}

### Network ingress / egress

The network ingress / egress panel shows the amount of data being sent to and received from the database.
Large spikes in network traffic can indicate that the cluster is under-provisioned or that
the application is reading and/or writing unusually large keys. A correlation between high network traffic
and high CPU utilization may indicate a large key scenario.

#### Unbalanced database endpoint

One possible cause is that the database endpoint is not located on the same node as the master shards. In addition to added network latency, if data plane internode encryption is enabled, CPU consumption can increase as well.

One solution is to use the optimal shard placement and proxy policy to ensure endpoints are placed on the same nodes as the master shards.

Extreme network traffic utilization may approach the limits of the underlying network infrastructure.
In this case, the only remediation is to add additional nodes to the cluster and scale the database's shards across them.

## Synchronization

In Redis Enterprise, geographically-distributed synchronization is based on CRDT technology.
The Redis Enterprise implementation of CRDT is called an Active-Active database (formerly known as CRDB).
which is why it's essential to measure request latency at the application, as well as at the database.

Display showing a noticeable spike in latency
{{< image filename="/images/latency_spike.png" alt="Display showing a noticeable spike in latency" >}}

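On the database side, sustained latency spikes can be alerted on directly. A sketch, assuming an average-latency gauge named `bdb_avg_latency` (both the name and its unit are assumptions; calibrate the threshold against your workload's baseline):

```yaml
# Hypothetical rule: flag a sustained rise in average database latency.
# bdb_avg_latency is an assumed metric name with an assumed unit of
# milliseconds; the threshold of 3 is a placeholder to calibrate.
groups:
  - name: redis_enterprise_latency
    rules:
      - alert: DatabaseLatencySpike
        expr: bdb_avg_latency > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database {{ $labels.bdb }} average latency is unusually high"
```
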
### Troubleshooting

Here are some possible causes of high database latency. Note that high database latency is just one possible
cause of high application latency. Application latency can be caused by a variety of factors, including

| Confirm that [slow operations](#slow-operations) are not causing the high CPU utilization. If the high CPU utilization is due to increased load, consider adding shards to the database. |

## Cache hit rate

**Cache hit rate** is the percentage of all read operations that return a response. (Cache hit rate is a composite statistic, computed by dividing the number of read hits by the total number of read operations.)
When an application tries to read a key that exists, this is known as a **cache hit**.
`bdb_read_misses` - The number of read operations returning null
`bdb_write_hits` - The number of write operations against existing keys
`bdb_write_misses` - The number of write operations that create new keys

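These counters make the composite statistic easy to derive in Prometheus. A sketch of a recording rule that follows the `bdb_read_*` naming above (the rule name is illustrative; if your deployment exports cumulative counters rather than per-second rates, wrap each term in `rate(...[5m])` first):

```yaml
# Sketch: cache hit ratio = read hits / total read operations,
# per the definition in this section.
groups:
  - name: redis_enterprise_cache_hit
    rules:
      - record: bdb:cache_hit_ratio
        expr: bdb_read_hits / (bdb_read_hits + bdb_read_misses)
```
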
### Troubleshooting

Cache hit rate is usually only relevant for caching workloads. Eviction will begin once the database approaches its maximum memory capacity.

See [Cache hit ratio and eviction](#cache-hit-ratio-and-eviction)
for tips on troubleshooting cache hit rate.

## Key eviction rate

The **key eviction rate** is the rate at which objects are being evicted from the database.
See [eviction policy](https://redis.io/docs/latest/operate/rs/databases/memory-performance/eviction-policy/) for a discussion of key eviction and its relationship with memory usage.
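
A sketch of how the eviction rate could be watched alongside memory usage, assuming a `bdb_evicted_objects` gauge (hypothetical name) reporting evictions per second:

```yaml
# Hypothetical rule: alert on a sustained burst of key evictions.
# bdb_evicted_objects and the threshold of 1000/sec are placeholders;
# calibrate against your database's total key count and baseline.
groups:
  - name: redis_enterprise_evictions
    rules:
      - alert: HighKeyEvictionRate
        expr: bdb_evicted_objects > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database {{ $labels.bdb }} is evicting keys at a high rate"
```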

There are three data access patterns that can limit the performance of your Redis databases.

This section defines each of these patterns and describes how to diagnose and mitigate them.

## Slow operations

**Slow operations** are operations that take longer than a few milliseconds to complete.

| This likely indicates that the database is underprovisioned. Consider increasing the number of shards and/or nodes. |

## Hot keys

A **hot key** is a key that is accessed extremely frequently (for example, thousands of times a second or more).

Each key in Redis belongs to one, and only one, shard.
For this reason, a hot key can cause high CPU utilization on that one shard,
which can increase latency for all other operations.

### Troubleshooting

You may suspect that you have a hot key if you see high CPU utilization on a single shard.
There are two main ways to identify hot keys: using the Redis CLI and sampling the operations against Redis.

against the high CPU shard. Since this is a potentially high-impact operation, you should
use this technique only as a secondary resort. For mission-critical databases, consider
contacting Redis support for assistance.

### Remediation

Once you discover a hot key, you need to find a way to reduce the number of operations against it.
This means getting an understanding of the application's access pattern and the reasons for such frequent access.
If the hot key operations are read-only, then consider implementing an application-local cache so
that fewer read requests are sent to Redis. For example, even a local cache that expires every 5 seconds
can entirely eliminate a hot key issue.

## Large keys

**Large keys** are keys that are hundreds of kilobytes or larger.
Large keys can cause high network traffic and high CPU utilization.

### Troubleshooting

To identify large keys, you can sample the keyspace using the Redis CLI.

Run `redis-cli --memkeys` against your database to sample the keyspace in real time
and potentially identify the largest keys in your database.

### Remediation

Addressing a large key issue requires understanding why the application is creating large keys in the first place.
As such, it's difficult to provide general advice for solving this issue. Resolution often requires a change
To use these alerts, install [Prometheus Alertmanager](https://prometheus.io/doc
For a comprehensive guide to alerting with Prometheus and Grafana,
see the [Grafana blog post on the subject](https://grafana.com/blog/2020/02/25/step-by-step-guide-to-setting-up-prometheus-alertmanager-with-slack-pagerduty-and-gmail/).

## Configuring Prometheus

To configure Prometheus for alerting, open the `prometheus.yml` configuration file.

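A minimal sketch of the relevant sections, assuming Alertmanager is reachable at `localhost:9093` and the rules live in an `alerts.yml` file next to the config (both values are examples to adapt):

```yaml
# prometheus.yml (fragment): load the alert rules and point Prometheus
# at an Alertmanager instance. Target address and file path are examples.
rule_files:
  - alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
```
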
The following is a list of alerts contained in the `alerts.yml` file. There are two caveats:

- Not all Redis Enterprise deployments export all metrics
- Most metrics only alert if the specified trigger persists for a given duration

## List of alerts

| Description | Trigger |
| ------ | :------ |

There are two additional sets of dashboards for Redis Enterprise software that provide drill-down functionality: the workflow dashboards.

## Software

- [Basic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/basic)
- [Extended](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/extended)
- [Classic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/software/classic)

## Workflow

- [Database](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/workflow/databases)
- [Node](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/workflow/nodes)

## Cloud

- [Basic](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/cloud/basic)
- [Extended](https://github.com/redis-field-engineering/redis-enterprise-observability/tree/main/grafana/dashboards/grafana_v9-11/cloud/extended)