Commit 6a4e62f

Added monitoring best practice docs (#5295)
* Added monitoring best practice docs
* Fixed grammar in monitoring best practices page
1 parent e86fd75 commit 6a4e62f

14 files changed: 233 additions and 0 deletions
@@ -0,0 +1,233 @@
---
title: "Best Practices"
order: 4
---
5+
6+
# Monitoring Best Practices
7+
8+
## Background
9+
10+
When monitoring the health of a KurrentDB cluster, one should investigate and alert on multiple factors. This page discusses each of them in detail.
11+
12+
## Metrics-based Monitoring
13+
14+
The items in this section can be monitored using metrics from KurrentDB's Prometheus endpoint / Grafana dashboard, or via metrics from the operating system / machine / cloud provider.
15+
16+
### IOPS
17+
18+
One should **monitor IOPS usage** to ensure it does not exceed 80% of the provisioned allocation. This should take place at the operating system or machine level.

One should also **evaluate IOPS bursts** during extremely heavy periods (start of day / week, etc.) to ensure they do not cause exhaustion. This should take place at the operating system or machine level.

Finally, **monitoring reader queue lengths** helps organizations detect IOPS exhaustion: when IOPS are exhausted, these queues keep growing because the server never catches up with incoming read requests. This should take place using Kurrent's Grafana Dashboard.
23+
24+
![Reader Queue Lengths](./images/1-reader-queue-lengths.png)
25+
26+
**At the first signs of IOPS exhaustion**, customers are advised to increase their IOPS limits.
27+
28+
### Memory Utilization
29+
30+
As a database, KurrentDB seeks to use memory efficiently for improved processing. Organizations should perform a memory capacity confirmation test to establish baseline utilization, and should then monitor for deviations from this baseline.

Further, monitoring at the operating system level to ensure that **memory utilization does not exceed 85% of physical memory** helps mitigate allocation exceptions.
33+
34+
### Garbage Collection Pauses
35+
36+
Garbage collection monitoring is largely concerned with gen2 memory, where longer-lived objects are allocated. The length of **application pauses for compacting garbage collection** of this generation should be monitored using the Kurrent Grafana Dashboard. Steadily increasing durations may eventually cause a leader election, as the database is unresponsive to heartbeats during compacting garbage collections. Alert when this metric approaches the configured Heartbeat Timeout value (the default is 10 seconds, so an alert threshold of 8 seconds should be appropriate for most customers).
37+
38+
![Garbage Collection Pauses](./images/2-garbage-collection-pauses.png)
39+
40+
**To mitigate the impact of compacting garbage collection**, KurrentDB 25.1 and above automatically use the ServerGC algorithm. If you are using a version of KurrentDB below 25.1, ServerGC can be enabled with the following environment variables:
41+
42+
* DOTNET\_gcServer set to 1
* DOTNET\_GCHeapHardLimitPercent set to 3C (hexadecimal for 60)
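
The heap limit in the list above is written in hexadecimal, which is easy to get wrong when choosing a different percentage. The following Python sketch converts between the two forms. The helper names are hypothetical and are not part of KurrentDB or .NET.

```python
def heap_limit_to_hex(percent: int) -> str:
    """Return the uppercase hexadecimal string for a heap limit percentage."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return format(percent, "X")


def hex_to_heap_limit(value: str) -> int:
    """Return the decimal percentage encoded by a hexadecimal string."""
    return int(value, 16)


print(heap_limit_to_hex(60))    # 3C
print(hex_to_heap_limit("3C"))  # 60
```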
44+
45+
On startup, KurrentDB will log that it is using ServerGC:
46+
47+
```
[64940, 1,07:15:04.489,INF] EventStore GC: 3 GENERATIONS IsServerGC: True Latency Mode: Interactive
```
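
A startup check can confirm the setting took effect by scanning the log for this line. This is a minimal sketch that assumes the plain-text log format shown above. The function name is hypothetical.

```python
import re

# Matches the GC summary that the sample startup line contains.
GC_LINE = re.compile(r"IsServerGC:\s*(True|False)")


def server_gc_enabled(log_lines):
    """Return True or False from the first GC startup line, or None if absent."""
    for line in log_lines:
        match = GC_LINE.search(line)
        if match:
            return match.group(1) == "True"
    return None
```

A result of `None` means the startup line was not found in the lines supplied, which may simply mean the wrong log file was read.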
50+
51+
### CPU Utilization
52+
53+
To avoid thrashing, monitor that **sustained CPU utilization remains below 80%**. This can be done at the operating system level, or on the Kurrent Grafana Dashboard.
54+
55+
![CPU Utilization](./images/3-cpu-utilization.png)
56+
57+
### Disk Utilization
58+
59+
Kurrent recommends that organizations configure separate disk locations for logs, data, and indexes so that none of them affects the others. Monitoring of these spaces should be at the operating system level. Ensure that **log and data disk utilizations are under 90%**. **Index disk utilization should be under 40%**, as additional disk space is required when performing index merges.
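
The thresholds above can be checked with a short script. This is a sketch using Python's standard `shutil.disk_usage`. The `LIMITS` table mirrors the percentages in the text, while the function names and the example directory paths are hypothetical and depend on how your node is configured.

```python
import shutil

# Alert thresholds (percent used) taken from the guidance above.
LIMITS = {"logs": 90.0, "data": 90.0, "index": 40.0}


def utilization_percent(used: int, total: int) -> float:
    """Return the percentage of the disk that is in use."""
    return 100.0 * used / total


def over_limit(kind: str, used: int, total: int) -> bool:
    """Return True when utilization has reached the limit for this kind of disk."""
    return utilization_percent(used, total) >= LIMITS[kind]


def check_paths(paths: dict) -> dict:
    """Map each kind ('logs', 'data', 'index') to True if its disk is over the limit."""
    results = {}
    for kind, path in paths.items():
        usage = shutil.disk_usage(path)
        results[kind] = over_limit(kind, usage.used, usage.total)
    return results


# Example call with hypothetical mount points:
# check_paths({"logs": "/var/log/kurrentdb", "data": "/data/db", "index": "/data/index"})
```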
60+
61+
### Projection Progress
62+
63+
Organizations should monitor the Kurrent Grafana Dashboard to ensure that the **Projection Progress is close to or at 100%** for each projection. This ensures that projections are not falling behind and are keeping pace with appends to the database.
64+
65+
![Projection Progress](./images/4-projection-progress.png)
66+
67+
If your **Projection Progress is decreasing, contact Kurrent Support** for analysis and recommendations to mitigate.

NOTE: On large databases, this metric can display 100% while the projection is in fact far behind, because the percentage comes from dividing very large numbers and is shown with a limited number of significant digits.
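
The rounding effect described in the note can be illustrated with plain arithmetic. The numbers below are made up, and the sketch does not claim to reproduce how KurrentDB computes the metric. It only shows why a rounded percentage hides a large absolute gap.

```python
def progress_percent(processed: int, total: int) -> float:
    """Return progress as a percentage of the total."""
    return 100.0 * processed / total


def events_behind(processed: int, total: int) -> int:
    """Return the absolute gap that the percentage may hide."""
    return total - processed


total = 50_000_000_000            # hypothetical log size
processed = total - 2_000_000     # the projection is two million events behind

# Rounded to one decimal place, the gap disappears from the percentage.
print(round(progress_percent(processed, total), 1))  # 100.0
print(events_behind(processed, total))               # 2000000
```

Where the absolute position values are available, alerting on the gap rather than the percentage avoids this blind spot.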
70+
71+
### Stopped Projections
72+
73+
**Stopped projections do not execute**, and should be monitored to ensure all components of your database are operational. Use the Kurrent Grafana Dashboard to see which projections are stopped, investigate the cause, and resolve it.
74+
75+
![Stopped Projection](./images/5-stopped-projection.png)
76+
77+
### Persistent Subscription Lag
78+
79+
To ensure timely delivery of events to subscribers, monitor Persistent Subscription Lag through the Kurrent Grafana Dashboard. **Persistent Subscription Lag should be as close to 0 as possible** to ensure that persistent subscriptions are caught up and checkpointed.
80+
81+
![Persistent Subscription Lag](./images/6-persistent-subscription-lag.png)
82+
83+
### Queue Times
84+
85+
Queue Times tell an organization how long an item waits before it is processed. There are several queues, including Reader, Worker, Projection, Subscription, and more. Queue times should be monitored using the Kurrent Grafana Dashboard, ideally maintaining **steady queue times with no large spikes or upward trend**.
86+
87+
![Queue Times](./images/7-queue-times.png)
88+
89+
If your queue times are increasing, it may be a sign that your hardware is undersized.
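
One way to turn "no upward trend" into an alert condition is to fit a straight line through recent samples and look at its slope. This is a sketch using an ordinary least-squares slope over evenly spaced samples. The function names and the `min_slope` threshold are hypothetical and would need tuning against your own baseline.

```python
def trend_slope(samples):
    """Return the least-squares slope of evenly spaced samples (units per sample)."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_trending_up(samples, min_slope):
    """Return True when queue times grow faster than min_slope per sample."""
    return trend_slope(samples) > min_slope
```

The same check applies to other metrics in this document that should not grow over time, such as replication lag.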
90+
91+
### Cache Hit Ratio
92+
93+
KurrentDB maintains internal caches of stream names to speed up reads and writes. Use the Kurrent Grafana Dashboard to monitor **Cache Hit Ratio, and aim for a value of 80% or above** to ensure the right stream pointers are kept in memory for streams currently being read / written.
94+
95+
![Cache Hit Ratio](./images/8-cache-hit-ratio.png)
96+
97+
If your Cache Hit Ratio is below 80%, or declining, consider increasing the **StreamInfoCacheCapacity** configuration parameter to keep more streams in memory. Be aware that this can increase memory usage and GC pauses.
98+
99+
### Bytes Read and Bytes Written
100+
101+
**Bytes Read and Bytes Written metrics should be relatively even during regular processing**, with few spikes. Spikes indicate irregular load, which may point to upstream or downstream application issues and can cause uneven performance. NOTE: These spikes may also be innocuous and part of regular changes in business process load.
102+
103+
![Bytes Read and Bytes Written](./images/9-bytes-read-and-written.png)
104+
105+
### Persistent Subscription Parked Messages
106+
107+
The Kurrent Grafana Dashboard reports the number of **Persistent Subscription Parked Messages, which may indicate processing or logic errors** in persistent subscribers. This value should be as close to 0 as possible.
108+
109+
![Persistent Subscription Parked Messages](./images/10-persistent-subscription-parked-messages.png)
110+
111+
NOTE: These are events that cannot be handled by the application's logic, and should be reported to the team responsible for application development for appropriate handling and resolution. These events can be replayed or deleted through the Kurrent Web UI or API, if required.
112+
113+
### Node Status
114+
115+
Each cluster needs a Leader and Followers. The Kurrent Grafana Summary dashboard can show, at a glance, the status of each node in the cluster. Organizations should monitor to **make sure there is always one leader and two followers available** (for a 3-node cluster). The cluster will function correctly with only two nodes, but the situation should be rectified, because with only two nodes available a further failure would cause an outage.
116+
117+
![Node Status](./images/11-node-status.png)
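
The rule above reduces to a simple count over the node states. The sketch below assumes the states are available as a list of strings named `Leader` and `Follower`. How you obtain that list (dashboard, metrics, or gossip) is outside the sketch, and the function name is hypothetical.

```python
from collections import Counter


def cluster_is_healthy(states, cluster_size=3):
    """Return True when there is exactly one leader and every other node is a follower."""
    counts = Counter(states)
    return counts["Leader"] == 1 and counts["Follower"] == cluster_size - 1


print(cluster_is_healthy(["Leader", "Follower", "Follower"]))  # True
print(cluster_is_healthy(["Leader", "Follower"]))              # False
```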
118+
119+
### Replication Lag
120+
121+
In a cluster, events are always appended to the leader, and replicated to the follower nodes. **Replication Lag between the leader and followers should be close to 0 bytes**, and not increasing over time. Organizations should monitor the Kurrent Grafana dashboard Replication Lag and alert if this value is increasing as it is a sign that a follower is unable to keep up with the leader. When a node is restored from a backup, it will have a replication lag while it catches up. The lag should diminish to 0 over time.
122+
123+
![Replication Lag](./images/12-replication-lag.png)
124+
125+
### Failed gRPC Calls
126+
127+
Failed gRPC calls can be monitored on the Kurrent Grafana dashboard, and indicate that a connection or database operation failed. **Ideally, there are 0 failed gRPC operations of any kind**. Failures can indicate a range of conditions, such as network issues or client issues.
128+
129+
![Failed gRPC Calls](./images/13-failed-grpc-calls.png)
130+
131+
## Log-based Monitoring
132+
133+
These items may appear in the KurrentDB log files.
134+
135+
### Projection State Size
136+
137+
The **maximum projection state size is 16 MB**. Projection states exceeding this size will fail to checkpoint and enter a faulted state. Once the projection state reaches 8 MB in size \- 50% of the limit \- the server will begin logging messages such as the following:
138+
139+
```
"messageTmpl": "Checkpoint size for the Projection {projectionName} is greater than 8 MB. Checkpoint size for a projection should be less than 16 MB. Current checkpoint size for Projection {projectionName} is {stateSize} MB."
```
142+
143+
Customers should alert on this log message, and reduce the size of the Projection state as quickly as possible to avoid exceeding the maximum. In newer versions of KurrentDB this can be monitored in the metrics \- see later in this document.
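
An alert on this message can match the template text instead of the rendered sentence. The sketch below assumes each log entry is one JSON object per line, with a `messageTmpl` field as in the sample, and that the placeholder names `projectionName` and `stateSize` also appear as fields of the entry with `stateSize` in MB. Those field names are inferred from the template and are not confirmed by the source.

```python
import json

TEMPLATE_PREFIX = "Checkpoint size for the Projection"
MAX_STATE_MB = 16.0  # maximum projection state size stated above


def state_size_warning(log_line):
    """Return (projection, size_mb, headroom_mb) for a size warning, else None."""
    try:
        event = json.loads(log_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    if not str(event.get("messageTmpl", "")).startswith(TEMPLATE_PREFIX):
        return None
    size = float(event.get("stateSize", 0))  # assumed field, in MB
    return event.get("projectionName"), size, MAX_STATE_MB - size
```

The headroom value gives a rough sense of urgency: the closer it gets to 0, the sooner the projection will fail to checkpoint.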
144+
145+
### Long Index Merge
146+
147+
While not directly a cause for concern, organizations may wish to **monitor the duration of index merges to ensure their regular maintenance scripts are completing during scheduled windows**, and are not impacting performance of their solutions. Index merge times are logged with messages such as the following:
148+
149+
```
"PTables merge finished in 15:32:13.5846974 ([128000257, 128000194, 128000135, 128000181, 128000211, 46303330995] entries merged into 46943331973)."
```
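
To compare merge durations against a maintenance window, the duration has to be parsed from the message. This is a sketch assuming the duration uses the `[d.]hh:mm:ss[.fraction]` layout seen in the samples in this document. The function names are hypothetical.

```python
import re
from datetime import timedelta

# Optional days, then hours:minutes:seconds, then an optional fraction of a second.
TIMESPAN = re.compile(
    r"(?:(?P<days>\d+)\.)?(?P<hours>\d{1,2}):(?P<minutes>\d{2}):(?P<seconds>\d{2})"
    r"(?:\.(?P<fraction>\d+))?"
)
MERGE_LINE = re.compile(r"PTables merge finished in (\S+)")


def parse_timespan(text):
    """Convert a duration such as '15:32:13.5846974' into a timedelta."""
    match = TIMESPAN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    fraction = float("0." + match.group("fraction")) if match.group("fraction") else 0.0
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("hours")),
        minutes=int(match.group("minutes")),
        seconds=int(match.group("seconds")) + fraction,
    )


def merge_duration(log_line):
    """Return the merge duration found in a log line, or None if there is none."""
    match = MERGE_LINE.search(log_line)
    return parse_timespan(match.group(1)) if match else None
```

A check such as `merge_duration(line) > timedelta(hours=8)` then flags merges that overrun an eight-hour window. The window length here is only an example.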
152+
153+
### Cluster Node Version Mismatch
154+
155+
Except during rolling upgrades, **cluster nodes should be running the exact same version of KurrentDB**. If they are not, the servers will log a warning about this state, which should be corrected. Messages such as the following are logged:
156+
157+
```json
{
  "@t": "2024-04-08T12:15:53.1465960+01:00",
  "@mt": "MULTIPLE ES VERSIONS ON CLUSTER NODES FOUND [ (Unspecified/node0.acme.com:2112,24.2.0), (Unspecified/node4.acme.com:2112,23.10.0.0), (Unspecified/node3.acme.com:2112,23.10.0.0), (Unspecified/node2.acme.com:2112,23.10.0.0), (Unspecified/node1.acme.com:2112,23.10.0.0) ]",
  "@l": "Warning",
  "@i": 498346887,
  "SourceContext": "EventStore.Core.Services.Gossip.ClusterMultipleVersionsLogger",
  "ProcessId": 77265,
  "ThreadId": 17
}
```
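
When this warning fires, it helps to see which nodes are out of step. This is a sketch that extracts the `(node,version)` pairs from the `@mt` text, assuming the message keeps the layout shown in the sample. The function names are hypothetical.

```python
import re

# Each entry looks like "(Unspecified/node0.acme.com:2112,24.2.0)".
ENTRY = re.compile(r"\(([^(),]+),([^(),\s]+)\)")


def versions_by_node(message):
    """Return a dict mapping each node address to the version it reported."""
    return {node: version for node, version in ENTRY.findall(message)}


def has_version_mismatch(message):
    """Return True when the message lists more than one distinct version."""
    return len(set(versions_by_node(message).values())) > 1
```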
168+
169+
### Certificate Expiration Warnings
170+
171+
**When SSL Certificates are expiring soon, cluster nodes will log a warning**. Organizations should alert on this warning to ensure that certificates are renewed before expiration, avoiding communication failures and cluster-down scenarios. Messages such as the following are logged:
172+
173+
```json
{
  "@t": "2024-10-13T02:43:38.9403418+00:00",
  "@mt": "Certificates are going to expire in {daysUntilExpiry:N1} days",
  "@r": [
    "9.6"
  ],
  "@l": "Warning",
  "@i": 40508766706,
  "daysUntilExpiry": 9.602558584001157,
  "SourceContext": "EventStore.Core.ClusterVNode",
  "ProcessId": 88375,
  "ThreadId": 21
}
```
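
Because the entry carries `daysUntilExpiry` as a number, an alert can grade its urgency instead of only matching the text. This is a sketch assuming one JSON object per log line, as in the sample. The 30-day and 7-day thresholds and the function names are hypothetical choices, not Kurrent recommendations.

```python
import json


def days_until_expiry(log_line):
    """Return the days remaining from a certificate warning, or None for other lines."""
    event = json.loads(log_line)
    if "Certificates are going to expire" not in event.get("@mt", ""):
        return None
    return float(event["daysUntilExpiry"])


def expiry_severity(days, warn_days=30.0, critical_days=7.0):
    """Grade the remaining time as 'critical', 'warning', or 'ok'."""
    if days <= critical_days:
        return "critical"
    if days <= warn_days:
        return "warning"
    return "ok"
```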
188+
189+
## General Health Tips
190+
191+
Below are some general health tips.
192+
193+
### Scavenge Regularly
194+
195+
Scavenging removes deleted events and streams, and should be done regularly. However, not scavenging, or a scavenge that removes little data, is not in itself an indicator of poor cluster health. Log entries report how much space was reclaimed by scavenging, and the reported value can in fact be negative if the scavenge did not remove a large number of events.
196+
197+
```json
{
  "@host": "cbcns65o0aek8srtnlsg-1.mesdb.eventstore.cloud",
  "@i": 211285875,
  "ProcessId": 3870715,
  "ScavengeId": "f8f18dd4-2db7-4785-ac30-03c5fa30c8d7",
  "SourceContext": "EventStore.Core.Services.Storage.StorageScavenger",
  "ThreadId": 54,
  "elapsed": "40.08:40:28.6613320",
  "messageTmpl": "SCAVENGING: Scavenge Completed. Total time taken: {elapsed}. Total space saved: {spaceSaved}.",
  "spaceSaved": -11710416
}
```
210+
211+
### Install Patch Versions
212+
213+
Kurrent releases patch versions from time to time that contain important fixes. It is strongly recommended to keep up to date with the most recent patch version.
214+
215+
### Few Errors / Warnings
216+
217+
During normal operation, KurrentDB produces few errors or warnings. Logs should generally be clean. Customers may wish to monitor for spikes in error / warning rates emitted into log files.
218+
219+
### KurrentDB Health Endpoint
220+
221+
When a KurrentDB node is alive and healthy, it returns a response document at the following URL: https://localhost:2113/health/live?liveCode=200
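
A liveness probe against this URL can be scripted with the standard library. The sketch below assumes the node answers with the HTTP status named in the `liveCode` query parameter when it is alive. The function names are hypothetical, and the optional `context` argument is there because a node with a private certificate authority needs a matching `ssl.SSLContext`.

```python
import urllib.error
import urllib.request


def health_url(base_url, live_code=200):
    """Build the liveness URL for a node, for example https://localhost:2113."""
    return f"{base_url}/health/live?liveCode={live_code}"


def node_is_live(base_url, live_code=200, timeout=5.0, context=None):
    """Return True when the node answers the liveness URL with the expected status."""
    try:
        with urllib.request.urlopen(
            health_url(base_url, live_code), timeout=timeout, context=context
        ) as response:
            return response.status == live_code
    except (urllib.error.URLError, OSError):
        # Connection refused, TLS failure, timeout, or an HTTP error status.
        return False
```

Any failure to connect is treated as "not live", which is usually the right default for an external probe.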
222+
223+
## Conclusion
224+
225+
In this document, we have presented metrics and log entries to monitor, along with some general tips and forward-looking metrics.
226+
227+
## Further Reading
228+
229+
Please see the following Kurrent resources for more information:
230+
231+
* Optimizing KurrentDB: [https://academy.kurrent.io/course/optimizing-kurrentdb](https://academy.kurrent.io/course/optimizing-kurrentdb)
* KurrentDB Operations Courses: [https://academy.kurrent.io/operations-courses](https://academy.kurrent.io/operations-courses)
* KurrentDB Diagnostics: [https://docs.kurrent.io/server/latest/diagnostics/](https://docs.kurrent.io/server/latest/diagnostics/)