Commit 25dbd43

Update monitoring documentation
This adds information about how the PostgreSQL Operator Monitoring stack works, provides guidance on how to mitigate issues, and adds a bunch of pictures. Issue: [ch8769]
1 parent 6722f14 commit 25dbd43

File tree

10 files changed (+279 −80 lines changed)


README.md

Lines changed: 5 additions & 5 deletions
@@ -9,27 +9,27 @@
 The [Crunchy PostgreSQL Operator](https://access.crunchydata.com/documentation/postgres-operator/) automates and simplifies deploying and managing open source PostgreSQL clusters on Kubernetes and other Kubernetes-enabled Platforms by providing the essential features you need to keep your PostgreSQL clusters up and running, including:

-#### PostgreSQL Cluster Provisioning
+#### PostgreSQL Cluster [Provisioning](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/provisioning/)

 [Create, Scale, & Delete PostgreSQL clusters with ease](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/provisioning/), while fully customizing your Pods and PostgreSQL configuration!

-#### High-Availability
+#### [High Availability](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/high-availability/)

 Safe, automated failover backed by a [distributed consensus based high-availability solution](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/high-availability/). Uses [Pod Anti-Affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity) to help resiliency; you can configure how aggressive this can be! Failed primaries automatically heal, allowing for faster recovery time.

 Support for [standby PostgreSQL clusters](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/high-availability/multi-cluster-kubernetes/) that work both within and across [multiple Kubernetes clusters](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/high-availability/multi-cluster-kubernetes/).

-#### Disaster Recovery
+#### [Disaster Recovery](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/disaster-recovery/)

 Backups and restores leverage the open source [pgBackRest](https://www.pgbackrest.org) utility and [include support for full, incremental, and differential backups as well as efficient delta restores](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/disaster-recovery/). Set how long you want your backups retained for. Works great with very large databases!

 #### TLS

 Secure communication between your applications and data servers by [enabling TLS for your PostgreSQL servers](https://access.crunchydata.com/documentation/postgres-operator/latest/pgo-client/common-tasks/#enable-tls), including the ability to enforce that all of your connections use TLS.

-#### Monitoring
+#### [Monitoring](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/monitoring/)

-Track the health of your PostgreSQL clusters using the open source [pgMonitor](https://github.com/CrunchyData/pgmonitor) library.
+[Track the health of your PostgreSQL clusters](https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/monitoring/) using the open source [pgMonitor](https://github.com/CrunchyData/pgmonitor) library.

 #### PostgreSQL User Management

docs/content/_index.md

Lines changed: 7 additions & 5 deletions
@@ -14,27 +14,29 @@ Latest Release: {{< param operatorVersion >}}
 The [Crunchy PostgreSQL Operator](https://www.crunchydata.com/developers/download-postgres/containers/postgres-operator) automates and simplifies deploying and managing open source PostgreSQL clusters on Kubernetes and other Kubernetes-enabled Platforms by providing the essential features you need to keep your PostgreSQL clusters up and running, including:

-#### PostgreSQL Cluster Provisioning
+#### PostgreSQL Cluster [Provisioning]({{< relref "/architecture/provisioning.md" >}})

 [Create, Scale, & Delete PostgreSQL clusters with ease](/architecture/provisioning/), while fully customizing your Pods and PostgreSQL configuration!

-#### High-Availability
+#### [High Availability]({{< relref "/architecture/high-availability/_index.md" >}})

 Safe, automated failover backed by a [distributed consensus based high-availability solution](/architecture/high-availability/). Uses [Pod Anti-Affinity](https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity) to help resiliency; you can configure how aggressive this can be! Failed primaries automatically heal, allowing for faster recovery time.

 Support for [standby PostgreSQL clusters]({{< relref "/architecture/high-availability/multi-cluster-kubernetes.md" >}}) that work both within and across [multiple Kubernetes clusters]({{< relref "/architecture/high-availability/multi-cluster-kubernetes.md" >}}).

-#### Disaster Recovery
+#### [Disaster Recovery]({{< relref "/architecture/disaster-recovery.md" >}})

 Backups and restores leverage the open source [pgBackRest](https://www.pgbackrest.org) utility and [include support for full, incremental, and differential backups as well as efficient delta restores](/architecture/disaster-recovery/). Set how long you want your backups retained for. Works great with very large databases!

 #### TLS

 Secure communication between your applications and data servers by [enabling TLS for your PostgreSQL servers](/pgo-client/common-tasks/#enable-tls), including the ability to enforce that all of your connections use TLS.

-#### Monitoring
+#### [Monitoring]({{< relref "/architecture/monitoring.md" >}})

-Track the health of your PostgreSQL clusters using the open source [pgMonitor](https://github.com/CrunchyData/pgmonitor) library.
+[Track the health of your PostgreSQL clusters]({{< relref "/architecture/monitoring.md" >}})
+using the open source [pgMonitor](https://github.com/CrunchyData/pgmonitor)
+library.

 #### PostgreSQL User Management

docs/content/architecture/monitoring.md

Lines changed: 238 additions & 0 deletions
@@ -0,0 +1,238 @@
---
title: "Monitoring"
date:
draft: false
weight: 350
---

![PostgreSQL Operator Monitoring](/images/postgresql-monitoring.png)

While having [high availability]({{< relref "architecture/high-availability/_index.md" >}})
and [disaster recovery]({{< relref "architecture/disaster-recovery.md" >}})
systems in place helps in the event of something going wrong with your
PostgreSQL cluster, monitoring helps you anticipate problems before they happen.
Additionally, monitoring can help you diagnose and resolve additional issues
that may not result in downtime, but cause degraded performance.

There are many different ways to monitor systems within Kubernetes, including
tools that come with Kubernetes itself. This is by no means meant to be a
comprehensive guide on how to monitor everything in Kubernetes, but rather a
description of what the PostgreSQL Operator provides to give you an
[out-of-the-box monitoring solution]({{< relref "installation/metrics/_index.md" >}}).

## Getting Started

If you want to install the metrics stack, please visit the [installation]({{< relref "installation/metrics/_index.md" >}})
instructions for the [PostgreSQL Operator Monitoring]({{< relref "installation/metrics/_index.md" >}})
stack.

Once the metrics stack is set up, you will need to deploy your PostgreSQL
clusters with monitoring enabled. To do so, use the `--metrics`
flag as part of the [`pgo create cluster`]({{< relref "pgo-client/reference/pgo_create_cluster.md" >}})
command, for example:

```
pgo create cluster --metrics hippo
```

## Components

The [PostgreSQL Operator Monitoring]({{< relref "installation/metrics/_index.md" >}})
stack is made up of several open source components:

- [pgMonitor](https://github.com/CrunchyData/pgmonitor), which provides the core
of the monitoring infrastructure, including the following components:
  - [postgres_exporter](https://github.com/CrunchyData/pgmonitor/tree/master/exporter/postgres),
  which provides queries used to collect metrics information about a PostgreSQL
  instance.
  - [Prometheus](https://github.com/prometheus/prometheus), a time-series
  database that scrapes and stores the collected metrics so they can be consumed
  by other services.
  - [Grafana](https://github.com/grafana/grafana), a visualization tool that
  provides charting and other capabilities for viewing the collected monitoring
  data.
  - [Alertmanager](https://github.com/prometheus/alertmanager), a tool that
  can send alerts when metrics hit a certain threshold that requires someone to
  intervene.
- [pgnodemx](https://github.com/CrunchyData/pgnodemx), a PostgreSQL extension
that is able to pull container-specific metrics (e.g. CPU utilization, memory
consumption) from the container itself via SQL queries.
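
To get a feel for the kind of data the exporter collects, you can run a query
like the following directly against a PostgreSQL instance. This is an
illustrative query against the built-in `pg_stat_activity` view, not the exact
query set that ships with pgMonitor:

```
-- Count client connections by state (active, idle, idle in transaction, ...),
-- similar in spirit to the connection metrics shown in the dashboards below.
SELECT state, count(*) AS connections
  FROM pg_stat_activity
 WHERE state IS NOT NULL
 GROUP BY state
 ORDER BY connections DESC;
```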

## Visualizations

Below is a brief description of all the visualizations provided by the
[PostgreSQL Operator Monitoring]({{< relref "installation/metrics/_index.md" >}})
stack. Some of the descriptions may include some directional guidance on how to
interpret the charts, though this is only to provide a starting point: actual
causes and effects of issues can vary between systems.

Many of the visualizations can be broken down based on the following groupings:

- Cluster: which PostgreSQL cluster should be viewed
- Pod: the specific Pod or PostgreSQL instance

### Overview

![PostgreSQL Operator Monitoring - Overview](/images/postgresql-monitoring-overview.png)

The Overview provides a high-level view of all of the PostgreSQL clusters that
are being monitored by the PostgreSQL Operator Monitoring stack. This includes
the following information:

- The name of the PostgreSQL cluster and the namespace that it is in
- The type of PostgreSQL cluster (HA [high availability] or standalone)
- The status of the cluster, as indicated by color. Green indicates the cluster
is available; red indicates that it is not.

Each entry is clickable to provide additional cluster details.

### PostgreSQL Details

![PostgreSQL Operator Monitoring - Cluster Details](/images/postgresql-monitoring.png)

The PostgreSQL Details view provides more information about a specific
PostgreSQL cluster that is being managed and monitored by the PostgreSQL
Operator. This view surfaces many key PostgreSQL-specific metrics that help you
make decisions around managing a PostgreSQL cluster. These include:

- Backup Status: The last time a backup was taken of the cluster. Green is good.
Orange means that a backup has not been taken in more than a day and may warrant
investigation.
- Active Connections: How many clients are connected to the database. Too many
clients connected could impact performance and, for values approaching 100%, can
lead to clients being unable to connect.
- Idle in Transaction: How many clients have a connection state of "idle in
transaction". Too many clients in this state can cause performance issues and,
in certain cases, maintenance issues.
- Idle: How many clients are connected but are in an "idle" state.
- TPS: The number of "transactions per second" that are occurring. Usually needs
to be combined with another metric to help with analysis. "Higher is better"
when performing benchmarking.
- Connections: An aggregated view of active, idle, and idle in transaction
connections.
- Database Size: How large databases are within a PostgreSQL cluster. Typically
combined with another metric for analysis. Helps keep track of overall disk
usage and whether any triage steps need to occur around PVC size.
- WAL Size: How much space write-ahead logs (WAL) are taking up on disk. This
can contribute to extra space being used on your data disk, or can give you an
indication of how much space is being utilized on a separate WAL PVC. If you
are using replication slots, this can help indicate that a slot is not being
acknowledged if the numbers are much larger than the `max_wal_size` setting (the
PostgreSQL Operator does not use slots by default).
- Row Activity: The number of rows that are selected, inserted, updated, and
deleted. This can help you determine what percentage of your workload is read
vs. write, and help make database tuning decisions based on that, in conjunction
with other metrics.
- Replication Status: Provides information on how much replication lag
there is between primary and replica PostgreSQL instances, both in bytes and
time. This can provide an indication of how much data could be lost in the event
of a failover.
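
A few of these panels can be spot-checked directly from `psql` with queries
against PostgreSQL's statistics views and administration functions. These are
illustrative queries, not the exact ones used by the dashboards:

```
-- Replication byte lag per replica (cf. the Replication Status panel).
-- Run this on the primary; column names assume PostgreSQL 10 or later.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_lag
  FROM pg_stat_replication;

-- Per-database size (cf. the Database Size panel).
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
  FROM pg_database
 ORDER BY pg_database_size(datname) DESC;

-- Total size of WAL currently on disk (cf. the WAL Size panel).
-- Requires appropriate privileges (e.g. superuser or pg_monitor membership).
SELECT pg_size_pretty(sum(size)) AS wal_size
  FROM pg_ls_waldir();
```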

![PostgreSQL Operator Monitoring - Cluster Details 2](/images/postgresql-monitoring-cluster.png)

- Conflicts / Deadlocks: These occur when PostgreSQL is unable to complete
operations, which can result in transaction loss. The goal is for these numbers
to be `0`. If these are occurring, check your data access and writing patterns.
- Cache Hit Ratio: A measure of how much of the "working data", e.g. data that
is being accessed and manipulated, resides in memory. This is used to understand
how much PostgreSQL has to utilize the disk. The target number for this
should be as high as possible. How to achieve this is the subject of books, but
it certainly takes effort in how your applications use PostgreSQL.
- Buffers: The buffer usage of various parts of the PostgreSQL system. This can
be used to help understand the overall throughput between various parts of the
system.
- Commit & Rollback: How many transactions are committed and rolled back.
- Locks: The number of locks that are present on a given system.
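
As with the panels above, some of these values can be inspected directly. For
example, the following illustrative queries approximate the cache hit ratio and
the commit/rollback counters using the built-in `pg_stat_database` view:

```
-- Approximate cache hit ratio across all databases: values close to 1.0
-- mean most reads are served from shared buffers rather than disk.
SELECT round(sum(blks_hit)::numeric /
             nullif(sum(blks_hit) + sum(blks_read), 0), 4) AS cache_hit_ratio
  FROM pg_stat_database;

-- Commits and rollbacks since the statistics were last reset.
SELECT sum(xact_commit) AS commits, sum(xact_rollback) AS rollbacks
  FROM pg_stat_database;
```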

### Pod Details

![PostgreSQL Operator Monitoring - Pod Details](/images/postgresql-monitoring-pod.png)

Pod Details provide information about a given Pod or Pods that are being used
by a PostgreSQL cluster. These are similar to "operating system" or "node"
metrics, with the difference that these look at resource utilization by
a container, not the entire node.

It may be helpful to view these metrics on a per-Pod basis, by using the Pod
filter at the top of the dashboard.

- Disk Usage: How much space is being consumed by a volume.
- Disk Activity: How many reads and writes are occurring on a volume.
- Memory: Various information about memory utilization, including the request
and limit as well as actual utilization.
- CPU: The amount of CPU being utilized by a Pod.
- Network Traffic: The amount of networking traffic passing through each network
device.
- Container Resources: The CPU and memory limits and requests.

### PostgreSQL Service Health Overview

![PostgreSQL Operator Monitoring - Service Health Overview](/images/postgresql-monitoring-service.png)

The Service Health Overview provides information about the Kubernetes Services
that sit in front of the PostgreSQL Pods, giving insight into the status of the
network:

- Saturation: How much of the available network to the Service is being
consumed. High saturation may cause degraded performance to clients or create
an inability to connect to the PostgreSQL cluster.
- Traffic: Displays the number of transactions per minute that the Service is
handling.
- Errors: Displays the total number of errors occurring at a particular Service.
- Latency: The overall network latency when interfacing with the Service.

### Alerts

![PostgreSQL Operator Monitoring - Alerts](/images/postgresql-monitoring-alerts.png)

Alerting lets you view and receive alerts about actions that require
intervention, for example, an HA cluster that cannot self-heal. The alerting
system is powered by [Alertmanager](https://github.com/prometheus/alertmanager).

The alerts that come installed by default include:

- `PGExporterScrapeError`: The Crunchy PostgreSQL Exporter is having issues
scraping statistics used as part of the monitoring stack.
- `PGIsUp`: A PostgreSQL instance is down.
- `PGIdleTxn`: There are too many connections that are in the
"idle in transaction" state.
- `PGQueryTime`: A single PostgreSQL query is taking too long to run. Issues a
warning at 12 hours and goes critical after 24.
- `PGConnPerc`: Indicates that there are too many connection slots being used.
Issues a warning at 75% and goes critical above 90%.
- `PGDBSize`: Indicates that a PostgreSQL database is too large and could be in
danger of running out of disk space. Issues a warning at 75% and goes critical
at 90%.
- `PGReplicationByteLag`: Indicates that a replica is too far behind a primary
instance, which could risk data loss in a failover scenario. Issues a warning at
50MB and goes critical at 100MB.
- `PGReplicationSlotsInactive`: Indicates that a replication slot is inactive.
Not attending to this can lead to out-of-disk errors.
- `PGXIDWraparound`: Indicates that a PostgreSQL instance is nearing transaction
ID wraparound. Issues a warning at 50% and goes critical at 75%. It's important
that you [vacuum your database](https://info.crunchydata.com/blog/managing-transaction-id-wraparound-in-postgresql)
to prevent this.
- `PGEmergencyVacuum`: Indicates that autovacuum is not running, i.e. it's past
its "freeze" age. Issues a warning at 110% and goes critical at 125%.
- `PGArchiveCommandStatus`: Indicates that the archive command, which is used
to ship WAL archives to pgBackRest, is failing.
- `PGSequenceExhaustion`: Indicates that a sequence is over 75% used.
- `PGSettingsPendingRestart`: Indicates that there are settings changed on a
PostgreSQL instance that require a restart.
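
A couple of these conditions are easy to check by hand from `psql` if an alert
fires. The following queries are illustrative, not the exact expressions used
by the alert rules:

```
-- How close each database is to transaction ID wraparound (cf. PGXIDWraparound).
-- Compare the age against autovacuum_freeze_max_age (200 million by default).
SELECT datname, age(datfrozenxid) AS xid_age
  FROM pg_database
 ORDER BY age(datfrozenxid) DESC;

-- Settings that have been changed but require a restart to take effect
-- (cf. PGSettingsPendingRestart).
SELECT name, setting
  FROM pg_settings
 WHERE pending_restart;
```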

Optional alerts that can be enabled:

- `PGMinimumVersion`: Indicates if PostgreSQL is below a desired version.
- `PGRecoveryStatusSwitch_Replica`: Indicates that a replica has been promoted
to a primary.
- `PGConnectionAbsent_Prod`: Indicates that metrics collection is absent from a
PostgreSQL instance.
- `PGSettingsChecksum`: Indicates that PostgreSQL settings have changed from a
previous state.
- `PGDataChecksum`: Indicates that there are data checksum failures on a
PostgreSQL instance. This could be a sign of data corruption.

You can modify these alerts as you see fit, and add your own alerts as well!
Please see the [installation instructions]({{< relref "installation/metrics/_index.md" >}})
for general setup of the PostgreSQL Operator Monitoring stack.
