Commit ea29237

Author: Jonathan S. Katz
Update monitoring documentation
This updates the monitoring documentation to describe the enhancements that were added in 994846e.
1 parent e11148c commit ea29237

6 files changed, +70 -0 lines changed

docs/content/architecture/monitoring.md

Lines changed: 70 additions & 0 deletions
@@ -173,6 +173,39 @@ and limit as well as actually utilization.
device.
- Container Resources: The CPU and memory limits and requests.

### Backups

![PostgreSQL Operator - Monitoring - Backup Health](/images/postgresql-monitoring-backups.png)

There are a variety of reasons why you need to monitor your backups, starting
from answering the fundamental question of "do I have backups available?"
Backups can be used for a variety of situations, from cloning new clusters to
restoring clusters after a disaster. Additionally, Postgres can run into issues
if your backup repository is not healthy, e.g. if it cannot push WAL archives.
If your backups are set up properly and healthy, you will be well positioned to
mitigate the risk of data loss!

The backup, or pgBackRest, panel provides information about the overall
state of your backups. This includes:

- Recovery Window: This is an indicator of how far back you are able to restore
your data from. This represents all of the backups and archives available in
your backup repository. Typically, your recovery window should be close to your
overall data retention specifications.
- Time Since Last Backup: This indicates how long it has been since your last
backup. This is broken down into pgBackRest backup type (full, incremental,
differential) as well as time since the last WAL archive was pushed.
- Backup Runtimes: How long the last backup of a given type (full, incremental,
differential) took to execute. If your backups are slow, consider providing more
resources to the backup jobs and tweaking pgBackRest's performance tuning
settings.
- Backup Size: How large the backups of a given type (full, incremental,
differential) are.
- WAL Stats: Shows the metrics around WAL archive pushes. If you have failing
pushes, you should check whether there is a transient or permanent error that is
preventing WAL archives from being pushed. If left untreated, this could end up
causing issues for your Postgres cluster.

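As a rough illustration of how the backup panel figures relate to one another, the following Python sketch derives them from a list of backup records. The `BackupRecord` type, its field names, and `summarize_backups` are hypothetical and exist only for this example; they are not part of pgBackRest or the PostgreSQL Operator, and the dashboard may compute its values differently. The recovery window here also assumes WAL archives are continuous from the oldest backup onward.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BackupRecord:
    """Hypothetical record of one completed backup (not a pgBackRest type)."""
    kind: str            # "full", "incr", or "diff"
    started: datetime
    finished: datetime
    size_bytes: int


def summarize_backups(backups, now):
    """Derive panel-style figures; expects at least one backup record."""
    latest = {}
    for b in backups:
        # Keep only the most recently finished backup of each type.
        if b.kind not in latest or b.finished > latest[b.kind].finished:
            latest[b.kind] = b
    oldest = min(backups, key=lambda b: b.finished)
    return {
        # Assumption: WAL is continuous from the oldest backup until now.
        "recovery_window": now - oldest.finished,
        "time_since_last": {k: now - b.finished for k, b in latest.items()},
        "last_runtime": {k: b.finished - b.started for k, b in latest.items()},
        "last_size_bytes": {k: b.size_bytes for k, b in latest.items()},
    }


# Usage with made-up data: one weekly full backup and one recent incremental.
now = datetime(2021, 1, 8, 12, 0, tzinfo=timezone.utc)
backups = [
    BackupRecord("full", now - timedelta(days=7, minutes=30),
                 now - timedelta(days=7), 10_000_000),
    BackupRecord("incr", now - timedelta(hours=6, minutes=5),
                 now - timedelta(hours=6), 500_000),
]
summary = summarize_backups(backups, now)
```

In this made-up data set the recovery window is seven days, which you would then compare against your own retention policy.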
### PostgreSQL Service Health Overview

![PostgreSQL Operator Monitoring - Service Health Overview](/images/postgresql-monitoring-service.png)
@@ -190,6 +223,43 @@ handling.
- Latency: What the overall network latency is when interfacing with the
Service.

### Query Runtime

![PostgreSQL Operator Monitoring - Query Performance](/images/postgresql-monitoring-query-total.png)

Looking at the overall performance of queries can help optimize a Postgres
deployment, from [providing resources]({{< relref "tutorial/customize-cluster.md" >}}) to tuning queries in the application
itself.

You can get a sense of the overall activity of a PostgreSQL cluster from the
chart that is visualized above:

- Queries Executed: The total number of queries executed on a system during the
period.
- Query runtime: The aggregate runtime of all the queries combined across the
system that were executed in the period.
- Query mean runtime: The average query time across all queries executed on the
system in the given period.
- Rows retrieved or affected: The total number of rows in a database that were
either retrieved or had modifications made to them.

PostgreSQL Operator Monitoring also further breaks down the queries so you can
identify queries that are being executed too frequently or are taking up too
much time.

![PostgreSQL Operator Monitoring - Query Analysis](/images/postgresql-monitoring-query-topn.png)

- Query Mean Runtime (Top N): This highlights the N slowest queries by
average runtime on the system. This might indicate you are missing an index
somewhere, or perhaps the query could be rewritten to be more efficient.
- Query Max Runtime (Top N): This highlights the N slowest queries by
absolute runtime. This could indicate that a specific query or the system as a
whole may need more resources.
- Query Total Runtime (Top N): This highlights the N slowest queries by
aggregate runtime. This could indicate that an ORM is executing a single query
many times in a loop, where it could possibly be rewritten as a single, faster
query.

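The three rankings can surface different queries from the same data. The Python sketch below shows this with made-up counters; the `QueryStat` type, its fields, and the helper functions are hypothetical and only illustrate the idea of ranking by mean, maximum, and total runtime. They do not reflect the dashboard's actual implementation.

```python
import heapq
from dataclasses import dataclass


@dataclass
class QueryStat:
    """Hypothetical per-statement counters for one reporting period."""
    query: str
    calls: int
    total_ms: float
    max_ms: float


def mean_ms(stat):
    """Average runtime per execution, guarding against zero calls."""
    return stat.total_ms / stat.calls if stat.calls else 0.0


def top_n(stats, n, key):
    """Return the n statements with the largest value of key."""
    return heapq.nlargest(n, stats, key=key)


# Made-up data: a query run in a tight loop, a slow report, and a typical query.
stats = [
    QueryStat("orm_loop_lookup", calls=10_000, total_ms=20_000.0, max_ms=5.0),
    QueryStat("monthly_report", calls=2, total_ms=3_000.0, max_ms=2_500.0),
    QueryStat("typical_select", calls=100, total_ms=500.0, max_ms=20.0),
]

by_mean = top_n(stats, 1, mean_ms)
by_max = top_n(stats, 1, lambda s: s.max_ms)
by_total = top_n(stats, 1, lambda s: s.total_ms)
```

Here the report query leads both the mean and max rankings, while the looped lookup only stands out when ranked by total runtime, which is the pattern the last bullet describes.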
### Alerts

![PostgreSQL Operator Monitoring - Alerts](/images/postgresql-monitoring-alerts.png)