@@ -173,6 +173,39 @@ and limit as well as actually utilization.
173173device.
174174- Container Resources: The CPU and memory limits and requests.
175175
176+ ### Backups
177+
178+ ![PostgreSQL Operator - Monitoring - Backup Health](/images/postgresql-monitoring-backups.png)
179+
180+ There are a variety of reasons why you need to monitor your backups, starting
181+ from answering the fundamental question of "do I have backups available?"
182+ Backups can be used for a variety of situations, from cloning new clusters to
183+ restoring clusters after a disaster. Additionally, Postgres can run into issues
184+ if your backup repository is not healthy, e.g. if it cannot push WAL archives.
185+ If your backups are configured properly and healthy, you are well positioned
186+ to mitigate the risk of data loss!
187+
188+ The backup (pgBackRest) panel provides information about the overall
189+ state of your backups. This includes:
190+
191+ - Recovery Window: This is an indicator of how far back you are able to restore
192+ your data from. This represents all of the backups and archives available in
193+ your backup repository. Typically, your recovery window should be close to your
194+ overall data retention specifications.
195+ - Time Since Last Backup: This indicates how long it has been since your last
196+ backup. This is broken down by pgBackRest backup type (full, incremental,
197+ differential) and also shows the time since the last WAL archive was pushed.
198+ - Backup Runtimes: How long the last backup of a given type (full, incremental,
199+ differential) took to execute. If your backups are slow, consider providing more
200+ resources to the backup jobs and tweaking pgBackRest's performance tuning
201+ settings.
202+ - Backup Size: How large the backups of a given type (full, incremental,
203+ differential) are.
204+ - WAL Stats: Shows the metrics around WAL archive pushes. If you have failing
205+ pushes, you should check whether a transient or permanent error is
206+ preventing WAL archives from being pushed. If left unresolved, this could end up
207+ causing issues for your Postgres cluster.
208+
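As a rough illustration of how the Recovery Window, Time Since Last Backup, and Backup Runtimes values relate to the underlying backup records, here is a minimal Python sketch. The record layout (`type`, `start`, `stop`) and the `backup_health` function are hypothetical and are not part of pgBackRest or the PostgreSQL Operator. The recovery window is approximated as the time elapsed since the oldest full backup finished, assuming WAL archiving has been continuous since then.

```python
from datetime import datetime, timedelta, timezone


def backup_health(backups, now):
    """Summarize backup records (hypothetical layout, for illustration only).

    Each record is a dict with 'type' ('full', 'incr', or 'diff') and
    timezone-aware 'start' and 'stop' datetimes.
    """
    # Keep only the most recently completed backup of each type.
    latest = {}
    for b in backups:
        prev = latest.get(b["type"])
        if prev is None or b["stop"] > prev["stop"]:
            latest[b["type"]] = b

    # Time since the last backup of each type, and how long it took to run.
    since_last = {t: now - b["stop"] for t, b in latest.items()}
    runtimes = {t: b["stop"] - b["start"] for t, b in latest.items()}

    # Assumption: you can restore to any point after the oldest full backup.
    full_stops = [b["stop"] for b in backups if b["type"] == "full"]
    window = now - min(full_stops) if full_stops else None

    return {
        "recovery_window": window,
        "time_since_last": since_last,
        "runtimes": runtimes,
    }
```

A `recovery_window` of `None` in this sketch means that no full backup was found, which corresponds to the "do I have backups available?" question being answered with "no".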
176209### PostgreSQL Service Health Overview
177210
178211![PostgreSQL Operator Monitoring - Service Health Overview](/images/postgresql-monitoring-service.png)
@@ -190,6 +223,43 @@ handling.
190223- Latency: What the overall network latency is when interfacing with the
191224Service.
192225
226+ ### Query Runtime
227+
228+ ![PostgreSQL Operator Monitoring - Query Performance](/images/postgresql-monitoring-query-total.png)
229+
230+ Looking at the overall performance of queries can help optimize a Postgres
231+ deployment, from [providing resources]({{< relref "tutorial/customize-cluster.md" >}}) to query tuning in the application
232+ itself.
233+
234+ You can get a sense of the overall activity of a PostgreSQL cluster from the
235+ chart above:
236+
237+ - Queries Executed: The total number of queries executed on a system during the
238+ period.
239+ - Query runtime: The aggregate runtime of all the queries combined across the
240+ system that were executed in the period.
241+ - Query mean runtime: The average query time across all queries executed on the
242+ system in the given period.
243+ - Rows retrieved or affected: The total number of rows in a database that were
244+ either retrieved or had modifications made to them.
245+
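The four figures above are simple aggregates over per-query statistics. The following Python sketch shows how they relate to one another. The input layout (`calls`, `total_ms`, `rows`) and the `summarize` function are assumptions made for illustration, not the format the monitoring stack actually uses.

```python
def summarize(stats):
    """Aggregate per-query statistics for one period (hypothetical layout).

    Each entry is a dict with 'calls' (executions), 'total_ms' (summed
    runtime in milliseconds), and 'rows' (rows retrieved or affected).
    """
    calls = sum(s["calls"] for s in stats)
    total_ms = sum(s["total_ms"] for s in stats)
    rows = sum(s["rows"] for s in stats)
    return {
        "queries_executed": calls,
        "query_runtime_ms": total_ms,
        # Mean runtime is total runtime divided by the number of executions.
        "query_mean_runtime_ms": total_ms / calls if calls else 0.0,
        "rows": rows,
    }
```

Note that the mean is weighted by how often each query runs, so a single slow but rare query barely moves it.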
246+ PostgreSQL Operator Monitoring also breaks down the queries so you can
247+ identify queries that are being executed too frequently or are taking up too
248+ much time.
249+
250+ ![PostgreSQL Operator Monitoring - Query Analysis](/images/postgresql-monitoring-query-topn.png)
251+
252+ - Query Mean Runtime (Top N): This highlights the N slowest queries by
253+ average runtime on the system. This might indicate you are missing an index
254+ somewhere, or perhaps the query could be rewritten to be more efficient.
255+ - Query Max Runtime (Top N): This highlights the N slowest queries by
256+ absolute runtime. This could indicate that a specific query or the system as a
257+ whole may need more resources.
258+ - Query Total Runtime (Top N): This highlights the N slowest queries by
259+ aggregate runtime. This could indicate that an ORM is looping over a single query
260+ and executing it many times, when it could possibly be rewritten as a single,
261+ faster query.
262+
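The three Top N rankings can surface different queries from the same data. The sketch below, in Python, ranks a small set of per-query statistics by mean, max, and total runtime. The `QueryStat` type, its fields, and the `top_n` helper are hypothetical names chosen for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class QueryStat:
    """Per-query statistics for one period (hypothetical layout)."""
    query: str
    calls: int
    total_ms: float
    max_ms: float


def mean_ms(s):
    # Average runtime per execution; guard against queries with no calls.
    return s.total_ms / s.calls if s.calls else 0.0


def top_n(stats, n, key):
    """Return the n entries with the largest value of key, largest first."""
    return heapq.nlargest(n, stats, key=key)


stats = [
    QueryStat("monthly report", calls=2, total_ms=4000.0, max_ms=3000.0),
    QueryStat("ORM row lookup", calls=10000, total_ms=20000.0, max_ms=15.0),
    QueryStat("dashboard count", calls=100, total_ms=500.0, max_ms=50.0),
]

slowest_by_mean = top_n(stats, 1, key=mean_ms)
slowest_by_max = top_n(stats, 1, key=lambda s: s.max_ms)
slowest_by_total = top_n(stats, 1, key=lambda s: s.total_ms)
```

With this sample data, the report query leads the mean and max rankings, while the cheap lookup that runs 10,000 times leads the total runtime ranking, which is the ORM looping pattern described above.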
193263### Alerts
194264
195265![PostgreSQL Operator Monitoring - Alerts](/images/postgresql-monitoring-alerts.png)