Commit ab797d3

Merge pull request #62710 from CarmiWisemon/oadp2300metric
2 parents 911fe1e + c157822 commit ab797d3

11 files changed

+464
-38
lines changed

backup_and_restore/application_backup_and_restore/troubleshooting.adoc

Lines changed: 15 additions & 1 deletion
@@ -13,7 +13,7 @@ You can debug Velero custom resources (CRs) by using the xref:../../backup_and_r
1313

1414
You can check xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-installation-issues_oadp-troubleshooting[installation issues], xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-backup-restore-cr-issues_oadp-troubleshooting[backup and restore CR issues], and xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-restic-issues_oadp-troubleshooting[Restic issues].
1515

16-
You can collect logs, CR information, and Prometheus metric data by using the xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#migration-using-must-gather_oadp-troubleshooting[`must-gather` tool].
16+
You can collect logs and CR information by using the xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#migration-using-must-gather_oadp-troubleshooting[`must-gather` tool].
1717

1818
You can obtain the Velero CLI tool by:
1919

@@ -89,5 +89,19 @@ include::modules/oadp-backup-restore-cr-issues.adoc[leveloffset=+1]
8989
include::modules/oadp-restic-issues.adoc[leveloffset=+1]
9090

9191
include::modules/migration-using-must-gather.adoc[leveloffset=+1]
92+
include::modules/oadp-monitoring.adoc[leveloffset=+1]
93+
[role="_additional-resources"]
94+
.Additional resources
95+
* xref:../../monitoring/monitoring-overview.adoc#about-openshift-monitoring[Monitoring stack]
96+
97+
include::modules/oadp-monitoring-setup.adoc[leveloffset=+2]
98+
include::modules/oadp-creating-service-monitor.adoc[leveloffset=+2]
99+
include::modules/oadp-creating-alerting-rule.adoc[leveloffset=+2]
100+
[role="_additional-resources"]
101+
.Additional resources
102+
* xref:../../monitoring/managing-alerts.adoc#managing-alerts[Managing alerts]
103+
104+
include::modules/oadp-list-of-metrics.adoc[leveloffset=+2]
105+
include::modules/oadp-viewing-metrics-ui.adoc[leveloffset=+2]
92106

93107
:!oadp-troubleshooting:

images/oadp-backup-failing-alert.png

46.6 KB

images/oadp-metrics-query.png

77.8 KB

images/oadp-metrics-targets.png

49.4 KB

modules/migration-using-must-gather.adoc

Lines changed: 0 additions & 37 deletions
@@ -82,40 +82,3 @@ $ oc adm must-gather --image={must-gather} \
8282
+
8383
This operation can take a long time. The data is saved as `must-gather/metrics/prom_data.tar.gz`.
8484
85-
[discrete]
86-
[id="viewing-data-with-prometheus-console_{context}"]
87-
== Viewing metrics data with the Prometheus console
88-
89-
You can view the metrics data with the Prometheus console.
90-
91-
.Procedure
92-
93-
. Decompress the `prom_data.tar.gz` file:
94-
+
95-
[source,terminal]
96-
----
97-
$ tar -xvzf must-gather/metrics/prom_data.tar.gz
98-
----
99-
100-
. Create a local Prometheus instance:
101-
+
102-
[source,terminal]
103-
----
104-
$ make prometheus-run
105-
----
106-
+
107-
The command outputs the Prometheus URL.
108-
+
109-
.Output
110-
[source,terminal]
111-
----
112-
Started Prometheus on http://localhost:9090
113-
----
114-
115-
. Launch a web browser and navigate to the URL to view the data by using the Prometheus web console.
116-
. After you have viewed the data, delete the Prometheus instance and data:
117-
+
118-
[source,terminal]
119-
----
120-
$ make prometheus-cleanup
121-
----

modules/oadp-creating-alerting-rule.adoc

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="creating-alerting-rules_{context}"]
7+
= Creating an alerting rule
8+
9+
The {product-title} monitoring stack lets you receive alerts that are configured by using alerting rules. To create an alerting rule for the OADP project, use one of the metrics that user workload monitoring scrapes.
10+
11+
.Procedure
12+
13+
. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`.
14+
+
15+
.Sample `OADPBackupFailing` alert
16+
[source,yaml]
18+
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-backup-alert
    rules:
    - alert: OADPBackupFailing
      annotations:
        description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
        summary: OADP has issues creating backups
      expr: |
        increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning
----
38+
+
39+
In this sample, the alert displays under the following conditions:
40+
+
41+
* The number of failed backups increased by more than 0 during the last 2 hours, and this state persists for at least 5 minutes.
42+
* If the condition has been true for less than 5 minutes, the alert is in a `Pending` state. After 5 minutes, it changes to a `Firing` state.
43+
+
44+
. Apply the `4_create_oadp_alert_rule.yaml` file, which creates the `PrometheusRule` object in the `openshift-adp` namespace:
45+
+
46+
[source,terminal]
47+
----
48+
$ oc apply -f 4_create_oadp_alert_rule.yaml
49+
----
50+
+
51+
.Example output
52+
[source,terminal]
53+
----
54+
prometheusrule.monitoring.coreos.com/sample-oadp-alert created
55+
----
56+
57+
.Verification
58+
. After the alert is triggered, you can view it in the following ways:
59+
** In the *Developer* view, select the *Observe* menu.
60+
** In the *Administrator* view, under the *Observe* -> *Alerting* menu, select *User* in the *Filter* box. Otherwise, only the *Platform* alerts are displayed by default.
61+
+
62+
.OADP backup failing alert
63+
64+
image::oadp-backup-failing-alert.png[OADP backup failing alert]
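
The same pattern might be applied to other counters that OADP exposes. The following is a minimal sketch, not part of the documented procedure: it reuses the `job` label from the sample above and the `velero_backup_partial_failure_total` counter, while the object name, group name, and alert name are hypothetical placeholders.

.Sketch of an alert for partially failed backups
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-partial-failure-alert # hypothetical name
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-partial-failure-alert # hypothetical name
    rules:
    - alert: OADPBackupPartiallyFailing # hypothetical alert name
      annotations:
        description: 'OADP had {{$value | humanize}} partially failed backups over the last 2 hours.'
        summary: OADP backups are completing with partial failures
      # Same shape as the sample rule: the counter must have grown within the 2-hour window.
      expr: |
        increase(velero_backup_partial_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
      for: 5m
      labels:
        severity: warning
----

Keeping the `for: 5m` clause means that this rule would also pass through the `Pending` state before it fires.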
65+
66+

modules/oadp-creating-service-monitor.adoc

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
4+
5+
:_content-type: PROCEDURE
6+
[id="oadp-creating-service-monitor_{context}"]
7+
= Creating OADP service monitor
8+
9+
OADP provides an `openshift-adp-velero-metrics-svc` service, which is created when the Data Protection Application (DPA) is configured. The service monitor used by user workload monitoring must point to this service.
10+
11+
Get details about the service by running the following command:
12+
13+
.Procedure
14+
15+
. Ensure that the `openshift-adp-velero-metrics-svc` service exists. It should contain the `app.kubernetes.io/name=velero` label, which is used as the selector for the `ServiceMonitor` object.
16+
17+
+
18+
[source,terminal]
19+
----
20+
$ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
21+
----
22+
+
23+
.Example output
24+
[source,terminal]
25+
----
NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
openshift-adp-velero-metrics-svc   ClusterIP   172.30.38.244   <none>        8085/TCP   1h
----
29+
+
30+
. Create a `ServiceMonitor` YAML file that matches the existing service label, and save the file as `3_create_oadp_service_monitor.yaml`. The service monitor is created in the `openshift-adp` namespace where the `openshift-adp-velero-metrics-svc` service resides.
31+
+
32+
.Example `ServiceMonitor` object
33+
[source,yaml]
35+
----
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: oadp-service-monitor
  name: oadp-service-monitor
  namespace: openshift-adp
spec:
  endpoints:
  - interval: 30s
    path: /metrics
    targetPort: 8085
    scheme: http
  selector:
    matchLabels:
      app.kubernetes.io/name: "velero"
----
53+
+
54+
. Apply the `3_create_oadp_service_monitor.yaml` file:
55+
+
56+
[source,terminal]
57+
----
58+
$ oc apply -f 3_create_oadp_service_monitor.yaml
59+
----
60+
+
61+
.Example output
62+
[source,terminal]
63+
----
64+
servicemonitor.monitoring.coreos.com/oadp-service-monitor created
65+
----
66+
67+
.Verification
68+
69+
* Confirm that the new service monitor is in an *Up* state by using the *Administrator* perspective of the {product-title} web console:
70+
.. Navigate to the *Observe* -> *Targets* page.
71+
.. Ensure that the *Filter* is unselected or that the *User* source is selected, and type `openshift-adp` in the `Text` search field.
72+
.. Verify that the *Status* for the service monitor is *Up*.
73+
+
74+
.OADP metrics targets
75+
76+
image::oadp-metrics-targets.png[OADP metrics targets]

modules/oadp-list-of-metrics.adoc

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
4+
5+
:_content-type: REFERENCE
6+
[id="list-of-metrics_{context}"]
7+
= List of available metrics
8+
9+
The following table lists the metrics provided by OADP, together with their https://prometheus.io/docs/concepts/metric_types/[types].
10+
11+
|===
12+
|Metric name |Description |Type
13+
14+
|`kopia_content_cache_hit_bytes`
15+
|Number of bytes retrieved from the cache
16+
|Counter
17+
18+
|`kopia_content_cache_hit_count`
19+
|Number of times content was retrieved from the cache
20+
|Counter
21+
22+
|`kopia_content_cache_malformed`
23+
|Number of times malformed content was read from the cache
24+
|Counter
25+
26+
|`kopia_content_cache_miss_count`
27+
|Number of times content was not found in the cache and fetched
28+
|Counter
29+
30+
|`kopia_content_cache_missed_bytes`
31+
|Number of bytes retrieved from the underlying storage
32+
|Counter
33+
34+
|`kopia_content_cache_miss_error_count`
35+
|Number of times content could not be found in the underlying storage
36+
|Counter
37+
38+
|`kopia_content_cache_store_error_count`
39+
|Number of times content could not be saved in the cache
40+
|Counter
41+
42+
|`kopia_content_get_bytes`
43+
|Number of bytes retrieved using `GetContent()`
44+
|Counter
45+
46+
|`kopia_content_get_count`
47+
|Number of times `GetContent()` was called
48+
|Counter
49+
50+
|`kopia_content_get_error_count`
51+
|Number of times `GetContent()` was called and the result was an error
52+
|Counter
53+
54+
|`kopia_content_get_not_found_count`
55+
|Number of times `GetContent()` was called and the result was not found
56+
|Counter
57+
58+
|`kopia_content_write_bytes`
59+
|Number of bytes passed to `WriteContent()`
60+
|Counter
61+
62+
|`kopia_content_write_count`
63+
|Number of times `WriteContent()` was called
64+
|Counter
65+
66+
|`velero_backup_attempt_total`
67+
|Total number of attempted backups
68+
|Counter
69+
70+
|`velero_backup_deletion_attempt_total`
71+
|Total number of attempted backup deletions
72+
|Counter
73+
74+
|`velero_backup_deletion_failure_total`
75+
|Total number of failed backup deletions
76+
|Counter
77+
78+
|`velero_backup_deletion_success_total`
79+
|Total number of successful backup deletions
80+
|Counter
81+
82+
|`velero_backup_duration_seconds`
83+
|Time taken to complete backup, in seconds
84+
|Histogram
85+
86+
|`velero_backup_failure_total`
87+
|Total number of failed backups
88+
|Counter
89+
90+
|`velero_backup_items_errors`
91+
|Total number of errors encountered during backup
92+
|Gauge
93+
94+
|`velero_backup_items_total`
95+
|Total number of items backed up
96+
|Gauge
97+
98+
|`velero_backup_last_status`
99+
|Last status of the backup. A value of 1 is success, 0 is failure.
100+
|Gauge
101+
102+
|`velero_backup_last_successful_timestamp`
103+
|Last time a backup ran successfully, Unix timestamp in seconds
104+
|Gauge
105+
106+
|`velero_backup_partial_failure_total`
107+
|Total number of partially failed backups
108+
|Counter
109+
110+
|`velero_backup_success_total`
111+
|Total number of successful backups
112+
|Counter
113+
114+
|`velero_backup_tarball_size_bytes`
115+
|Size, in bytes, of a backup
116+
|Gauge
117+
118+
|`velero_backup_total`
119+
|Current number of existing backups
120+
|Gauge
121+
122+
|`velero_backup_validation_failure_total`
123+
|Total number of backups that failed validation
124+
|Counter
125+
126+
|`velero_backup_warning_total`
127+
|Total number of backups with warnings
128+
|Counter
129+
130+
|`velero_csi_snapshot_attempt_total`
131+
|Total number of attempted CSI volume snapshots
132+
|Counter
133+
134+
|`velero_csi_snapshot_failure_total`
135+
|Total number of failed CSI volume snapshots
136+
|Counter
137+
138+
|`velero_csi_snapshot_success_total`
139+
|Total number of successful CSI volume snapshots
140+
|Counter
141+
142+
|`velero_restore_attempt_total`
143+
|Total number of attempted restores
144+
|Counter
145+
146+
|`velero_restore_failed_total`
147+
|Total number of failed restores
148+
|Counter
149+
150+
|`velero_restore_partial_failure_total`
151+
|Total number of partially failed restores
152+
|Counter
153+
154+
|`velero_restore_success_total`
155+
|Total number of successful restores
156+
|Counter
157+
158+
|`velero_restore_total`
159+
|Current number of existing restores
160+
|Gauge
161+
162+
|`velero_restore_validation_failed_total`
163+
|Total number of restores that failed validation
164+
|Counter
165+
166+
|`velero_volume_snapshot_attempt_total`
167+
|Total number of attempted volume snapshots
168+
|Counter
169+
170+
|`velero_volume_snapshot_failure_total`
171+
|Total number of failed volume snapshots
172+
|Counter
173+
174+
|`velero_volume_snapshot_success_total`
175+
|Total number of successful volume snapshots
176+
|Counter
177+
178+
|===
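
Counters and gauges in this table call for different expressions. As an illustration only, the following sketch alerts on the `velero_backup_last_successful_timestamp` gauge by subtracting it from the current time. It assumes the same `job` label that the sample `OADPBackupFailing` alert uses. The object name, alert name, and 24-hour threshold are hypothetical and would need adjusting to your backup schedule.

.Sketch of an alert on a gauge metric
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-stale-backup-alert # hypothetical name
  namespace: openshift-adp
spec:
  groups:
  - name: sample-oadp-stale-backup-alert # hypothetical name
    rules:
    - alert: OADPNoRecentSuccessfulBackup # hypothetical alert name
      annotations:
        summary: OADP has not completed a successful backup in the last 24 hours
      # The gauge holds a Unix timestamp in seconds, so time() minus the gauge
      # gives the age of the last successful backup; 86400 seconds is 24 hours.
      expr: |
        time() - velero_backup_last_successful_timestamp{job="openshift-adp-velero-metrics-svc"} > 86400
      for: 5m
      labels:
        severity: warning
----

For counters such as `velero_backup_failure_total`, an `increase()` over a time window is usually the better fit, because the raw value only ever grows.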
179+
