openshift
diff --git a/backup_and_restore/application_backup_and_restore/troubleshooting.adoc b/backup_and_restore/application_backup_and_restore/troubleshooting.adoc
Lines changed: 15 additions & 1 deletion
diff --git a/images/oadp-backup-failing-alert.png b/images/oadp-backup-failing-alert.png
46.6 KB
diff --git a/images/oadp-metrics-query.png b/images/oadp-metrics-query.png
77.8 KB
diff --git a/images/oadp-metrics-targets.png b/images/oadp-metrics-targets.png
49.4 KB
diff --git a/modules/migration-using-must-gather.adoc b/modules/migration-using-must-gather.adoc
Lines changed: 0 additions & 37 deletions
diff --git a/modules/oadp-creating-alerting-rule.adoc b/modules/oadp-creating-alerting-rule.adoc
Lines changed: 66 additions & 0 deletions
diff --git a/modules/oadp-creating-service-monitor.adoc b/modules/oadp-creating-service-monitor.adoc
Lines changed: 76 additions & 0 deletions
diff --git a/modules/oadp-list-of-metrics.adoc b/modules/oadp-list-of-metrics.adoc
Lines changed: 179 additions & 0 deletions
@@ -13,7 +13,7 @@ You can debug Velero custom resources (CRs) by using the xref:../../backup_and_r
 
 You can check xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-installation-issues_oadp-troubleshooting[installation issues], xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-backup-restore-cr-issues_oadp-troubleshooting[backup and restore CR issues], and xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#oadp-restic-issues_oadp-troubleshooting[Restic issues].
 
-You can collect logs, CR information, and Prometheus metric data by using the xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#migration-using-must-gather_oadp-troubleshooting[`must-gather` tool].
+You can collect logs and CR information by using the xref:../../backup_and_restore/application_backup_and_restore/troubleshooting.adoc#migration-using-must-gather_oadp-troubleshooting[`must-gather` tool].
 
 You can obtain the Velero CLI tool by:
 
@@ -89,5 +89,19 @@ include::modules/oadp-backup-restore-cr-issues.adoc[leveloffset=+1]
 include::modules/oadp-restic-issues.adoc[leveloffset=+1]
 
 include::modules/migration-using-must-gather.adoc[leveloffset=+1]
+include::modules/oadp-monitoring.adoc[leveloffset=+1]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../monitoring/monitoring-overview.adoc#about-openshift-monitoring[Monitoring stack]
+
+include::modules/oadp-monitoring-setup.adoc[leveloffset=+2]
+include::modules/oadp-creating-service-monitor.adoc[leveloffset=+2]
+include::modules/oadp-creating-alerting-rule.adoc[leveloffset=+2]
+[role="_additional-resources"]
+.Additional resources
+* xref:../../monitoring/managing-alerts.adoc#managing-alerts[Managing alerts]
+
+include::modules/oadp-list-of-metrics.adoc[leveloffset=+2]
+include::modules/oadp-viewing-metrics-ui.adoc[leveloffset=+2]
 
 :!oadp-troubleshooting:
@@ -82,40 +82,3 @@ $ oc adm must-gather --image={must-gather} \
 +
 This operation can take a long time. The data is saved as `must-gather/metrics/prom_data.tar.gz`.
 
-[discrete]
-[id="viewing-data-with-prometheus-console_{context}"]
-== Viewing metrics data with the Prometheus console
-
-You can view the metrics data with the Prometheus console.
-
-.Procedure
-
-. Decompress the `prom_data.tar.gz` file:
-+
-[source,terminal]
-----
-$ tar -xvzf must-gather/metrics/prom_data.tar.gz
-----
-
-. Create a local Prometheus instance:
-+
-[source,terminal]
-----
-$ make prometheus-run
-----
-+
-The command outputs the Prometheus URL.
-+
-.Output
-[source,terminal]
-----
-Started Prometheus on http://localhost:9090
-----
-
-. Launch a web browser and navigate to the URL to view the data by using the Prometheus web console.
-. After you have viewed the data, delete the Prometheus instance and data:
-+
-[source,terminal]
-----
-$ make prometheus-cleanup
-----
@@ -0,0 +1,66 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
+
+:_content-type: PROCEDURE
+[id="creating-alerting-rules_{context}"]
+= Creating an alerting rule
+
+The {product-title} monitoring stack can send alerts that are configured by using alerting rules. To create an alerting rule for the OADP project, use one of the metrics that are scraped by user workload monitoring.
+
+.Procedure
+
+. Create a `PrometheusRule` YAML file with the sample `OADPBackupFailing` alert and save it as `4_create_oadp_alert_rule.yaml`.
++
+.Sample `OADPBackupFailing` alert
+[source,yaml]
+----
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: sample-oadp-alert
+  namespace: openshift-adp
+spec:
+  groups:
+  - name: sample-oadp-backup-alert
+    rules:
+    - alert: OADPBackupFailing
+      annotations:
+        description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
+        summary: OADP has issues creating backups
+      expr: |
+        increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
+      for: 5m
+      labels:
+        severity: warning
+----
++
+In this sample, the alert is displayed under the following conditions:
++
+* The number of failed backups has increased by more than 0 during the last 2 hours, and this state persists for at least 5 minutes.
+* If less than 5 minutes have passed since the first increase, the alert is in a `Pending` state. After 5 minutes, it changes to a `Firing` state.
+
+. Apply the `4_create_oadp_alert_rule.yaml` file, which creates the `PrometheusRule` object in the `openshift-adp` namespace:
++
+[source,terminal]
+----
+$ oc apply -f 4_create_oadp_alert_rule.yaml
+----
++
+.Example output
+[source,terminal]
+----
+prometheusrule.monitoring.coreos.com/sample-oadp-alert created
+----
+
+.Verification
+. After the alert is triggered, you can view it in the following ways:
+** In the *Developer* perspective, select the *Observe* menu.
+** In the *Administrator* perspective, go to *Observe* -> *Alerting* and select *User* in the *Filter* box. By default, only *Platform* alerts are displayed.
++
+.OADP backup failing alert
+image::oadp-backup-failing-alert.png[OADP backup failing alert]
+
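+The same pattern can be applied to other metrics that are scraped by user workload monitoring. The following is a minimal sketch, not part of the procedure above, of a rule that alerts when no backup has succeeded recently. The object name `sample-oadp-stale-backup-alert`, the alert name `OADPBackupStale`, and the 24-hour threshold are assumptions chosen for illustration. Adjust them to match your backup schedule.
+
+.Sample `OADPBackupStale` alert (hypothetical)
+[source,yaml]
+----
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: sample-oadp-stale-backup-alert # hypothetical name
+  namespace: openshift-adp
+spec:
+  groups:
+  - name: sample-oadp-stale-backup-alert
+    rules:
+    - alert: OADPBackupStale # hypothetical alert name
+      annotations:
+        description: 'The last successful OADP backup finished {{$value | humanizeDuration}} ago.'
+        summary: OADP has not completed a successful backup recently
+      # 86400 seconds = 24 hours; this threshold is an assumption.
+      expr: |
+        time() - velero_backup_last_successful_timestamp{job="openshift-adp-velero-metrics-svc"} > 86400
+      for: 5m
+      labels:
+        severity: warning
+----
+
+Because `velero_backup_last_successful_timestamp` is a gauge that holds a Unix timestamp, the expression compares it with the current time instead of using `increase()`, which is intended for counters.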
+
@@ -0,0 +1,76 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
+
+:_content-type: PROCEDURE
+[id="oadp-creating-service-monitor_{context}"]
+= Creating an OADP service monitor
+
+OADP provides an `openshift-adp-velero-metrics-svc` service, which is created when the Data Protection Application (DPA) is configured. The service monitor that user workload monitoring uses must point to this service.
+
+The following procedure verifies that the service exists and creates a matching `ServiceMonitor` object.
+
+.Procedure
+
+. Ensure that the `openshift-adp-velero-metrics-svc` service exists. It must have the `app.kubernetes.io/name=velero` label, which is used as the selector for the `ServiceMonitor` object.
++
+[source,terminal]
+----
+$ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
+----
++
+.Example output
+[source,terminal]
+----
+NAME                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
+openshift-adp-velero-metrics-svc   ClusterIP   172.30.38.244   <none>        8085/TCP   1h
+----
+
+. Create a `ServiceMonitor` YAML file that matches the existing service label, and save the file as `3_create_oadp_service_monitor.yaml`. The service monitor is created in the `openshift-adp` namespace where the `openshift-adp-velero-metrics-svc` service resides.
++
+.Example `ServiceMonitor` object
+[source,yaml]
+----
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  labels:
+    app: oadp-service-monitor
+  name: oadp-service-monitor
+  namespace: openshift-adp
+spec:
+  endpoints:
+  - interval: 30s
+    path: /metrics
+    targetPort: 8085
+    scheme: http
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: "velero"
+----
++
+. Apply the `3_create_oadp_service_monitor.yaml` file:
++
+[source,terminal]
+----
+$ oc apply -f 3_create_oadp_service_monitor.yaml
+----
++
+.Example output
+[source,terminal]
+----
+servicemonitor.monitoring.coreos.com/oadp-service-monitor created
+----
+
+.Verification
+
+* Confirm that the new service monitor is in an *Up* state by using the *Administrator* perspective of the {product-title} web console:
+.. Navigate to the *Observe* -> *Targets* page.
+.. Ensure that the *Filter* is unselected or that the *User* source is selected, and type `openshift-adp` in the *Text* search field.
+.. Verify that the *Status* of the service monitor is *Up*.
++
+.OADP metrics targets
+image::oadp-metrics-targets.png[OADP metrics targets]
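+
+This procedure relies on user workload monitoring being enabled. If the target does not appear, one possible cause is that user workload monitoring is not enabled on the cluster. The following is a minimal sketch of the config map that a cluster administrator typically uses to enable it. Treat it as an assumption about your cluster: if a `cluster-monitoring-config` config map already exists, edit the existing object instead of replacing it.
+
+.Sample `cluster-monitoring-config` config map (sketch)
+[source,yaml]
+----
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: cluster-monitoring-config
+  namespace: openshift-monitoring
+data:
+  config.yaml: |
+    # Enables monitoring for user-defined projects.
+    enableUserWorkload: true
+----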
@@ -0,0 +1,179 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/application_backup_and_restore/troubleshooting.adoc
+
+:_content-type: REFERENCE
+[id="list-of-metrics_{context}"]
+= List of available metrics
+
+The following table lists the metrics that OADP provides, together with their https://prometheus.io/docs/concepts/metric_types/[types].
+
+|===
+|Metric name |Description |Type
+
+|`kopia_content_cache_hit_bytes`
+|Number of bytes retrieved from the cache
+|Counter
+
+|`kopia_content_cache_hit_count`
+|Number of times content was retrieved from the cache
+|Counter
+
+|`kopia_content_cache_malformed`
+|Number of times malformed content was read from the cache
+|Counter
+
+|`kopia_content_cache_miss_count`
+|Number of times content was not found in the cache and fetched
+|Counter
+
+|`kopia_content_cache_missed_bytes`
+|Number of bytes retrieved from the underlying storage
+|Counter
+
+|`kopia_content_cache_miss_error_count`
+|Number of times content could not be found in the underlying storage
+|Counter
+
+|`kopia_content_cache_store_error_count`
+|Number of times content could not be saved in the cache
+|Counter
+
+|`kopia_content_get_bytes`
+|Number of bytes retrieved using `GetContent()`
+|Counter
+
+|`kopia_content_get_count`
+|Number of times `GetContent()` was called
+|Counter
+
+|`kopia_content_get_error_count`
+|Number of times `GetContent()` was called and the result was an error
+|Counter
+
+|`kopia_content_get_not_found_count`
+|Number of times `GetContent()` was called and the result was not found
+|Counter
+
+|`kopia_content_write_bytes`
+|Number of bytes passed to `WriteContent()`
+|Counter
+
+|`kopia_content_write_count`
+|Number of times `WriteContent()` was called
+|Counter
+
+|`velero_backup_attempt_total`
+|Total number of attempted backups
+|Counter
+
+|`velero_backup_deletion_attempt_total`
+|Total number of attempted backup deletions
+|Counter
+
+|`velero_backup_deletion_failure_total`
+|Total number of failed backup deletions
+|Counter
+
+|`velero_backup_deletion_success_total`
+|Total number of successful backup deletions
+|Counter
+
+|`velero_backup_duration_seconds`
+|Time taken to complete backup, in seconds
+|Histogram
+
+|`velero_backup_failure_total`
+|Total number of failed backups
+|Counter
+
+|`velero_backup_items_errors`
+|Total number of errors encountered during backup
+|Gauge
+
+|`velero_backup_items_total`
+|Total number of items backed up
+|Gauge
+
+|`velero_backup_last_status`
+|Last status of the backup. A value of 1 is success, 0 is failure.
+|Gauge
+
+|`velero_backup_last_successful_timestamp`
+|Last time a backup ran successfully, Unix timestamp in seconds
+|Gauge
+
+|`velero_backup_partial_failure_total`
+|Total number of partially failed backups
+|Counter
+
+|`velero_backup_success_total`
+|Total number of successful backups
+|Counter
+
+|`velero_backup_tarball_size_bytes`
+|Size, in bytes, of a backup
+|Gauge
+
+|`velero_backup_total`
+|Current number of existing backups
+|Gauge
+
+|`velero_backup_validation_failure_total`
+|Total number of backups that failed validation
+|Counter
+
+|`velero_backup_warning_total`
+|Total number of backups that completed with warnings
+|Counter
+
+|`velero_csi_snapshot_attempt_total`
+|Total number of attempted CSI volume snapshots
+|Counter
+
+|`velero_csi_snapshot_failure_total`
+|Total number of failed CSI volume snapshots
+|Counter
+
+|`velero_csi_snapshot_success_total`
+|Total number of successful CSI volume snapshots
+|Counter
+
+|`velero_restore_attempt_total`
+|Total number of attempted restores
+|Counter
+
+|`velero_restore_failed_total`
+|Total number of failed restores
+|Counter
+
+|`velero_restore_partial_failure_total`
+|Total number of partially failed restores
+|Counter
+
+|`velero_restore_success_total`
+|Total number of successful restores
+|Counter
+
+|`velero_restore_total`
+|Current number of existing restores
+|Gauge
+
+|`velero_restore_validation_failed_total`
+|Total number of restores that failed validation
+|Counter
+
+|`velero_volume_snapshot_attempt_total`
+|Total number of attempted volume snapshots
+|Counter
+
+|`velero_volume_snapshot_failure_total`
+|Total number of failed volume snapshots
+|Counter
+
+|`velero_volume_snapshot_success_total`
+|Total number of successful volume snapshots
+|Counter
+
+|===
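+
+The metric type determines how a metric is usually queried. Counters are typically wrapped in `increase()` or `rate()`, gauges are compared directly, and histograms are queried through their `_bucket` series with `histogram_quantile()`. The following sketch shows one counter-based and one histogram-based rule. The object name, the alert names, the time windows, and the one-hour threshold are assumptions chosen for illustration, not values recommended by OADP.
+
+.Sample rules for a counter and a histogram (hypothetical)
+[source,yaml]
+----
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: sample-oadp-metric-types # hypothetical name
+  namespace: openshift-adp
+spec:
+  groups:
+  - name: sample-oadp-metric-types
+    rules:
+    # Counter: alert on any new failed restore in the last 2 hours.
+    - alert: OADPRestoreFailing # hypothetical alert name
+      annotations:
+        summary: OADP has issues restoring backups
+      expr: |
+        increase(velero_restore_failed_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
+      for: 5m
+      labels:
+        severity: warning
+    # Histogram: alert when the 95th percentile backup duration exceeds 1 hour (3600 seconds).
+    - alert: OADPBackupSlow # hypothetical alert name
+      annotations:
+        summary: OADP backups are taking longer than expected
+      expr: |
+        histogram_quantile(0.95, sum by (le) (rate(velero_backup_duration_seconds_bucket{job="openshift-adp-velero-metrics-svc"}[6h]))) > 3600
+      for: 5m
+      labels:
+        severity: warning
+----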
+