docs/oadp_monitoring.md: 44 additions & 36 deletions
@@ -4,9 +4,9 @@
 
 ## Preface
 
-The OpenShift Container Platform provides a [monitoring stack](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/index) that allows users and administrators to effectively monitor and manage their OpenShift clusters, as well as monitor and analyze the workload performance of user applications and services running on the clusters including receiving alerts when some events occurs.
+The OpenShift Container Platform provides a [monitoring stack](https://access.redhat.com/documentation/en-us/openshift_container_platform/latest/html/monitoring/index) that allows users and administrators to effectively monitor and manage their OpenShift clusters, as well as monitor and analyze the workload performance of user applications and services running on the clusters including receiving alerts when some events occurs.
 
-The OADP Operator leverages an OpenShift [User Workload Monitoring](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/enabling-monitoring-for-user-defined-projects) provided by the OpenShift Monitoring Stack for retrieving number of [metrics](#metrics) from the Velero service endpoint. The monitoring stack allows creating user-defined Alerting Rules or querying metrics using the OpenShift Metrics query front-end.
+The OADP Operator leverages an OpenShift [User Workload Monitoring](https://access.redhat.com/documentation/en-us/openshift_container_platform/latest/html/monitoring/configuring-user-workload-monitoring) provided by the OpenShift Monitoring Stack for retrieving number of [metrics](#metrics) from the Velero service endpoint. The monitoring stack allows creating user-defined Alerting Rules or querying metrics using the OpenShift Metrics query front-end.
 
 With enabled User Workload Monitoring it is also possible to configure and use any Prometheus-compatible third-party UI, such as Grafana to visualize Velero metrics. Please note that the usage of third-party UIs falls outside the scope of this document.
 
@@ -26,7 +26,7 @@ Monitoring [metrics](#metrics) requires enabling monitoring for the user-defined
 
 ### Enable and Configure User Workload Monitoring
 
-This paragraph will provide a short set of instructions how to enable user workload monitoring for an OADP project in the cluster. For comprehensive set of configuration options refer to the [enabling monitoring for user-defined projects](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/enabling-monitoring-for-user-defined-projects#doc-wrapper) documentation.
+This paragraph will provide a short set of instructions how to enable user workload monitoring for an OADP project in the cluster. For comprehensive set of configuration options refer to the [enabling monitoring for user-defined projects](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/monitoring/configuring-user-workload-monitoring#enabling-monitoring-for-user-defined-projects_preparing-to-configure-the-monitoring-stack-uwm) documentation.
 
 
 1. Edit the `cluster-monitoring-config` ConfigMap object in the `openshift-monitoring` namespace and add or enable the `enableUserWorkload` option under `data/config.yaml`.
@@ -38,10 +38,10 @@ This paragraph will provide a short set of instructions how to enable user workl
 
 ```yaml
 apiVersion: v1
+kind: ConfigMap
 data:
   config.yaml: |
     enableUserWorkload: true # Add this option or set to true
-kind: ConfigMap
 metadata:
 # [...]
 ```
@@ -183,51 +183,59 @@ Please refer to the OpenShift documentation for detailed instructions on how to
 
 ### List of available metrics
 
+- `velero`: Used for general Velero metrics
+- `podVolume`: Used for Pod Volume Backup metrics
+
 Following is the list of metrics provided by the OADP together with their [Types](https://prometheus.io/docs/concepts/metric_types/)
 
+#### `velero` metrics
+
 | Metric Name | Description | Type |
 | ----------- | ----------- | --- |
-| kopia_content_cache_hit_bytes | Number of bytes retrieved from the cache | Counter |
-| kopia_content_cache_hit_count | Number of time content was retrieved from the cache | Counter |
-| kopia_content_cache_malformed | Number of times malformed content was read from the cache | Counter |
-| kopia_content_cache_miss_count | Number of time content was not found in the cache and fetched | Counter |
-| kopia_content_cache_missed_bytes | Number of bytes retrieved from the underlying storage | Counter |
-| kopia_content_cache_miss_error_count | Number of time content could not be found in the underlying storage | Counter |
-| kopia_content_cache_store_error_count | Number of time content could not be saved in the cache | Counter |
-| kopia_content_get_bytes | Number of bytes retrieved using GetContent | Counter |
-| kopia_content_get_count | Number of time GetContent() was called | Counter |
-| kopia_content_get_error_count | Number of time GetContent() was called and the result was an error | Counter |
-| kopia_content_get_not_found_count | Number of time GetContent() was called and the result was not found | Counter |
-| kopia_content_write_bytes | Number of bytes passed to WriteContent() | Counter |
-| kopia_content_write_count | Number of time WriteContent() was called | Counter |
+| velero_backup_tarball_size_bytes | Size, in bytes, of a backup | Gauge |
+| velero_backup_total | Current number of existent backups | Gauge |
 | velero_backup_attempt_total | Total number of attempted backups | Counter |
+| velero_backup_success_total | Total number of successful backups | Counter |
+| velero_backup_partial_failure_total | Total number of partially failed backups | Counter |
+| velero_backup_failure_total | Total number of failed backups | Counter |
+| velero_backup_validation_failure_total | Total number of validation failed backups | Counter |
+| velero_backup_duration_seconds | Time taken to complete backup, in seconds | Histogram |
 | velero_backup_deletion_attempt_total | Total number of attempted backup deletions | Counter |
-| velero_backup_deletion_failure_total | Total number of failed backup deletions | Counter |
 | velero_backup_deletion_success_total | Total number of successful backup deletions | Counter |
-| velero_backup_duration_seconds | Time taken to complete backup, in seconds | Histogram |
-| velero_backup_failure_total | Total number of failed backups | Counter |
-| velero_backup_items_errors | Total number of errors encountered during backup | Gauge |
-| velero_backup_items_total | Total number of items backed up | Gauge |
-| velero_backup_last_status | Last status of the backup. A value of 1 is success, 0 | Gauge |
+| velero_backup_deletion_failure_total | Total number of failed backup deletions | Counter |
 | velero_backup_last_successful_timestamp | Last time a backup ran successfully, Unix timestamp in seconds | Gauge |
-| velero_backup_partial_failure_total | Total number of partially failed backups | Counter |
-| velero_backup_success_total | Total number of successful backups | Counter |
-| velero_backup_tarball_size_bytes | Size, in bytes, of a backup | Gauge |
-| velero_backup_total | Current number of existent backups | Gauge |
-| velero_backup_validation_failure_total | Total number of validation failed backups | Counter |
+| velero_backup_items_total | Total number of items backed up | Gauge |
+| velero_backup_items_errors | Total number of errors encountered during backup | Gauge |
 | velero_backup_warning_total | Total number of warned backups | Counter |
-| velero_csi_snapshot_attempt_total | Total number of CSI attempted volume snapshots | Counter |
-| velero_csi_snapshot_failure_total | Total number of CSI failed volume snapshots | Counter |
-| velero_csi_snapshot_success_total | Total number of CSI successful volume snapshots | Counter |
-| velero_restore_attempt_total | Total number of attempted restores | Counter |
-| velero_restore_failed_total | Total number of failed restores | Counter |
-| velero_restore_partial_failure_total | Total number of partially failed restores | Counter |
-| velero_restore_success_total | Total number of successful restores | Counter |
+| velero_backup_last_status | Last status of the backup. A value of 1 is success, 0 is failure | Gauge |
 | velero_restore_total | Current number of existent restores | Gauge |
+| velero_restore_attempt_total | Total number of attempted restores | Counter |
 | velero_restore_validation_failed_total | Total number of failed restores failing validations | Counter |
+| velero_restore_success_total | Total number of successful restores | Counter |
+| velero_restore_partial_failure_total | Total number of partially failed restores | Counter |
+| velero_restore_failed_total | Total number of failed restores | Counter |
 | velero_volume_snapshot_attempt_total | Total number of attempted volume snapshots | Counter |
-| velero_volume_snapshot_failure_total | Total number of failed volume snapshots | Counter |
 | velero_volume_snapshot_success_total | Total number of successful volume snapshots | Counter |
+| velero_volume_snapshot_failure_total | Total number of failed volume snapshots | Counter |
+| velero_csi_snapshot_attempt_total | Total number of CSI attempted volume snapshots | Counter |
+| velero_csi_snapshot_success_total | Total number of CSI successful volume snapshots | Counter |
+| velero_csi_snapshot_failure_total | Total number of CSI failed volume snapshots | Counter |
+
+#### `podVolume` metrics
+
+| Metric Name | Description | Type |
+| ----------- | ----------- | --- |
+| podVolume_pod_volume_backup_enqueue_count | Total number of pod_volume_backup objects enqueued | Counter |
+| podVolume_pod_volume_backup_dequeue_count | Total number of pod_volume_backup objects dequeued | Counter |
+| podVolume_pod_volume_operation_latency_seconds | Time taken to complete pod volume operations, in seconds | Histogram |
+| podVolume_pod_volume_operation_latency_seconds_gauge | Gauge metric indicating time taken, in seconds, to perform pod volume operations | Gauge |
+| podVolume_data_upload_success_total | Total number of successful uploaded snapshots | Counter |
+| podVolume_data_upload_failure_total | Total number of failed uploaded snapshots | Counter |
+| podVolume_data_upload_cancel_total | Total number of canceled uploaded snapshots | Counter |
+| podVolume_data_download_success_total | Total number of successful downloaded snapshots | Counter |
+| podVolume_data_download_failure_total | Total number of failed downloaded snapshots | Counter |
+| podVolume_data_download_cancel_total | Total number of canceled downloaded snapshots | Counter |
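
Reader's note on the counters in the `velero` table: a minimal Python sketch of how two of them might be combined into a backup success ratio. It is not part of the patch. It assumes the samples arrive in Prometheus text exposition format without labels or timestamps (real Velero samples may carry labels, which this sketch does not handle). The helper names `parse_samples` and `backup_success_ratio` and the sample values are hypothetical.

```python
from typing import Dict, Optional


def parse_samples(text: str) -> Dict[str, float]:
    """Parse unlabelled 'name value' samples, skipping comments and blanks.

    Assumption: no labels and no trailing timestamps on the sample lines.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip '# HELP' / '# TYPE' lines and empty lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def backup_success_ratio(samples: Dict[str, float]) -> Optional[float]:
    """Return successful / attempted backups, or None if nothing was attempted."""
    attempts = samples.get("velero_backup_attempt_total", 0.0)
    if attempts == 0:
        return None
    return samples.get("velero_backup_success_total", 0.0) / attempts


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# TYPE velero_backup_attempt_total counter
velero_backup_attempt_total 8
# TYPE velero_backup_success_total counter
velero_backup_success_total 6
"""

if __name__ == "__main__":
    print(backup_success_ratio(parse_samples(SAMPLE)))
```

Because both metrics are counters, a ratio over their raw values covers the whole lifetime of the Velero process; for a windowed view, a query over a time range in the monitoring front-end would be the more usual route.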