Commit 0efbaf2

Update OADP Monitoring documentation. (#1821)
Signed-off-by: Michal Pryc <[email protected]>
1 parent 2a3ef4c commit 0efbaf2

1 file changed

+44
-36
lines changed

docs/oadp_monitoring.md

Lines changed: 44 additions & 36 deletions
@@ -4,9 +4,9 @@
 
 ## Preface
 
-The OpenShift Container Platform provides a [monitoring stack](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/index) that allows users and administrators to effectively monitor and manage their OpenShift clusters, as well as monitor and analyze the workload performance of user applications and services running on the clusters including receiving alerts when some events occurs.
+The OpenShift Container Platform provides a [monitoring stack](https://access.redhat.com/documentation/en-us/openshift_container_platform/latest/html/monitoring/index) that allows users and administrators to effectively monitor and manage their OpenShift clusters, as well as monitor and analyze the workload performance of user applications and services running on the clusters including receiving alerts when some events occurs.
 
-The OADP Operator leverages an OpenShift [User Workload Monitoring](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/enabling-monitoring-for-user-defined-projects) provided by the OpenShift Monitoring Stack for retrieving number of [metrics](#metrics) from the Velero service endpoint. The monitoring stack allows creating user-defined Alerting Rules or querying metrics using the OpenShift Metrics query front-end.
+The OADP Operator leverages an OpenShift [User Workload Monitoring](https://access.redhat.com/documentation/en-us/openshift_container_platform/latest/html/monitoring/configuring-user-workload-monitoring) provided by the OpenShift Monitoring Stack for retrieving number of [metrics](#metrics) from the Velero service endpoint. The monitoring stack allows creating user-defined Alerting Rules or querying metrics using the OpenShift Metrics query front-end.
 
 With enabled User Workload Monitoring it is also possible to configure and use any Prometheus-compatible third-party UI, such as Grafana to visualize Velero metrics. Please note that the usage of third-party UIs falls outside the scope of this document.
 
@@ -26,7 +26,7 @@ Monitoring [metrics](#metrics) requires enabling monitoring for the user-defined
 
 ### Enable and Configure User Workload Monitoring
 
-This paragraph will provide a short set of instructions how to enable user workload monitoring for an OADP project in the cluster. For comprehensive set of configuration options refer to the [enabling monitoring for user-defined projects](https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/enabling-monitoring-for-user-defined-projects#doc-wrapper) documentation.
+This paragraph will provide a short set of instructions how to enable user workload monitoring for an OADP project in the cluster. For comprehensive set of configuration options refer to the [enabling monitoring for user-defined projects](https://docs.redhat.com/en/documentation/openshift_container_platform/latest/html/monitoring/configuring-user-workload-monitoring#enabling-monitoring-for-user-defined-projects_preparing-to-configure-the-monitoring-stack-uwm) documentation.
 
 
 1. Edit the `cluster-monitoring-config` ConfigMap object in the `openshift-monitoring` namespace and add or enable the `enableUserWorkload` option under `data/config.yaml`.
@@ -38,10 +38,10 @@ This paragraph will provide a short set of instructions how to enable user workl
 
 ```yaml
 apiVersion: v1
+kind: ConfigMap
 data:
   config.yaml: |
     enableUserWorkload: true # Add this option or set to true
-kind: ConfigMap
 metadata:
 # [...]
 ```
@@ -183,51 +183,59 @@ Please refer to the OpenShift documentation for detailed instructions on how to
 
 ### List of available metrics
 
+- `velero`: Used for general Velero metrics
+- `podVolume`: Used for Pod Volume Backup metrics
+
 Following is the list of metrics provided by the OADP together with their [Types](https://prometheus.io/docs/concepts/metric_types/)
 
+#### `velero` metrics
+
 | Metric Name | Description | Type |
 | ----------- | ----------- | --- |
-| kopia_content_cache_hit_bytes | Number of bytes retrieved from the cache | Counter |
-| kopia_content_cache_hit_count | Number of time content was retrieved from the cache | Counter |
-| kopia_content_cache_malformed | Number of times malformed content was read from the cache | Counter |
-| kopia_content_cache_miss_count | Number of time content was not found in the cache and fetched | Counter |
-| kopia_content_cache_missed_bytes | Number of bytes retrieved from the underlying storage | Counter |
-| kopia_content_cache_miss_error_count | Number of time content could not be found in the underlying storage | Counter |
-| kopia_content_cache_store_error_count | Number of time content could not be saved in the cache | Counter |
-| kopia_content_get_bytes | Number of bytes retrieved using GetContent | Counter |
-| kopia_content_get_count | Number of time GetContent() was called | Counter |
-| kopia_content_get_error_count | Number of time GetContent() was called and the result was an error | Counter |
-| kopia_content_get_not_found_count | Number of time GetContent() was called and the result was not found | Counter |
-| kopia_content_write_bytes | Number of bytes passed to WriteContent() | Counter |
-| kopia_content_write_count | Number of time WriteContent() was called | Counter |
+| velero_backup_tarball_size_bytes | Size, in bytes, of a backup | Gauge |
+| velero_backup_total | Current number of existent backups | Gauge |
 | velero_backup_attempt_total | Total number of attempted backups | Counter |
+| velero_backup_success_total | Total number of successful backups | Counter |
+| velero_backup_partial_failure_total | Total number of partially failed backups | Counter |
+| velero_backup_failure_total | Total number of failed backups | Counter |
+| velero_backup_validation_failure_total | Total number of validation failed backups | Counter |
+| velero_backup_duration_seconds | Time taken to complete backup, in seconds | Histogram |
 | velero_backup_deletion_attempt_total | Total number of attempted backup deletions | Counter |
-| velero_backup_deletion_failure_total | Total number of failed backup deletions | Counter |
 | velero_backup_deletion_success_total | Total number of successful backup deletions | Counter |
-| velero_backup_duration_seconds | Time taken to complete backup, in seconds | Histogram |
-| velero_backup_failure_total | Total number of failed backups | Counter |
-| velero_backup_items_errors | Total number of errors encountered during backup | Gauge |
-| velero_backup_items_total | Total number of items backed up | Gauge |
-| velero_backup_last_status | Last status of the backup. A value of 1 is success, 0 | Gauge |
+| velero_backup_deletion_failure_total | Total number of failed backup deletions | Counter |
 | velero_backup_last_successful_timestamp | Last time a backup ran successfully, Unix timestamp in seconds | Gauge |
-| velero_backup_partial_failure_total | Total number of partially failed backups | Counter |
-| velero_backup_success_total | Total number of successful backups | Counter |
-| velero_backup_tarball_size_bytes | Size, in bytes, of a backup | Gauge |
-| velero_backup_total | Current number of existent backups | Gauge |
-| velero_backup_validation_failure_total | Total number of validation failed backups | Counter |
+| velero_backup_items_total | Total number of items backed up | Gauge |
+| velero_backup_items_errors | Total number of errors encountered during backup | Gauge |
 | velero_backup_warning_total | Total number of warned backups | Counter |
-| velero_csi_snapshot_attempt_total | Total number of CSI attempted volume snapshots | Counter |
-| velero_csi_snapshot_failure_total | Total number of CSI failed volume snapshots | Counter |
-| velero_csi_snapshot_success_total | Total number of CSI successful volume snapshots | Counter |
-| velero_restore_attempt_total | Total number of attempted restores | Counter |
-| velero_restore_failed_total | Total number of failed restores | Counter |
-| velero_restore_partial_failure_total | Total number of partially failed restores | Counter |
-| velero_restore_success_total | Total number of successful restores | Counter |
+| velero_backup_last_status | Last status of the backup. A value of 1 is success, 0 is failure | Gauge |
 | velero_restore_total | Current number of existent restores | Gauge |
+| velero_restore_attempt_total | Total number of attempted restores | Counter |
 | velero_restore_validation_failed_total | Total number of failed restores failing validations | Counter |
+| velero_restore_success_total | Total number of successful restores | Counter |
+| velero_restore_partial_failure_total | Total number of partially failed restores | Counter |
+| velero_restore_failed_total | Total number of failed restores | Counter |
 | velero_volume_snapshot_attempt_total | Total number of attempted volume snapshots | Counter |
-| velero_volume_snapshot_failure_total | Total number of failed volume snapshots | Counter |
 | velero_volume_snapshot_success_total | Total number of successful volume snapshots | Counter |
+| velero_volume_snapshot_failure_total | Total number of failed volume snapshots | Counter |
+| velero_csi_snapshot_attempt_total | Total number of CSI attempted volume snapshots | Counter |
+| velero_csi_snapshot_success_total | Total number of CSI successful volume snapshots | Counter |
+| velero_csi_snapshot_failure_total | Total number of CSI failed volume snapshots | Counter |
+
+#### `podVolume` metrics
+
+| Metric Name | Description | Type |
+| ----------- | ----------- | --- |
+| podVolume_pod_volume_backup_enqueue_count | Total number of pod_volume_backup objects enqueued | Counter |
+| podVolume_pod_volume_backup_dequeue_count | Total number of pod_volume_backup objects dequeued | Counter |
+| podVolume_pod_volume_operation_latency_seconds | Time taken to complete pod volume operations, in seconds | Histogram |
+| podVolume_pod_volume_operation_latency_seconds_gauge | Gauge metric indicating time taken, in seconds, to perform pod volume operations | Gauge |
+| podVolume_data_upload_success_total | Total number of successful uploaded snapshots | Counter |
+| podVolume_data_upload_failure_total | Total number of failed uploaded snapshots | Counter |
+| podVolume_data_upload_cancel_total | Total number of canceled uploaded snapshots | Counter |
+| podVolume_data_download_success_total | Total number of successful downloaded snapshots | Counter |
+| podVolume_data_download_failure_total | Total number of failed downloaded snapshots | Counter |
+| podVolume_data_download_cancel_total | Total number of canceled downloaded snapshots | Counter |
+
 
 ### Viewing metrics using OpenShift Observe UI
 
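The preface mentions user-defined Alerting Rules, and the `velero` counters in the table above lend themselves to one. The following is a minimal sketch and is not part of this commit. The rule name, the `openshift-adp` namespace, the 2-hour window, and the severity are all assumptions to adapt to the actual installation. Only the metric name `velero_backup_failure_total` comes from the table.

```yaml
# Illustrative PrometheusRule; every name and threshold here is an assumption.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sample-oadp-alert          # hypothetical rule name
  namespace: openshift-adp         # assumed namespace where OADP is installed
spec:
  groups:
    - name: sample-oadp-backup-alert
      rules:
        - alert: OADPBackupFailing
          # Fires when the failed-backup counter increased within the last 2 hours.
          expr: increase(velero_backup_failure_total[2h]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: OADP has issues creating backups
            description: "OADP had {{ $value | humanize }} backup failures over the last 2 hours."
```

Because `velero_backup_failure_total` is a Counter, the rule alerts on its `increase()` over a window, not on its absolute value, which only ever grows until the Velero pod restarts.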