content/integrate/redis-data-integration/observability.md
+53 -53 (53 additions & 53 deletions)
@@ -52,39 +52,39 @@ The following table lists all collector metrics and their descriptions:
52
52
| Metric | Type | Description | Alerting Recommendations |
53
53
|:--|:--|:--|:--|
54
54
|**Schema History Metrics**||||
55
-
| ChangesApplied | Counter | Total number of schema changes applied during recovery and runtime | Informational - monitor for trends |
56
-
| ChangesRecovered | Counter | Number of changes that were read during the recovery phase | Informational - monitor for trends |
57
-
| MilliSecondsSinceLastAppliedChange | Gauge | Number of milliseconds since the last change was applied | Informational - monitor for trends |
58
-
| MilliSecondsSinceLastRecoveredChange | Gauge | Number of milliseconds since the last change was recovered from the history store | Informational - monitor for trends |
59
-
| RecoveryStartTime | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Informational - monitor for trends |
55
+
|`ChangesApplied`| Counter | Total number of schema changes applied during recovery and runtime | Informational - monitor for trends |
56
+
|`ChangesRecovered`| Counter | Number of changes that were read during the recovery phase | Informational - monitor for trends |
57
+
|`MilliSecondsSinceLastAppliedChange`| Gauge | Number of milliseconds since the last change was applied | Informational - monitor for trends |
58
+
|`MilliSecondsSinceLastRecoveredChange`| Gauge | Number of milliseconds since the last change was recovered from the history store | Informational - monitor for trends |
59
+
|`RecoveryStartTime`| Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Informational - monitor for trends |
60
60
|**Connection and State Metrics**||||
61
-
| Connected | Gauge | Whether the connector is currently connected to the database (1=connected, 0=disconnected) |**Critical Alert**: Alert if value = 0 (disconnected) |
61
+
|`Connected`| Gauge | Whether the collector is currently connected to the database (1=connected, 0=disconnected) |**Critical Alert**: Alert if value = 0 (disconnected) |
62
62
|**Queue Metrics**||||
63
-
| CurrentQueueSizeInBytes | Gauge | Current size of the connector's internal queue in bytes | Informational - monitor for trends |
64
-
| MaxQueueSizeInBytes | Gauge | Maximum configured size of the connector's internal queue in bytes | Informational - use for capacity planning |
65
-
| QueueRemainingCapacity | Gauge | Remaining capacity of the connector's internal queue | Informational - monitor for trends |
66
-
| QueueTotalCapacity | Gauge | Total capacity of the connector's internal queue | Informational - use for capacity planning |
63
+
|`CurrentQueueSizeInBytes`| Gauge | Current size of the collector's internal queue in bytes | Informational - monitor for trends |
64
+
|`MaxQueueSizeInBytes`| Gauge | Maximum configured size of the collector's internal queue in bytes | Informational - use for capacity planning |
65
+
|`QueueRemainingCapacity`| Gauge | Remaining capacity of the collector's internal queue | Informational - monitor for trends |
66
+
|`QueueTotalCapacity`| Gauge | Total capacity of the collector's internal queue | Informational - use for capacity planning |
67
67
|**Streaming Performance Metrics**||||
68
-
| MilliSecondsBehindSource | Gauge | Number of milliseconds the connector is behind the source database (-1 if not applicable) | Informational - monitor for trends and business SLA requirements |
69
-
| MilliSecondsSinceLastEvent | Gauge | Number of milliseconds since the connector processed the most recent event (-1 if not applicable) | Informational - monitor for trends in active systems |
70
-
| NumberOfCommittedTransactions | Counter | Number of committed transactions processed by the connector| Informational - monitor for trends |
71
-
| NumberOfEventsFiltered | Counter | Number of events filtered by include/exclude list rules | Informational - monitor for trends |
68
+
|`MilliSecondsBehindSource`| Gauge | Number of milliseconds the collector is behind the source database (-1 if not applicable) | Informational - monitor for trends and business SLA requirements |
69
+
|`MilliSecondsSinceLastEvent`| Gauge | Number of milliseconds since the collector processed the most recent event (-1 if not applicable) | Informational - monitor for trends in active systems |
70
+
|`NumberOfCommittedTransactions`| Counter | Number of committed transactions processed by the collector| Informational - monitor for trends |
71
+
|`NumberOfEventsFiltered`| Counter | Number of events filtered by include/exclude list rules | Informational - monitor for trends |
72
72
|**Event Counters**||||
73
-
| TotalNumberOfCreateEventsSeen | Counter | Total number of CREATE (INSERT) events seen by the connector| Informational - monitor for trends |
74
-
| TotalNumberOfDeleteEventsSeen | Counter | Total number of DELETE events seen by the connector| Informational - monitor for trends |
75
-
| TotalNumberOfEventsSeen | Counter | Total number of events seen by the connector| Informational - monitor for trends |
76
-
| TotalNumberOfUpdateEventsSeen | Counter | Total number of UPDATE events seen by the connector| Informational - monitor for trends |
77
-
| NumberOfErroneousEvents | Counter | Number of events that caused errors during processing |**Critical Alert**: Alert if > 0 (indicates processing failures) |
73
+
|`TotalNumberOfCreateEventsSeen`| Counter | Total number of CREATE (INSERT) events seen by the collector| Informational - monitor for trends |
74
+
|`TotalNumberOfDeleteEventsSeen`| Counter | Total number of DELETE events seen by the collector| Informational - monitor for trends |
75
+
|`TotalNumberOfEventsSeen`| Counter | Total number of events seen by the collector| Informational - monitor for trends |
76
+
|`TotalNumberOfUpdateEventsSeen`| Counter | Total number of UPDATE events seen by the collector| Informational - monitor for trends |
77
+
|`NumberOfErroneousEvents`| Counter | Number of events that caused errors during processing |**Critical Alert**: Alert if > 0 (indicates processing failures) |
78
78
|**Snapshot Metrics**||||
79
-
| RemainingTableCount | Gauge | Number of tables remaining to be processed during snapshot | Informational - monitor snapshot progress |
80
-
| RowsScanned | Counter | Number of rows scanned per table during snapshot (reported per table) | Informational - monitor snapshot progress |
81
-
| SnapshotAborted | Gauge | Whether the snapshot was aborted (1=aborted, 0=not aborted) |**Critical Alert**: Alert if value = 1 (snapshot failed) |
|`SnapshotDurationInSeconds`| Gauge | Total duration of the snapshot process in seconds | Informational - monitor for performance trends |
84
+
|`SnapshotPaused`| Gauge | Whether the snapshot is currently paused (1=paused, 0=not paused) | Informational - monitor snapshot state |
85
+
|`SnapshotPausedDurationInSeconds`| Gauge | Total time the snapshot was paused in seconds | Informational - monitor snapshot state |
86
+
|`SnapshotRunning`| Gauge | Whether a snapshot is currently running (1=running, 0=not running) | Informational - monitor snapshot state |
87
+
|`TotalTableCount`| Gauge | Total number of tables included in the snapshot | Informational - use for progress calculation |
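
The table marks `TotalTableCount` as useful for progress calculation and the queue capacity metrics as useful for capacity planning. The following is a minimal sketch of how those derived values could be computed from scraped metric values. The function names are hypothetical, and the sketch assumes that non-positive totals or negative values (such as `-1`) mean the measurement is not applicable.

```python
from typing import Optional


def snapshot_progress(total_tables: int, remaining_tables: int) -> Optional[float]:
    """Fraction of snapshot tables already processed, or None if not applicable.

    Inputs correspond to TotalTableCount and RemainingTableCount.
    """
    # Assumption: a non-positive total or a negative remaining count
    # means no snapshot information is available.
    if total_tables <= 0 or remaining_tables < 0:
        return None
    done = total_tables - remaining_tables
    # Clamp to [0, 1] in case the two gauges were scraped at different moments.
    return max(0.0, min(1.0, done / total_tables))


def queue_utilization(total_capacity: int, remaining_capacity: int) -> Optional[float]:
    """Fraction of the internal queue in use, or None if not applicable.

    Inputs correspond to QueueTotalCapacity and QueueRemainingCapacity.
    """
    if total_capacity <= 0 or remaining_capacity < 0:
        return None
    return 1.0 - remaining_capacity / total_capacity
```

For example, `snapshot_progress(10, 4)` reports that 60% of the tables are done. Values like these are better suited to a dashboard panel than to an alert threshold.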
88
88
89
89
{{< note >}}
90
90
Many metrics include context labels that specify the phase (`snapshot` or `streaming`), database name, and other contextual information. Metrics with a value of `-1` typically indicate that the measurement is not applicable in the current state.
@@ -117,50 +117,50 @@ RDI reports with their descriptions.
117
117
{{< note >}}
118
118
**Additional information about stream processor metrics:**
119
119
120
-
-The `rdi_` prefix comes from the Kubernetes namespace where RDI is installed. For VM install it is always this value.
121
-
- Metrics with `_created` suffix are automatically generated by Prometheus for counters and gauges to track when they were first created.
122
-
- The `rdi_incoming_entries` metric provides detailed breakdown by operation type for each data source.
120
+
-Where a metric name has the `rdi_` prefix, the prefix is replaced by the Kubernetes namespace name if you supplied a custom name during installation. The prefix is always `rdi_` for VM installations.
121
+
- Metrics with the `_created` suffix are automatically generated by Prometheus for counters and gauges to track when they were first created.
122
+
- The `rdi_incoming_entries` metric provides a detailed breakdown for each data source by operation type.
123
123
- The `rdi_stream_event_latency_ms` metric helps monitor data freshness and processing delays.
124
124
{{< /note >}}
125
125
126
126
## Recommended alerting strategy
127
127
128
-
The following alerting strategy focuses on system failures and data integrity issues that require immediate attention. Most metrics are informational and should be monitored for trends rather than triggering alerts.
128
+
The alerting strategy described in the sections below focuses on system failures and data integrity issues that require immediate attention. Most other metrics are informational, so you should monitor them for trends rather than trigger alerts.
129
129
130
130
### Critical alerts (immediate response required)
131
131
132
-
These are the only alerts that should wake someone up or require immediate action:
132
+
These are the only alerts that require immediate action:
133
133
134
-
-**`Connected = 0`**: Database connectivity lost - RDI cannot function without database connection
135
-
-**`NumberOfErroneousEvents > 0`**: Data processing errors occurring - indicates data corruption or processing failures
136
-
-**`rejected_records_total > 0`**: Records being rejected - indicates data quality issues or processing failures
137
-
-**`SnapshotAborted = 1`**: Snapshot process failed - initial sync is incomplete
138
-
-**`rdi_engine_state`**: Alert only if the state indicates a clear failure condition (not just "not running")
134
+
-**`Connected = 0`**: Database connectivity has been lost. RDI cannot function without a database connection.
135
+
-**`NumberOfErroneousEvents > 0`**: Errors are occurring during data processing. This indicates data corruption or processing failures.
136
+
-**`rejected_records_total > 0`**: Records are being rejected. This indicates data quality issues or processing failures.
137
+
-**`SnapshotAborted = 1`**: The snapshot process has failed, so the initial sync is incomplete.
138
+
-**`rdi_engine_state`**: Alert on this metric only if the state indicates a clear failure condition (not just "not running").
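
As an illustration only, the numeric conditions in this list could be checked against a single sample of metric values as sketched below. The `critical_alerts` function and the flat dictionary layout are assumptions, not part of RDI. `rdi_engine_state` is left out because the set of states that count as a failure is not specified here.

```python
from typing import Dict, List


def critical_alerts(sample: Dict[str, float]) -> List[str]:
    """Return the critical conditions that hold for one sample of metric values.

    `sample` maps metric names to their latest values (a hypothetical layout).
    Metrics missing from the sample never trigger an alert.
    """
    alerts: List[str] = []
    if sample.get("Connected") == 0:
        alerts.append("Connected = 0")
    if sample.get("NumberOfErroneousEvents", 0) > 0:
        alerts.append("NumberOfErroneousEvents > 0")
    if sample.get("rejected_records_total", 0) > 0:
        alerts.append("rejected_records_total > 0")
    if sample.get("SnapshotAborted") == 1:
        alerts.append("SnapshotAborted = 1")
    return alerts


if __name__ == "__main__":
    # A disconnected collector whose snapshot was also aborted.
    print(critical_alerts({"Connected": 0, "SnapshotAborted": 1}))
```

In a real deployment these checks would more likely be written as alerting rules in the monitoring system itself. The sketch only makes the conditions explicit.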
139
139
140
140
### Important monitoring (but not alerts)
141
141
142
-
These metrics should be monitored on dashboards and reviewed regularly, but do not require automated alerts:
142
+
You should monitor these metrics on dashboards and review them regularly, but they don't require automated alerts:
143
143
144
-
-**Queue metrics**: Queue utilization can vary widely and hitting 0% or 100% capacity may be normal during certain operations
145
-
-**Latency metrics**: Lag and processing times depend heavily on business requirements and normal operational patterns
146
-
-**Event counters**: Event rates naturally vary based on application usage patterns
147
-
-**Snapshot progress**: Snapshot duration and progress depend on data size and are typically monitored manually
148
-
-**Schema changes**: Schema change frequency is highly application-dependent
144
+
-**Queue metrics**: Queue utilization can vary widely, and hitting 0% or 100% capacity may be normal during certain operations.
145
+
-**Latency metrics**: Lag and processing times depend heavily on business requirements and normal operational patterns.
146
+
-**Event counters**: Event rates naturally vary based on application usage patterns.
147
+
-**Snapshot progress**: Snapshot duration and progress depend on data size, so you should typically monitor them manually.
148
+
-**Schema changes**: Schema change frequency is highly application-dependent.
149
149
150
150
### Key principles for RDI alerting
151
151
152
-
1.**Alert on failures, not performance**: Focus alerts on system failures rather than performance degradation
153
-
2.**Business context matters**: Latency and throughput requirements vary significantly between organizations
154
-
3.**Establish baselines first**: Monitor metrics for weeks before setting any threshold-based alerts
155
-
4.**Avoid alert fatigue**: Too many alerts reduce response to truly critical issues
156
-
5.**Use dashboards for trends**: Most metrics are better suited for dashboard monitoring than alerting
152
+
-**Alert on failures, not performance**: Focus alerts on system failures rather than performance degradation.
153
+
-**Business context matters**: Latency and throughput requirements vary significantly between organizations.
154
+
-**Establish baselines first**: Monitor metrics for weeks before you set any threshold-based alerts.
155
+
-**Avoid alert fatigue**: If you see too many non-critical alerts, you are less likely to take truly critical issues seriously.
156
+
-**Use dashboards for trends**: Most metrics are better suited for dashboard monitoring than alerting.
157
157
158
158
### Monitoring best practices
159
159
160
-
-**Dashboard-first approach**: Use Grafana dashboards to visualize trends and patterns
161
-
-**Baseline establishment**: Monitor your specific workload for 2-4 weeks before considering additional alerts
162
-
-**Business SLA alignment**: Only create alerts for metrics that directly impact your business SLA requirements
163
-
-**Manual review**: Regularly review metric trends during business reviews rather than automated alerting
160
+
-**Dashboard-first approach**: Use Grafana dashboards to visualize trends and patterns.
161
+
-**Baseline establishment**: Monitor your specific workload for 2-4 weeks before you consider adding more alerts.
162
+
-**Business SLA alignment**: Only create alerts for metrics that directly impact your business SLA requirements.
163
+
-**Manual review**: Don't use automated alerts to review metric trends. Instead, schedule regular business reviews to check them manually.