content/integrate/redis-data-integration/observability.md
Lines changed: 96 additions & 28 deletions
@@ -46,6 +46,49 @@ These metrics are divided into three groups:
- **Pipeline state**: metrics about the pipeline mode and connectivity
- **Data flow counters**: counters for data breakdown per source table
- **Processing performance**: processing speed of RDI micro batches

The following table lists all collector metrics and their descriptions:

| Metric | Type | Description | Alerting Recommendations |
|:--|:--|:--|:--|
| **Schema History Metrics** | | | |
| ChangesApplied | Counter | Total number of schema changes applied during recovery and runtime | Monitor for unexpected spikes (rate > 10/hour) |
| ChangesRecovered | Counter | Number of changes that were read during the recovery phase | Alert if recovery fails (value stops increasing during recovery) |
| MilliSecondsSinceLastAppliedChange | Gauge | Number of milliseconds since the last change was applied | Alert if > 300,000 ms (5 minutes) during active schema changes |
| MilliSecondsSinceLastRecoveredChange | Gauge | Number of milliseconds since the last change was recovered from the history store | Alert if > 600,000 ms (10 minutes) during recovery |
| RecoveryStartTime | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Monitor for prolonged recovery (> 30 minutes) |
| **Connection and State Metrics** | | | |
| Connected | Gauge | Whether the connector is currently connected to the database (1=connected, 0=disconnected) | **Critical Alert**: Alert if value = 0 (disconnected) |
| **Queue Metrics** | | | |
| CurrentQueueSizeInBytes | Gauge | Current size of the connector's internal queue in bytes | Alert if > 80% of MaxQueueSizeInBytes |
| MaxQueueSizeInBytes | Gauge | Maximum configured size of the connector's internal queue in bytes | Informational - use for capacity planning |
| QueueRemainingCapacity | Gauge | Remaining capacity of the connector's internal queue | **High Priority**: Alert if < 20% of total capacity |
| QueueTotalCapacity | Gauge | Total capacity of the connector's internal queue | Informational - use for capacity planning |
| **Streaming Performance Metrics** | | | |
| MilliSecondsBehindSource | Gauge | Number of milliseconds the connector is behind the source database (-1 if not applicable) | **High Priority**: Alert if > 60,000 ms (1 minute) behind source |
| MilliSecondsSinceLastEvent | Gauge | Number of milliseconds since the connector processed the most recent event (-1 if not applicable) | **Critical Alert**: Alert if > 300,000 ms (5 minutes) in active systems |
| NumberOfCommittedTransactions | Counter | Number of committed transactions processed by the connector | Monitor rate - alert if drops to 0 for > 10 minutes in active systems |
| NumberOfEventsFiltered | Counter | Number of events filtered by include/exclude list rules | Monitor rate for unexpected increases (> 50% of total events) |
| **Event Counters** | | | |
| TotalNumberOfCreateEventsSeen | Counter | Total number of CREATE (INSERT) events seen by the connector | Monitor rate for business logic validation |
| TotalNumberOfDeleteEventsSeen | Counter | Total number of DELETE events seen by the connector | Monitor rate for business logic validation |
| TotalNumberOfEventsSeen | Counter | Total number of events seen by the connector | **High Priority**: Alert if rate drops to 0 for > 10 minutes in active systems |
| TotalNumberOfUpdateEventsSeen | Counter | Total number of UPDATE events seen by the connector | Monitor rate for business logic validation |
| NumberOfErroneousEvents | Counter | Number of events that caused errors during processing | **Critical Alert**: Alert if > 0 (any errors) |
| **Snapshot Metrics** | | | |
| RemainingTableCount | Gauge | Number of tables remaining to be processed during snapshot | Monitor for stuck snapshots (no change for > 30 minutes) |
| RowsScanned | Counter | Number of rows scanned per table during snapshot (reported per table) | Monitor rate for progress tracking |
| SnapshotAborted | Gauge | Whether the snapshot was aborted (1=aborted, 0=not aborted) | **Critical Alert**: Alert if value = 1 (aborted) |
| SnapshotCompleted | Gauge | Whether the snapshot completed successfully (1=completed, 0=not completed) | Monitor for successful completion |
| SnapshotDurationInSeconds | Gauge | Total duration of the snapshot process in seconds | Alert if exceeds expected duration (> 4 hours for large datasets) |
| SnapshotPaused | Gauge | Whether the snapshot is currently paused (1=paused, 0=not paused) | Alert if paused unexpectedly (value = 1) |
| SnapshotPausedDurationInSeconds | Gauge | Total time the snapshot was paused in seconds | Alert if paused > 1800 seconds (30 minutes) |
| SnapshotRunning | Gauge | Whether a snapshot is currently running (1=running, 0=not running) | Monitor for unexpected state changes |
| TotalTableCount | Gauge | Total number of tables included in the snapshot | Informational - use for progress calculation |

{{< note >}}
Many metrics include context labels that specify the phase (`snapshot` or `streaming`), database name, and other contextual information. Metrics with a value of `-1` typically indicate that the measurement is not applicable in the current state.
{{< /note >}}

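The queue thresholds in the collector table (more than 80% of `MaxQueueSizeInBytes` used, less than 20% of total capacity remaining) can be sketched as a small check. This is an illustration only: the `QueueSample` class, its field names, and the `queue_alerts` function are hypothetical and not part of RDI, and the sketch assumes you have already scraped the four gauge values.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One scrape of the collector's queue gauges (hypothetical container)."""
    current_queue_size_in_bytes: float  # CurrentQueueSizeInBytes
    max_queue_size_in_bytes: float      # MaxQueueSizeInBytes
    queue_remaining_capacity: float     # QueueRemainingCapacity
    queue_total_capacity: float         # QueueTotalCapacity


def queue_alerts(sample: QueueSample) -> list[str]:
    """Return alert messages for the two queue thresholds in the table."""
    alerts = []
    # Skip the byte check when no maximum is reported (assumed here to mean
    # that no byte limit is configured), which also avoids dividing by zero.
    if sample.max_queue_size_in_bytes > 0:
        used = sample.current_queue_size_in_bytes / sample.max_queue_size_in_bytes
        if used > 0.80:
            alerts.append(f"queue bytes at {used:.0%} of MaxQueueSizeInBytes")
    if sample.queue_total_capacity > 0:
        free = sample.queue_remaining_capacity / sample.queue_total_capacity
        if free < 0.20:
            alerts.append(f"queue remaining capacity at {free:.0%} (high priority)")
    return alerts


if __name__ == "__main__":
    print(queue_alerts(QueueSample(900, 1000, 100, 1000)))
```

Both checks are ratios rather than absolute values, so the same thresholds apply regardless of how the queue is sized.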
## Stream processor metrics
@@ -55,34 +98,59 @@ RDI reports metrics during the two main phases of the ingest pipeline, the *snap
phase and the *change data capture (CDC)* phase. (See the
| `incoming_records_total` | Counter | Total number of incoming records processed by the system | **High Priority**: Alert if rate drops to 0 for > 10 minutes in active systems |
| `incoming_records_created` | Gauge | Timestamp when the incoming records counter was created | Informational - no alerting needed |
| `processed_records_total` | Counter | Total number of records that have been successfully processed | Monitor processing rate - alert if significantly slower than incoming rate |
| `rejected_records_total` | Counter | Total number of records that were rejected during processing | **Critical Alert**: Alert if > 0 (any rejections indicate data quality issues) |
| `filtered_records_total` | Counter | Total number of records that were filtered out during processing | Monitor rate - alert if > 50% of incoming records are filtered |
| `rdi_engine_state` | Gauge | Current state of the RDI engine with labels for `state` (e.g., STARTED, RUNNING) and `sync_mode` (e.g., SNAPSHOT, STREAMING) | **Critical Alert**: Alert if state != "RUNNING" for > 5 minutes |
| `rdi_version_info` | Gauge | Version information for RDI components with labels for `cli` and `engine` versions | Informational - use for version tracking |
| `monitor_time_elapsed_total` | Counter | Total time elapsed (in seconds) since monitoring started | Informational - use for uptime tracking |
| `monitor_time_elapsed_created` | Gauge | Timestamp when the monitor time elapsed counter was created | Informational - no alerting needed |
| `rdi_incoming_entries` | Gauge | Count of incoming events by `data_source` and `operation` type (pending, inserted, updated, deleted, filtered, rejected) | **High Priority**: Alert if "rejected" > 0 or "pending" accumulates without processing |
| `rdi_stream_event_latency_ms` | Gauge | Latency in milliseconds of the oldest event in each data stream, labeled by `data_source` | **High Priority**: Alert if > 60,000 ms (1 minute) for real-time use cases |

{{< note >}}
**Additional information about stream processor metrics:**

- The `rdi_` prefix comes from the Kubernetes namespace where RDI is installed. For VM installations, the prefix is always `rdi_`.
- Metrics with the `_created` suffix are generated automatically by Prometheus to record when the corresponding metric was first created.
- The `rdi_incoming_entries` metric provides a detailed breakdown by operation type for each data source.
- The `rdi_stream_event_latency_ms` metric helps you monitor data freshness and processing delays.
{{< /note >}}
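Several recommendations for the stream processor counters are expressed as rates or ratios rather than raw values. The sketch below shows one way to derive them from two scrapes of the same counters. The function names and the dictionary layout are hypothetical. The sketch also assumes that the caller picks the scrape interval (for example, 10 minutes apart for the "rate drops to 0" check) and that an increase in `rejected_records_total` within the window is what you want to alert on.

```python
def counter_delta(earlier: float, later: float) -> float:
    """Increase of a counter between two scrapes.

    A lower later value is treated as a counter reset (for example after a
    restart), in which case the later value is the increase since the reset.
    """
    return later - earlier if later >= earlier else later


def stream_processor_alerts(earlier: dict, later: dict) -> list[str]:
    """Compare two scrapes of the stream processor counters."""
    def increase(name: str) -> float:
        return counter_delta(earlier[name], later[name])

    incoming = increase("incoming_records_total")
    filtered = increase("filtered_records_total")
    rejected = increase("rejected_records_total")

    alerts = []
    if incoming == 0:
        alerts.append("no incoming records in window")
    elif filtered / incoming > 0.5:
        alerts.append("more than 50% of incoming records filtered")
    if rejected > 0:
        alerts.append("records rejected in window")
    return alerts


if __name__ == "__main__":
    before = {"incoming_records_total": 100, "filtered_records_total": 10,
              "rejected_records_total": 0}
    after = {"incoming_records_total": 200, "filtered_records_total": 80,
             "rejected_records_total": 0}
    print(stream_processor_alerts(before, after))
```

Working with increases instead of absolute totals keeps the checks meaningful for long-running pipelines, where the totals only ever grow.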

## Recommended alerting strategy

Based on operational experience, the following metrics require immediate attention:

### Critical alerts (immediate response required)
- **`Connected = 0`**: Database connectivity lost
- **`NumberOfErroneousEvents > 0`**: Data processing errors occurring
- **`rejected_records_total > 0`**: Records being rejected (data quality issues)
- **`SnapshotAborted = 1`**: Snapshot process failed
- **`rdi_engine_state != "RUNNING"`**: RDI engine not in expected state

### High priority alerts (response within 15 minutes)
- **`MilliSecondsBehindSource > 60000`**: Replication lag exceeding 1 minute
- **`MilliSecondsSinceLastEvent > 300000`**: No events processed for 5+ minutes
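
The critical and high priority conditions listed above can be sketched as a single classification function. This is a minimal illustration, not part of RDI: the `classify_alerts` function and the flat dictionary of metric values are hypothetical, `rdi_engine_state` is assumed to have been reduced to the value of its `state` label, and the "for > 5 minutes" duration from the metrics table is left to the caller.

```python
def classify_alerts(metrics: dict) -> dict:
    """Sort the conditions from the alerting strategy into severity buckets.

    `metrics` maps metric names to their latest values. Missing metrics are
    skipped, and -1 (not applicable) never exceeds the lag thresholds.
    """
    critical, high = [], []

    if metrics.get("Connected") == 0:
        critical.append("Connected = 0")
    if metrics.get("NumberOfErroneousEvents", 0) > 0:
        critical.append("NumberOfErroneousEvents > 0")
    if metrics.get("rejected_records_total", 0) > 0:
        critical.append("rejected_records_total > 0")
    if metrics.get("SnapshotAborted") == 1:
        critical.append("SnapshotAborted = 1")
    # Assumption: the engine state is passed as the string from the `state` label.
    state = metrics.get("rdi_engine_state")
    if state is not None and state != "RUNNING":
        critical.append('rdi_engine_state != "RUNNING"')

    if metrics.get("MilliSecondsBehindSource", -1) > 60000:
        high.append("MilliSecondsBehindSource > 60000")
    if metrics.get("MilliSecondsSinceLastEvent", -1) > 300000:
        high.append("MilliSecondsSinceLastEvent > 300000")

    return {"critical": critical, "high": high}


if __name__ == "__main__":
    print(classify_alerts({"Connected": 0, "MilliSecondsBehindSource": 75000}))
```

In practice these conditions would more likely live in your monitoring system's alert rules; the function only makes the thresholds and their severities explicit.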