Commit d6be677

DOC-5338: RDI enhance observability page with more metrics information
1 parent 10d4e26 commit d6be677

1 file changed: +96 -28 lines changed
content/integrate/redis-data-integration/observability.md

Lines changed: 96 additions & 28 deletions
@@ -46,6 +46,49 @@ These metrics are divided into three groups:
 - **Pipeline state**: metrics about the pipeline mode and connectivity
 - **Data flow counters**: counters for data breakdown per source table
 - **Processing performance**: processing speed of RDI micro batches
+
+The following table lists all collector metrics and their descriptions:
+
+| Metric | Type | Description | Alerting Recommendations |
+|:--|:--|:--|:--|
+| **Schema History Metrics** | | | |
+| ChangesApplied | Counter | Total number of schema changes applied during recovery and runtime | Monitor for unexpected spikes (rate > 10/hour) |
+| ChangesRecovered | Counter | Number of changes that were read during the recovery phase | Alert if recovery fails (value stops increasing during recovery) |
+| MilliSecondsSinceLastAppliedChange | Gauge | Number of milliseconds since the last change was applied | Alert if > 300,000ms (5 minutes) during active schema changes |
+| MilliSecondsSinceLastRecoveredChange | Gauge | Number of milliseconds since the last change was recovered from the history store | Alert if > 600,000ms (10 minutes) during recovery |
+| RecoveryStartTime | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Monitor for prolonged recovery (> 30 minutes) |
+| **Connection and State Metrics** | | | |
+| Connected | Gauge | Whether the connector is currently connected to the database (1=connected, 0=disconnected) | **Critical Alert**: Alert if value = 0 (disconnected) |
+| **Queue Metrics** | | | |
+| CurrentQueueSizeInBytes | Gauge | Current size of the connector's internal queue in bytes | Alert if > 80% of MaxQueueSizeInBytes |
+| MaxQueueSizeInBytes | Gauge | Maximum configured size of the connector's internal queue in bytes | Informational - use for capacity planning |
+| QueueRemainingCapacity | Gauge | Remaining capacity of the connector's internal queue | **High Priority**: Alert if < 20% of total capacity |
+| QueueTotalCapacity | Gauge | Total capacity of the connector's internal queue | Informational - use for capacity planning |
+| **Streaming Performance Metrics** | | | |
+| MilliSecondsBehindSource | Gauge | Number of milliseconds the connector is behind the source database (-1 if not applicable) | **High Priority**: Alert if > 60,000ms (1 minute) behind source |
+| MilliSecondsSinceLastEvent | Gauge | Number of milliseconds since the connector processed the most recent event (-1 if not applicable) | **Critical Alert**: Alert if > 300,000ms (5 minutes) in active systems |
+| NumberOfCommittedTransactions | Counter | Number of committed transactions processed by the connector | Monitor rate - alert if drops to 0 for > 10 minutes in active systems |
+| NumberOfEventsFiltered | Counter | Number of events filtered by include/exclude list rules | Monitor rate for unexpected increases (> 50% of total events) |
+| **Event Counters** | | | |
+| TotalNumberOfCreateEventsSeen | Counter | Total number of CREATE (INSERT) events seen by the connector | Monitor rate for business logic validation |
+| TotalNumberOfDeleteEventsSeen | Counter | Total number of DELETE events seen by the connector | Monitor rate for business logic validation |
+| TotalNumberOfEventsSeen | Counter | Total number of events seen by the connector | **High Priority**: Alert if rate drops to 0 for > 10 minutes in active systems |
+| TotalNumberOfUpdateEventsSeen | Counter | Total number of UPDATE events seen by the connector | Monitor rate for business logic validation |
+| NumberOfErroneousEvents | Counter | Number of events that caused errors during processing | **Critical Alert**: Alert if > 0 (any errors) |
+| **Snapshot Metrics** | | | |
+| RemainingTableCount | Gauge | Number of tables remaining to be processed during snapshot | Monitor for stuck snapshots (no change for > 30 minutes) |
+| RowsScanned | Counter | Number of rows scanned per table during snapshot (reported per table) | Monitor rate for progress tracking |
+| SnapshotAborted | Gauge | Whether the snapshot was aborted (1=aborted, 0=not aborted) | **Critical Alert**: Alert if value = 1 (aborted) |
+| SnapshotCompleted | Gauge | Whether the snapshot completed successfully (1=completed, 0=not completed) | Monitor for successful completion |
+| SnapshotDurationInSeconds | Gauge | Total duration of the snapshot process in seconds | Alert if exceeds expected duration (> 4 hours for large datasets) |
+| SnapshotPaused | Gauge | Whether the snapshot is currently paused (1=paused, 0=not paused) | Alert if paused unexpectedly (value = 1) |
+| SnapshotPausedDurationInSeconds | Gauge | Total time the snapshot was paused in seconds | Alert if paused > 1800 seconds (30 minutes) |
+| SnapshotRunning | Gauge | Whether a snapshot is currently running (1=running, 0=not running) | Monitor for unexpected state changes |
+| TotalTableCount | Gauge | Total number of tables included in the snapshot | Informational - use for progress calculation |
+
+{{< note >}}
+Many metrics include context labels that specify the phase (`snapshot` or `streaming`), database name, and other contextual information. Metrics with a value of `-1` typically indicate that the measurement is not applicable in the current state.
+{{< /note >}}
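
The queue and lag thresholds in the collector metrics table can be sketched as simple checks. This is a minimal illustration, not part of the commit or of RDI: the function names and the `NOT_APPLICABLE` constant are hypothetical, and the defaults (20% free capacity, 60,000 ms lag) are taken from the table's recommendations. It assumes the gauge values have already been read from your monitoring system.

```python
from typing import Optional

# Sentinel the table describes for gauges that are "not applicable" in the current state.
NOT_APPLICABLE = -1


def queue_free_fraction(remaining_capacity: float, total_capacity: float) -> float:
    """Return QueueRemainingCapacity as a fraction of QueueTotalCapacity."""
    if total_capacity <= 0:
        raise ValueError("total_capacity must be positive")
    return remaining_capacity / total_capacity


def queue_alert(remaining_capacity: float, total_capacity: float,
                min_free_fraction: float = 0.20) -> bool:
    """True when less than min_free_fraction of the queue is still free."""
    return queue_free_fraction(remaining_capacity, total_capacity) < min_free_fraction


def lag_alert(ms_behind_source: float, threshold_ms: float = 60_000) -> Optional[bool]:
    """Return None when the gauge is -1 (not applicable), else compare to the threshold."""
    if ms_behind_source == NOT_APPLICABLE:
        return None
    return ms_behind_source > threshold_ms
```

Returning `None` for the `-1` sentinel keeps "no measurement yet" distinct from "healthy", so a dashboard or alert rule does not treat an idle connector as being in sync.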
 
 ## Stream processor metrics
 
@@ -55,34 +98,59 @@ RDI reports metrics during the two main phases of the ingest pipeline, the *snap
 phase and the *change data capture (CDC)* phase. (See the
 [pipeline lifecycle]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines" >}})
 docs for more information). The table below shows the full set of metrics that
-RDI reports.
-
-| Metric | Phase |
-|:-- |:-- |
-| CapturedTables | Both |
-| Connected | CDC |
-| LastEvent | Both |
-| LastTransactionId | CDC |
-| MilliSecondsBehindSource | CDC |
-| MilliSecondsSinceLastEvent | Both |
-| NumberOfCommittedTransactions | CDC |
-| NumberOfEventsFiltered | Both |
-| QueueRemainingCapacity | Both |
-| QueueTotalCapacity | Both |
-| RemainingTableCount | Snapshot |
-| RowsScanned | Snapshot |
-| SnapshotAborted | Snapshot |
-| SnapshotCompleted | Snapshot |
-| SnapshotDurationInSeconds | Snapshot |
-| SnapshotPaused | Snapshot |
-| SnapshotPausedDurationInSeconds | Snapshot |
-| SnapshotRunning | Snapshot |
-| SourceEventPosition | CDC |
-| TotalNumberOfCreateEventsSeen | CDC |
-| TotalNumberOfDeleteEventsSeen | CDC |
-| TotalNumberOfEventsSeen | Both |
-| TotalNumberOfUpdateEventsSeen | CDC |
-| TotalTableCount | Snapshot |
+RDI reports with their descriptions.
+
+| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
+|-------------|-------------|--------------------|-----------------------|
+| `incoming_records_total` | Counter | Total number of incoming records processed by the system | **High Priority**: Alert if rate drops to 0 for > 10 minutes in active systems |
+| `incoming_records_created` | Gauge | Timestamp when the incoming records counter was created | Informational - no alerting needed |
+| `processed_records_total` | Counter | Total number of records that have been successfully processed | Monitor processing rate - alert if significantly slower than incoming rate |
+| `rejected_records_total` | Counter | Total number of records that were rejected during processing | **Critical Alert**: Alert if > 0 (any rejections indicate data quality issues) |
+| `filtered_records_total` | Counter | Total number of records that were filtered out during processing | Monitor rate - alert if > 50% of incoming records are filtered |
+| `rdi_engine_state` | Gauge | Current state of the RDI engine with labels for `state` (e.g., STARTED, RUNNING) and `sync_mode` (e.g., SNAPSHOT, STREAMING) | **Critical Alert**: Alert if state != "RUNNING" for > 5 minutes |
+| `rdi_version_info` | Gauge | Version information for RDI components with labels for `cli` and `engine` versions | Informational - use for version tracking |
+| `monitor_time_elapsed_total` | Counter | Total time elapsed (in seconds) since monitoring started | Informational - use for uptime tracking |
+| `monitor_time_elapsed_created` | Gauge | Timestamp when the monitor time elapsed counter was created | Informational - no alerting needed |
+| `rdi_incoming_entries` | Gauge | Count of incoming events by `data_source` and `operation` type (pending, inserted, updated, deleted, filtered, rejected) | **High Priority**: Alert if "rejected" > 0 or "pending" accumulates without processing |
+| `rdi_stream_event_latency_ms` | Gauge | Latency in milliseconds of the oldest event in each data stream, labeled by `data_source` | **High Priority**: Alert if > 60,000ms (1 minute) for real-time use cases |
+
+{{< note >}}
+**Additional information about stream processor metrics:**
+
+- The `rdi_` prefix comes from the Kubernetes namespace where RDI is installed. For VM installations, the prefix is always `rdi_`.
+- Metrics with the `_created` suffix are generated automatically by the Prometheus client library for counters. Each one is exposed as a gauge that records when its counter was first created.
+- The `rdi_incoming_entries` metric provides a detailed breakdown by operation type for each data source.
+- The `rdi_stream_event_latency_ms` metric helps monitor data freshness and processing delays.
+{{< /note >}}
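
As an illustration of the `rdi_incoming_entries` recommendation, the sketch below scans Prometheus text-format samples and reports data sources with a non-zero `rejected` count. It is a simplified example, not RDI code: the function names are hypothetical, the parser ignores optional timestamps and escaped characters in label values, and only the metric and label names (`rdi_incoming_entries`, `data_source`, `operation`) come from the table.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# Simplified matcher for "name{label="value",...} value" lines (no timestamps).
_SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> List[Sample]:
    """Parse metric samples, skipping blank lines and '#' comment lines."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue  # line does not fit the simplified grammar
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def rejected_by_source(samples: List[Sample]) -> Dict[str, float]:
    """Map each data_source to its rejected count, keeping only counts above zero."""
    rejected: Dict[str, float] = {}
    for name, labels, value in samples:
        if (name == "rdi_incoming_entries"
                and labels.get("operation") == "rejected"
                and value > 0):
            rejected[labels.get("data_source", "")] = value
    return rejected
```

In practice an alert rule in your monitoring system would express the same condition. The sketch only shows which labels the check depends on.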
+
+## Recommended alerting strategy
+
+Based on operational experience, the following alerts are recommended, grouped by how quickly they need a response:
+
+### Critical alerts (immediate response required)
+- **`Connected = 0`**: Database connectivity lost
+- **`NumberOfErroneousEvents > 0`**: Data processing errors occurring
+- **`rejected_records_total > 0`**: Records being rejected (data quality issues)
+- **`SnapshotAborted = 1`**: Snapshot process failed
+- **`rdi_engine_state != "RUNNING"`**: RDI engine not in the expected state for more than 5 minutes
+
+### High priority alerts (response within 15 minutes)
+- **`MilliSecondsBehindSource > 60000`**: Replication lag exceeding 1 minute
+- **`MilliSecondsSinceLastEvent > 300000`**: No events processed for 5+ minutes
+- **`QueueRemainingCapacity` < 20% of `QueueTotalCapacity`**: Queue capacity critically low
+- **`rdi_stream_event_latency_ms > 60000`**: Event processing latency too high
+- **`TotalNumberOfEventsSeen` rate = 0**: No events flowing for 10+ minutes
+
+### Medium priority alerts (response within 1 hour)
+- **Queue utilization > 80%** (`CurrentQueueSizeInBytes` relative to `MaxQueueSizeInBytes`): Approaching capacity limits
+- **Snapshot duration > expected baseline**: Performance degradation
+- **High filtering rate (> 50%)**: Potential configuration issues
+
+### Monitoring best practices
+- Set up alerting rules in your monitoring system (Prometheus Alertmanager, Grafana, etc.)
+- Use the `rate()` function on counter metrics to detect changes in processing patterns
+- Establish baseline values for your specific workload before setting thresholds
+- Consider business hours and maintenance windows when configuring alert schedules
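
The advice to apply a rate to counters, and the "rate = 0 for 10+ minutes" alerts, can be illustrated with a small sketch. This is a simplified stand-in for what a monitoring system's rate function does, not Prometheus's exact algorithm (it does no extrapolation to window boundaries), and the function names are hypothetical. It assumes samples are `(timestamp_seconds, counter_value)` pairs sorted by time.

```python
from typing import List, Optional, Tuple

CounterSample = Tuple[float, float]  # (timestamp in seconds, counter value)


def counter_rate(samples: List[CounterSample]) -> Optional[float]:
    """Average per-second increase over the samples, treating any drop as a counter reset."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the increase.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


def is_stalled(samples: List[CounterSample], min_window_seconds: float = 600.0) -> bool:
    """True when the counter has not increased over a window of at least min_window_seconds."""
    rate = counter_rate(samples)
    if rate is None:
        return False
    window = samples[-1][0] - samples[0][0]
    return window >= min_window_seconds and rate == 0.0
```

Handling resets matters because a restarted process begins counting from zero again. Comparing raw counter values across a restart would otherwise look like a large negative change.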
 
 ## RDI logs
 