DOC-5338: RDI enhance observability page with more metrics information

ZdravkoDonev-redis · ZdravkoDonev-redis · commit d6be6771c8ef · 2025-06-13T15:10:32.000+03:00
diff --git a/content/integrate/redis-data-integration/observability.md b/content/integrate/redis-data-integration/observability.md
@@ -46,6 +46,49 @@ These metrics are divided into three groups:
 - **Pipeline state**: metrics about the pipeline mode and connectivity
 - **Data flow counters**: counters for data breakdown per source table
 - **Processing performance**: processing speed of RDI micro batches
+
+The following table lists all collector metrics and their descriptions:
+
+| Metric | Type | Description | Alerting Recommendations |
+|:--|:--|:--|:--|
+| **Schema History Metrics** | | | |
+| ChangesApplied | Counter | Total number of schema changes applied during recovery and runtime | Monitor for unexpected spikes (rate > 10/hour) |
+| ChangesRecovered | Counter | Number of changes that were read during the recovery phase | Alert if recovery fails (value stops increasing during recovery) |
+| MilliSecondsSinceLastAppliedChange | Gauge | Number of milliseconds since the last change was applied | Alert if > 300,000ms (5 minutes) during active schema changes |
+| MilliSecondsSinceLastRecoveredChange | Gauge | Number of milliseconds since the last change was recovered from the history store | Alert if > 600,000ms (10 minutes) during recovery |
+| RecoveryStartTime | Gauge | Time in epoch milliseconds when recovery started (-1 if not applicable) | Monitor for prolonged recovery (> 30 minutes) |
+| **Connection and State Metrics** | | | |
+| Connected | Gauge | Whether the connector is currently connected to the database (1=connected, 0=disconnected) | **Critical Alert**: Alert if value = 0 (disconnected) |
+| **Queue Metrics** | | | |
+| CurrentQueueSizeInBytes | Gauge | Current size of the connector's internal queue in bytes | Alert if > 80% of MaxQueueSizeInBytes |
+| MaxQueueSizeInBytes | Gauge | Maximum configured size of the connector's internal queue in bytes | Informational - use for capacity planning |
+| QueueRemainingCapacity | Gauge | Remaining capacity of the connector's internal queue | **High Priority**: Alert if < 20% of total capacity |
+| QueueTotalCapacity | Gauge | Total capacity of the connector's internal queue | Informational - use for capacity planning |
+| **Streaming Performance Metrics** | | | |
+| MilliSecondsBehindSource | Gauge | Number of milliseconds the connector is behind the source database (-1 if not applicable) | **High Priority**: Alert if > 60,000ms (1 minute) behind source |
+| MilliSecondsSinceLastEvent | Gauge | Number of milliseconds since the connector processed the most recent event (-1 if not applicable) | **Critical Alert**: Alert if > 300,000ms (5 minutes) in active systems |
+| NumberOfCommittedTransactions | Counter | Number of committed transactions processed by the connector | Monitor rate - alert if drops to 0 for > 10 minutes in active systems |
+| NumberOfEventsFiltered | Counter | Number of events filtered by include/exclude list rules | Monitor rate for unexpected increases (> 50% of total events) |
+| **Event Counters** | | | |
+| TotalNumberOfCreateEventsSeen | Counter | Total number of CREATE (INSERT) events seen by the connector | Monitor rate for business logic validation |
+| TotalNumberOfDeleteEventsSeen | Counter | Total number of DELETE events seen by the connector | Monitor rate for business logic validation |
+| TotalNumberOfEventsSeen | Counter | Total number of events seen by the connector | **High Priority**: Alert if rate drops to 0 for > 10 minutes in active systems |
+| TotalNumberOfUpdateEventsSeen | Counter | Total number of UPDATE events seen by the connector | Monitor rate for business logic validation |
+| NumberOfErroneousEvents | Counter | Number of events that caused errors during processing | **Critical Alert**: Alert if > 0 (any errors) |
+| **Snapshot Metrics** | | | |
+| RemainingTableCount | Gauge | Number of tables remaining to be processed during snapshot | Monitor for stuck snapshots (no change for > 30 minutes) |
+| RowsScanned | Counter | Number of rows scanned per table during snapshot (reported per table) | Monitor rate for progress tracking |
+| SnapshotAborted | Gauge | Whether the snapshot was aborted (1=aborted, 0=not aborted) | **Critical Alert**: Alert if value = 1 (aborted) |
+| SnapshotCompleted | Gauge | Whether the snapshot completed successfully (1=completed, 0=not completed) | Monitor for successful completion |
+| SnapshotDurationInSeconds | Gauge | Total duration of the snapshot process in seconds | Alert if exceeds expected duration (> 4 hours for large datasets) |
+| SnapshotPaused | Gauge | Whether the snapshot is currently paused (1=paused, 0=not paused) | Alert if paused unexpectedly (value = 1) |
+| SnapshotPausedDurationInSeconds | Gauge | Total time the snapshot was paused in seconds | Alert if paused > 1800 seconds (30 minutes) |
+| SnapshotRunning | Gauge | Whether a snapshot is currently running (1=running, 0=not running) | Monitor for unexpected state changes |
+| TotalTableCount | Gauge | Total number of tables included in the snapshot | Informational - use for progress calculation |
+
+{{< note >}}
+Many metrics include context labels that specify the phase (`snapshot` or `streaming`), database name, and other contextual information. Metrics with a value of `-1` typically indicate that the measurement is not applicable in the current state.
+{{< /note >}}
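
The queue thresholds and the `-1` sentinel described above can be sketched as a pair of small helpers. This is only an illustration, not part of RDI: the function names and the sample values are hypothetical, and it assumes you have already fetched the latest `QueueRemainingCapacity` and `QueueTotalCapacity` values from your monitoring system.

```python
from typing import Optional

NOT_APPLICABLE = -1  # sentinel meaning "measurement not applicable right now"


def applicable(value: float) -> Optional[float]:
    """Return None for the -1 sentinel, otherwise the value unchanged."""
    return None if value == NOT_APPLICABLE else value


def queue_remaining_fraction(remaining: float, total: float) -> Optional[float]:
    """Fraction of the queue that is still free, or None if it cannot be computed."""
    if total <= 0:
        return None
    return remaining / total


def queue_alert(remaining: float, total: float, threshold: float = 0.20) -> bool:
    """True when less than `threshold` of the total queue capacity remains."""
    fraction = queue_remaining_fraction(remaining, total)
    return fraction is not None and fraction < threshold
```

For example, `queue_alert(1000, 8192)` reports an alert because only about 12% of the capacity remains, while `applicable(-1)` returns `None` so that a "not applicable" reading is never compared against a threshold.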
   
 ## Stream processor metrics
 
@@ -55,34 +98,59 @@ RDI reports metrics during the two main phases of the ingest pipeline, the *snap
 phase and the *change data capture (CDC)* phase. (See the
 [pipeline lifecycle]({{< relref "/integrate/redis-data-integration/data-pipelines/data-pipelines" >}})
 docs for more information). The table below shows the full set of metrics that
-RDI reports. 
-
-| Metric | Phase |
-|:-- |:-- |
-| CapturedTables | Both |
-| Connected | CDC |
-| LastEvent | Both |
-| LastTransactionId | CDC |
-| MilliSecondsBehindSource | CDC |
-| MilliSecondsSinceLastEvent | Both |
-| NumberOfCommittedTransactions | CDC |
-| NumberOfEventsFiltered | Both |
-| QueueRemainingCapacity | Both |
-| QueueTotalCapacity | Both |
-| RemainingTableCount | Snapshot |
-| RowsScanned | Snapshot |
-| SnapshotAborted | Snapshot |
-| SnapshotCompleted | Snapshot |
-| SnapshotDurationInSeconds | Snapshot |
-| SnapshotPaused | Snapshot |
-| SnapshotPausedDurationInSeconds | Snapshot |
-| SnapshotRunning | Snapshot |
-| SourceEventPosition | CDC |
-| TotalNumberOfCreateEventsSeen | CDC |
-| TotalNumberOfDeleteEventsSeen | CDC |
-| TotalNumberOfEventsSeen | Both |
-| TotalNumberOfUpdateEventsSeen | CDC |
-| TotalTableCount | Snapshot |
+RDI reports, with their descriptions and alerting recommendations.
+
+| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
+|-------------|-------------|--------------------|-----------------------|
+| `incoming_records_total` | Counter | Total number of incoming records processed by the system | **High Priority**: Alert if rate drops to 0 for > 10 minutes in active systems |
+| `incoming_records_created` | Gauge | Timestamp when the incoming records counter was created | Informational - no alerting needed |
+| `processed_records_total` | Counter | Total number of records that have been successfully processed | Monitor processing rate - alert if significantly slower than incoming rate |
+| `rejected_records_total` | Counter | Total number of records that were rejected during processing | **Critical Alert**: Alert if > 0 (any rejections indicate data quality issues) |
+| `filtered_records_total` | Counter | Total number of records that were filtered out during processing | Monitor rate - alert if > 50% of incoming records are filtered |
+| `rdi_engine_state` | Gauge | Current state of the RDI engine with labels for `state` (e.g., STARTED, RUNNING) and `sync_mode` (e.g., SNAPSHOT, STREAMING) | **Critical Alert**: Alert if state != "RUNNING" for > 5 minutes |
+| `rdi_version_info` | Gauge | Version information for RDI components with labels for `cli` and `engine` versions | Informational - use for version tracking |
+| `monitor_time_elapsed_total` | Counter | Total time elapsed (in seconds) since monitoring started | Informational - use for uptime tracking |
+| `monitor_time_elapsed_created` | Gauge | Timestamp when the monitor time elapsed counter was created | Informational - no alerting needed |
+| `rdi_incoming_entries` | Gauge | Count of incoming events by `data_source` and `operation` type (pending, inserted, updated, deleted, filtered, rejected) | **High Priority**: Alert if "rejected" > 0 or "pending" accumulates without processing |
+| `rdi_stream_event_latency_ms` | Gauge | Latency in milliseconds of the oldest event in each data stream, labeled by `data_source` | **High Priority**: Alert if > 60,000ms (1 minute) for real-time use cases |
+
+{{< note >}}
+**Additional information about stream processor metrics:**
+
+- The `rdi_` prefix comes from the Kubernetes namespace where RDI is installed. For VM installations, the prefix is always `rdi_`.
+- Metrics with the `_created` suffix are generated automatically by the Prometheus client library for counters and record when each counter was first created.
+- The `rdi_incoming_entries` metric provides detailed breakdown by operation type for each data source.
+- The `rdi_stream_event_latency_ms` metric helps monitor data freshness and processing delays.
+{{< /note >}}
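
As a rough sketch of how the `rdi_incoming_entries` breakdown might be checked, the snippet below parses a few lines of Prometheus text exposition output and sums the `rejected` count per `data_source`. The sample text, the data source names, and the helper names are hypothetical; the parser is deliberately minimal and is not a full implementation of the exposition format.

```python
import re
from collections import defaultdict

# Matches: name{labels} value [timestamp]   (labels and timestamp are optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def rejected_by_source(text):
    """Sum rdi_incoming_entries values with operation="rejected" per data_source."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "rdi_incoming_entries" and labels.get("operation") == "rejected":
            totals[labels.get("data_source", "")] += value
    return dict(totals)


# Hypothetical scrape output, for illustration only.
EXAMPLE = """\
# TYPE rdi_incoming_entries gauge
rdi_incoming_entries{data_source="inventory.orders",operation="inserted"} 120
rdi_incoming_entries{data_source="inventory.orders",operation="rejected"} 2
rdi_incoming_entries{data_source="inventory.customers",operation="rejected"} 0
"""
```

Calling `rejected_by_source(EXAMPLE)` gives one total per data source, so any source with a value above zero is a candidate for the "rejected" alert recommended in the table.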
+
+## Recommended alerting strategy
+
+Based on operational experience, the following metrics warrant alerts, grouped by how quickly you should respond:
+
+### Critical alerts (immediate response required)
+- **`Connected = 0`**: Database connectivity lost
+- **`NumberOfErroneousEvents > 0`**: Data processing errors occurring
+- **`rejected_records_total > 0`**: Records being rejected (data quality issues)
+- **`SnapshotAborted = 1`**: Snapshot process failed
+- **`rdi_engine_state != "RUNNING"`**: RDI engine not in expected state
+
+### High priority alerts (response within 15 minutes)
+- **`MilliSecondsBehindSource > 60000`**: Replication lag exceeding 1 minute
+- **`MilliSecondsSinceLastEvent > 300000`**: No events processed for 5+ minutes
+- **`QueueRemainingCapacity` < 20% of `QueueTotalCapacity`**: Queue capacity critically low
+- **`rdi_stream_event_latency_ms > 60000`**: Event processing latency too high
+- **`TotalNumberOfEventsSeen` rate = 0**: No events flowing for 10+ minutes
+
+### Medium priority alerts (response within 1 hour)
+- **Queue utilization > 80%**: Approaching capacity limits
+- **Snapshot duration > expected baseline**: Performance degradation
+- **High filtering rate (> 50%)**: Potential configuration issues
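
The critical and high priority tiers can be expressed as a small rule table. The sketch below assumes a hypothetical `metrics` dict that maps metric names to their latest values. It covers only the simple threshold rules from the lists above and skips any metric that is missing or reports the `-1` "not applicable" sentinel; in practice you would more likely encode these rules in your monitoring system.

```python
# (severity, metric name, predicate, reason); thresholds are taken from the lists above.
RULES = [
    ("critical", "Connected", lambda v: v == 0, "database connectivity lost"),
    ("critical", "NumberOfErroneousEvents", lambda v: v > 0, "data processing errors"),
    ("critical", "rejected_records_total", lambda v: v > 0, "records being rejected"),
    ("critical", "SnapshotAborted", lambda v: v == 1, "snapshot process failed"),
    ("high", "MilliSecondsBehindSource", lambda v: v > 60000, "replication lag over 1 minute"),
    ("high", "MilliSecondsSinceLastEvent", lambda v: v > 300000, "no events for 5+ minutes"),
    ("high", "rdi_stream_event_latency_ms", lambda v: v > 60000, "event latency too high"),
]


def classify(metrics):
    """Return (severity, metric, reason) for every rule that the snapshot violates."""
    findings = []
    for severity, name, predicate, reason in RULES:
        value = metrics.get(name)
        if value is None or value == -1:  # missing or "not applicable"
            continue
        if predicate(value):
            findings.append((severity, name, reason))
    return findings
```

For instance, a snapshot with `Connected` at `0` and `MilliSecondsBehindSource` at `75000` produces one critical and one high priority finding, while a `MilliSecondsSinceLastEvent` of `-1` is ignored.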
+
+### Monitoring best practices
+- Set up alerting rules in your monitoring system (Prometheus Alertmanager, Grafana, etc.)
+- Use the `rate()` function on counter metrics to detect changes in processing patterns
+- Establish baseline values for your specific workload before setting thresholds
+- Consider business hours and maintenance windows when configuring alert schedules
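
To illustrate why counters are usually read through a rate rather than as raw values, here is a simplified sketch of a per-second rate over a window of samples. It is an approximation for explanation only, not Prometheus's actual `rate()` algorithm (which also extrapolates to the window boundaries). The function names and the sample data are hypothetical.

```python
def counter_rate(samples):
    """Average per-second increase across (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset, so the value after the
    reset counts as the increase for that step.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


def is_stalled(samples, min_window_seconds=600):
    """True when the counter has not increased over at least the given window."""
    if len(samples) < 2:
        return False
    elapsed = samples[-1][0] - samples[0][0]
    return elapsed >= min_window_seconds and counter_rate(samples) == 0
```

With samples of a counter such as `TotalNumberOfEventsSeen` taken 60 seconds apart, `counter_rate([(0, 100), (60, 160), (120, 220)])` yields one event per second, and `is_stalled` flags a counter that stays flat for the 10 minute window used in the recommendations above.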
 
 ## RDI logs