add shadowing metrics (#1496)

paulohtb6 · micheleRP · web-flow · commit e7f96238d9b9 · 2025-12-12T11:03:02.000-03:00
Co-authored-by: micheleRP <michele@redpanda.com>
diff --git a/modules/get-started/pages/release-notes/redpanda.adoc b/modules/get-started/pages/release-notes/redpanda.adoc
@@ -21,6 +21,8 @@ Redpanda v25.3 introduces xref:deploy:redpanda/manual/disaster-recovery/shadowin
 
 The shadow cluster operates in read-only mode while continuously receiving updates from the source cluster. During a disaster, you can failover individual topics or an entire shadow link to make resources fully writable for production traffic. See xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover-runbook.adoc[] for emergency procedures.
 
+Shadowing includes comprehensive metrics for monitoring replication health. See xref:manage:disaster-recovery/shadowing/monitor.adoc[] and xref:reference:public-metrics-reference.adoc#shadow-link-metrics[Shadow Link metrics reference].
+
 == Connected client monitoring
 
 You can view details about Kafka client connections using `rpk` or the Admin API ListKafkaConnections endpoint. This allows you to view detailed information about active client connections on a cluster, and identify and troubleshoot problematic clients. For more information, see the xref:manage:cluster-maintenance/manage-throughput.adoc#view-connected-client-details[connected client details] example in the Manage Throughput guide.
@@ -51,6 +53,20 @@ You can now generate a security report for your Redpanda cluster using the link:
 
 Redpanda v25.3 implements topic identifiers using 16 byte UUIDs as proposed in https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers[KIP-516^].
 
+== Shadowing metrics
+
+Redpanda v25.3 introduces comprehensive xref:reference:public-metrics-reference.adoc#shadow-link-metrics[Shadowing metrics] for monitoring disaster recovery replication:
+
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_client_errors[`redpanda_shadow_link_client_errors`] - Track Kafka client errors during shadow link operations
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_lag[`redpanda_shadow_link_shadow_lag`] - Monitor replication lag between source and shadow partitions
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_topic_state[`redpanda_shadow_link_shadow_topic_state`] - Track shadow topic state distribution across links
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_fetched[`redpanda_shadow_link_total_bytes_fetched`] - Monitor data transfer volume from source cluster
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_written[`redpanda_shadow_link_total_bytes_written`] - Track data written to shadow cluster
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_fetched[`redpanda_shadow_link_total_records_fetched`] - Monitor total records fetched from source cluster
+* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_written[`redpanda_shadow_link_total_records_written`] - Track total records written to shadow cluster
+
+For monitoring guidance and alert recommendations, see xref:manage:disaster-recovery/shadowing/monitor.adoc[].
+
 == New commands
 
 Redpanda v25.3 introduces the following xref:reference:rpk/rpk-shadow/rpk-shadow.adoc[`rpk shadow`] commands for managing Redpanda shadow links:
diff --git a/modules/manage/pages/disaster-recovery/shadowing/monitor.adoc b/modules/manage/pages/disaster-recovery/shadowing/monitor.adoc
@@ -56,36 +56,36 @@ Shadowing provides comprehensive metrics to track replication performance and he
 |===
 |Metric |Type |Description
 
-|`redpanda_shadow_link_shadow_lag`
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_client_errors[`redpanda_shadow_link_client_errors`]
+|Counter
+|Total number of errors encountered by the Kafka client during shadow link operations. Monitor by `shadow_link_name` to identify connection issues, authentication failures, or other client-side problems.
+
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_lag[`redpanda_shadow_link_shadow_lag`]
 |Gauge
 |The lag of the shadow partition against the source partition, calculated as source partition LSO (Last Stable Offset) minus shadow partition HWM (High Watermark). Monitor by `shadow_link_name`, `topic`, and `partition` to understand replication lag for each partition.
 
-|`redpanda_shadow_link_total_bytes_fetched`
-|Count
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_fetched[`redpanda_shadow_link_total_bytes_fetched`]
+|Counter
 |The total number of bytes fetched by a sharded replicator (bytes received by the client). Labeled by `shadow_link_name` and `shard` to track data transfer volume from the source cluster.
 
-|`redpanda_shadow_link_total_bytes_written`
-|Count
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_written[`redpanda_shadow_link_total_bytes_written`]
+|Counter
 |The total number of bytes written by a sharded replicator (bytes written to the `write_at_offset_stm`). Uses `shadow_link_name` and `shard` labels to monitor data written to the shadow cluster.
 
-|`redpanda_shadow_link_client_errors`
-|Count
-|The number of errors seen by the client. Track by `shadow_link_name` and `shard` to identify connection or protocol issues between clusters.
-
-|`redpanda_shadow_link_shadow_topic_state`
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_topic_state[`redpanda_shadow_link_shadow_topic_state`]
 |Gauge
 |Number of shadow topics in the respective states. Labeled by `shadow_link_name` and `state` to monitor topic state distribution across your shadow links.
 
-|`redpanda_shadow_link_total_records_fetched`
-|Count
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_fetched[`redpanda_shadow_link_total_records_fetched`]
+|Counter
 |The total number of records fetched by a sharded replicator (records received by the client). Monitor by `shadow_link_name` and `shard` to track record throughput from the source.
 
-|`redpanda_shadow_link_total_records_written`
-|Count
+|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_written[`redpanda_shadow_link_total_records_written`]
+|Counter
 |The total number of records written by a sharded replicator (records written to the `write_at_offset_stm`). Uses `shadow_link_name` and `shard` labels to monitor record throughput to the shadow cluster.
 |===
 
-See also: xref:reference:public-metrics-reference.adoc[]
+For detailed descriptions of each metric, including usage examples and label definitions, see xref:reference:public-metrics-reference.adoc#shadow-link-metrics[Shadow Link metrics reference].
 
 == Monitoring best practices
 
@@ -106,8 +106,7 @@ rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
 
 Configure monitoring alerts for the following conditions, which indicate problems with Shadowing:
 
-* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
-* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
+* **High replication lag**: When xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_lag[`redpanda_shadow_link_shadow_lag`] exceeds your recovery point objective (RPO) requirements
 * **Topic state changes**: When topics move to `FAULTED` state
 * **Task failures**: When replication tasks enter `FAULTED` or `NOT_RUNNING` states
 * **Throughput drops**: When the rate of bytes or records fetched drops significantly
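
As a rough illustration of the high-replication-lag condition in the alert list above, the following Python sketch scans Prometheus text-format output for `redpanda_shadow_link_shadow_lag` samples and reports partitions above a threshold. Only the metric name and its three labels (`shadow_link_name`, `topic`, `partition`) come from the reference page in this change; the sample scrape text, the `dr-link` name, the `lagging_partitions` helper, and the threshold of 1000 offsets are invented for illustration. The parser is deliberately minimal and assumes label values contain no `}` or escaped quotes.

```python
import re

# Matches one sample line of Prometheus text exposition: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

METRIC = "redpanda_shadow_link_shadow_lag"


def lagging_partitions(metrics_text, max_lag):
    """Return (link, topic, partition, lag) tuples for samples above max_lag."""
    hits = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        lag = float(match.group("value"))
        if lag > max_lag:
            hits.append(
                (
                    labels.get("shadow_link_name"),
                    labels.get("topic"),
                    labels.get("partition"),
                    lag,
                )
            )
    return hits


# Hypothetical scrape output, invented for illustration only.
SAMPLE = """\
# TYPE redpanda_shadow_link_shadow_lag gauge
redpanda_shadow_link_shadow_lag{shadow_link_name="dr-link",topic="orders",partition="0"} 12
redpanda_shadow_link_shadow_lag{shadow_link_name="dr-link",topic="orders",partition="1"} 4500
"""

if __name__ == "__main__":
    for link, topic, partition, lag in lagging_partitions(SAMPLE, max_lag=1000):
        print(f"{link} {topic}/{partition} lag={lag:.0f}")
```

Because the metric is defined as an offset difference (source LSO minus shadow HWM), the threshold here is an offset count. How that maps to a time-based RPO depends on the produce rate of each topic and is left as an assumption for the reader to settle.
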
diff --git a/modules/reference/pages/public-metrics-reference.adoc b/modules/reference/pages/public-metrics-reference.adoc
@@ -3418,6 +3418,106 @@ ifdef::env-cloud[]
 *Available in Serverless*: No
 endif::[]
 
+---
+
+== Shadow link metrics
+
+[[redpanda_shadow_link_shadow_lag]]
+=== redpanda_shadow_link_shadow_lag
+
+The lag of the shadow partition against the source partition, calculated as source partition last stable offset (LSO) minus shadow partition high watermark (HWM). Monitor this metric to understand replication lag for each partition and ensure your recovery point objective (RPO) requirements are being met.
+
+*Type*: gauge
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+- `topic` - Topic name
+- `partition` - Partition identifier
+
+---
+
+[[redpanda_shadow_link_shadow_topic_state]]
+=== redpanda_shadow_link_shadow_topic_state
+
+Number of shadow topics in the respective states. Monitor this metric to track the health and status distribution of shadow topics across your shadow links.
+
+*Type*: gauge
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+- `state` - Topic state (active, failed, paused, failing_over, failed_over, promoting, promoted)
+
+---
+
+[[redpanda_shadow_link_client_errors]]
+=== redpanda_shadow_link_client_errors
+
+Total number of errors encountered by the Kafka client during shadow link operations. Monitor this metric to identify connection issues, authentication failures, or other client-side problems affecting shadow link replication.
+
+*Type*: counter
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+
+---
+
+[[redpanda_shadow_link_total_bytes_fetched]]
+=== redpanda_shadow_link_total_bytes_fetched
+
+Total number of bytes fetched by a sharded replicator (bytes received by the client). Use this metric to track data transfer volume from the source cluster.
+
+*Type*: counter
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+- `shard` - Shard identifier
+
+---
+
+[[redpanda_shadow_link_total_bytes_written]]
+=== redpanda_shadow_link_total_bytes_written
+
+Total number of bytes written by a sharded replicator (bytes written to the `write_at_offset_stm`). Use this metric to monitor data written to the shadow cluster.
+
+*Type*: counter
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+- `shard` - Shard identifier
+
+---
+
+[[redpanda_shadow_link_total_records_fetched]]
+=== redpanda_shadow_link_total_records_fetched
+
+Total number of records fetched by a sharded replicator (records received by the client). Monitor this metric to track record throughput from the source cluster.
+
+*Type*: counter
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+- `shard` - Shard identifier
+
+---
+
+[[redpanda_shadow_link_total_records_written]]
+=== redpanda_shadow_link_total_records_written
+
+Total number of records written by a sharded replicator (records written to the `write_at_offset_stm`). Use this metric to monitor record throughput to the shadow cluster.
+
+*Type*: counter
+
+*Labels*:
+
+- `shadow_link_name` - Name of the shadow link
+- `shard` - Shard identifier
+
 == Related topics
 
 * xref:manage:monitoring.adoc[Learn how to monitor Redpanda]
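
The byte and record metrics added in this change are counters, so their raw values only ever grow; a throughput-drop check needs the rate between two samples instead. The sketch below shows one way that arithmetic could look in Python. The helper names `counter_rate` and `throughput_dropped`, the reset handling, and the default 50% drop threshold are assumptions made for illustration and are not part of Redpanda. In practice a monitoring system's own rate function (for example, PromQL `rate()`) would normally do this work.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second rate between two samples of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards. Assume it was reset to zero between
        # samples and count only what has accumulated since the reset.
        delta = curr_value
    return delta / interval_seconds


def throughput_dropped(baseline_rate, current_rate, drop_fraction=0.5):
    """True when current_rate is below (1 - drop_fraction) of baseline_rate."""
    return current_rate < baseline_rate * (1.0 - drop_fraction)


if __name__ == "__main__":
    # Hypothetical samples of redpanda_shadow_link_total_bytes_fetched
    # taken 30 seconds apart; the numbers are invented.
    baseline = counter_rate(1_000, 4_000, 30)
    current = counter_rate(4_000, 5_200, 30)
    print(baseline, current, throughput_dropped(baseline, current))
```

Summing the per-`shard` series for a given `shadow_link_name` before computing the rate would give a link-wide figure, since the reference lists `shard` as a label on these counters.
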