Commit e7f9623

paulohtb6 and micheleRP authored
add shadowing metrics (#1496)
Co-authored-by: micheleRP <[email protected]>
1 parent 49b8877 commit e7f9623

3 files changed, with 132 additions and 17 deletions


modules/get-started/pages/release-notes/redpanda.adoc

Lines changed: 16 additions & 0 deletions
@@ -21,6 +21,8 @@ Redpanda v25.3 introduces xref:deploy:redpanda/manual/disaster-recovery/shadowin
2121

2222
The shadow cluster operates in read-only mode while continuously receiving updates from the source cluster. During a disaster, you can failover individual topics or an entire shadow link to make resources fully writable for production traffic. See xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover-runbook.adoc[] for emergency procedures.
2323

24+
Shadowing includes comprehensive metrics for monitoring replication health. See xref:manage:disaster-recovery/shadowing/monitor.adoc[] and xref:reference:public-metrics-reference.adoc#shadow-link-metrics[Shadow Link metrics reference].
25+
2426
== Connected client monitoring
2527

2628
You can view details about Kafka client connections using `rpk` or the Admin API ListKafkaConnections endpoint. This allows you to view detailed information about active client connections on a cluster, and identify and troubleshoot problematic clients. For more information, see the xref:manage:cluster-maintenance/manage-throughput.adoc#view-connected-client-details[connected client details] example in the Manage Throughput guide.
@@ -51,6 +53,20 @@ You can now generate a security report for your Redpanda cluster using the link:
5153

5254
Redpanda v25.3 implements topic identifiers using 16-byte UUIDs as proposed in https://cwiki.apache.org/confluence/display/KAFKA/KIP-516%3A+Topic+Identifiers[KIP-516^].
5355

56+
== Shadowing metrics
57+
58+
Redpanda v25.3 introduces comprehensive xref:reference:public-metrics-reference.adoc#shadow-link-metrics[Shadowing metrics] for monitoring disaster recovery replication:
59+
60+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_client_errors[`redpanda_shadow_link_client_errors`] - Track Kafka client errors during shadow link operations
61+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_lag[`redpanda_shadow_link_shadow_lag`] - Monitor replication lag between source and shadow partitions
62+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_topic_state[`redpanda_shadow_link_shadow_topic_state`] - Track shadow topic state distribution across links
63+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_fetched[`redpanda_shadow_link_total_bytes_fetched`] - Monitor data transfer volume from the source cluster
64+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_written[`redpanda_shadow_link_total_bytes_written`] - Track data written to the shadow cluster
65+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_fetched[`redpanda_shadow_link_total_records_fetched`] - Monitor total records fetched from the source cluster
66+
* xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_written[`redpanda_shadow_link_total_records_written`] - Track total records written to the shadow cluster
67+
68+
For monitoring guidance and alert recommendations, see xref:manage:disaster-recovery/shadowing/monitor.adoc[].
69+
5470
== New commands
5571

5672
Redpanda v25.3 introduces the following xref:reference:rpk/rpk-shadow/rpk-shadow.adoc[`rpk shadow`] commands for managing Redpanda shadow links:

modules/manage/pages/disaster-recovery/shadowing/monitor.adoc

Lines changed: 16 additions & 17 deletions
@@ -56,36 +56,36 @@ Shadowing provides comprehensive metrics to track replication performance and he
5656
|===
5757
|Metric |Type |Description
5858

59-
|`redpanda_shadow_link_shadow_lag`
59+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_client_errors[`redpanda_shadow_link_client_errors`]
60+
|Counter
61+
|Total number of errors encountered by the Kafka client during shadow link operations. Monitor by `shadow_link_name` to identify connection issues, authentication failures, or other client-side problems.
62+
63+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_lag[`redpanda_shadow_link_shadow_lag`]
6064
|Gauge
6165
|The lag of the shadow partition against the source partition, calculated as source partition LSO (Last Stable Offset) minus shadow partition HWM (High Watermark). Monitor by `shadow_link_name`, `topic`, and `partition` to understand replication lag for each partition.
6266

63-
|`redpanda_shadow_link_total_bytes_fetched`
64-
|Count
67+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_fetched[`redpanda_shadow_link_total_bytes_fetched`]
68+
|Counter
6569
|The total number of bytes fetched by a sharded replicator (bytes received by the client). Labeled by `shadow_link_name` and `shard` to track data transfer volume from the source cluster.
6670

67-
|`redpanda_shadow_link_total_bytes_written`
68-
|Count
71+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_bytes_written[`redpanda_shadow_link_total_bytes_written`]
72+
|Counter
6973
|The total number of bytes written by a sharded replicator (bytes written to the write_at_offset_stm). Uses `shadow_link_name` and `shard` labels to monitor data written to the shadow cluster.
7074

71-
|`redpanda_shadow_link_client_errors`
72-
|Count
73-
|The number of errors seen by the client. Track by `shadow_link_name` and `shard` to identify connection or protocol issues between clusters.
74-
75-
|`redpanda_shadow_link_shadow_topic_state`
75+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_topic_state[`redpanda_shadow_link_shadow_topic_state`]
7676
|Gauge
7777
|Number of shadow topics in the respective states. Labeled by `shadow_link_name` and `state` to monitor topic state distribution across your shadow links.
7878

79-
|`redpanda_shadow_link_total_records_fetched`
80-
|Count
79+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_fetched[`redpanda_shadow_link_total_records_fetched`]
80+
|Counter
8181
|The total number of records fetched by the sharded replicator (records received by the client). Monitor by `shadow_link_name` and `shard` to track message throughput from the source.
8282

83-
|`redpanda_shadow_link_total_records_written`
84-
|Count
83+
|xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_total_records_written[`redpanda_shadow_link_total_records_written`]
84+
|Counter
8585
|The total number of records written by a sharded replicator (records written to the write_at_offset_stm). Uses `shadow_link_name` and `shard` labels to monitor message throughput to the shadow cluster.
8686
|===
8787

88-
See also: xref:reference:public-metrics-reference.adoc[]
88+
For detailed descriptions of each metric, including usage examples and label definitions, see xref:reference:public-metrics-reference.adoc#shadow-link-metrics[Shadow Link metrics reference].
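
The counter metrics in the table only ever increase and are reported per shard, so throughput for a link is usually read as a rate summed over shards. The following is a minimal Python sketch of that calculation, not Redpanda tooling: the function name `per_link_rate`, the sample values, and the link name `dr-link` are hypothetical, and treating a decreasing counter as a reset is an assumption.

```python
from collections import defaultdict


def per_link_rate(prev, curr, interval_seconds):
    """Return the per-second rate for each shadow link, summed over shards.

    prev and curr map (shadow_link_name, shard) -> counter value taken from
    two consecutive scrapes that are interval_seconds apart.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    totals = defaultdict(float)
    for (link, shard), value in curr.items():
        # Assumption: a series missing from the earlier scrape started at 0.
        before = prev.get((link, shard), 0)
        # Assumption: a counter that went down was reset (for example by a
        # restart), so the new value is the whole increase since the reset.
        delta = value - before if value >= before else value
        totals[link] += delta
    return {link: total / interval_seconds for link, total in totals.items()}


# Illustrative samples of redpanda_shadow_link_total_bytes_fetched.
prev = {("dr-link", "0"): 1_000, ("dr-link", "1"): 4_000}
curr = {("dr-link", "0"): 7_000, ("dr-link", "1"): 10_000}
rates = per_link_rate(prev, curr, interval_seconds=60)
print(rates)
```

The same function applies to the records counters; only the unit of the result changes.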
8989

9090
== Monitoring best practices
9191

@@ -106,8 +106,7 @@ rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
106106

107107
Configure monitoring alerts for the following conditions, which indicate problems with Shadowing:
108108

109-
* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
110-
* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
109+
* **High replication lag**: When xref:reference:public-metrics-reference.adoc#redpanda_shadow_link_shadow_lag[`redpanda_shadow_link_shadow_lag`] exceeds your recovery point objective (RPO) requirements
111110
* **Topic state changes**: When topics move to `FAULTED` state
112111
* **Task failures**: When replication tasks enter `FAULTED` or `NOT_RUNNING` states
113112
* **Throughput drops**: When the rate of bytes or records fetched drops significantly
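
As a rough illustration of the high replication lag condition above, the Python sketch below reports partitions whose lag has stayed over a threshold for several consecutive readings, which avoids alerting on short spikes. It is a sketch under stated assumptions: the function name, the `min_consecutive` parameter, and the sample data are hypothetical, and because the lag gauge is an offset difference, the threshold is expressed in offsets. How that threshold maps to a time-based RPO depends on your workload.

```python
def partitions_over_lag(samples, max_lag, min_consecutive=3):
    """Return the partitions whose lag stayed above max_lag.

    samples maps (shadow_link_name, topic, partition) -> list of lag
    readings, oldest first. A partition is reported only when its last
    min_consecutive readings all exceed max_lag.
    """
    breaching = []
    for key, readings in samples.items():
        recent = readings[-min_consecutive:]
        if len(recent) == min_consecutive and all(r > max_lag for r in recent):
            breaching.append(key)
    return sorted(breaching)


# Illustrative readings of redpanda_shadow_link_shadow_lag.
samples = {
    ("dr-link", "orders", 0): [10, 5_000, 6_000, 7_000],  # sustained lag
    ("dr-link", "orders", 1): [10, 9_000, 20, 30],        # short spike only
}
alerts = partitions_over_lag(samples, max_lag=1_000)
print(alerts)
```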

modules/reference/pages/public-metrics-reference.adoc

Lines changed: 100 additions & 0 deletions
@@ -3418,6 +3418,106 @@ ifdef::env-cloud[]
34183418
*Available in Serverless*: No
34193419
endif::[]
34203420

3421+
---
3422+
3423+
== Shadow link metrics
3424+
3425+
[[redpanda_shadow_link_shadow_lag]]
3426+
=== redpanda_shadow_link_shadow_lag
3427+
3428+
The lag of the shadow partition against the source partition, calculated as source partition last stable offset (LSO) minus shadow partition high watermark (HWM). Monitor this metric to understand replication lag for each partition and ensure your recovery point objective (RPO) requirements are being met.
3429+
3430+
*Type*: gauge
3431+
3432+
*Labels*:
3433+
3434+
- `shadow_link_name` - Name of the shadow link
3435+
- `topic` - Topic name
3436+
- `partition` - Partition identifier
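
The lag calculation described above is a plain subtraction per partition. A small worked example in Python, using made-up offsets and a hypothetical helper name, might look like this:

```python
def shadow_lag(source_lso, shadow_hwm):
    """Lag in offsets: source last stable offset minus shadow high watermark."""
    return source_lso - shadow_hwm


# Illustrative (source LSO, shadow HWM) pairs keyed by partition.
offsets = {0: (10_500, 10_200), 1: (8_000, 8_000)}
lags = {partition: shadow_lag(lso, hwm) for partition, (lso, hwm) in offsets.items()}
worst = max(lags.values())
print(lags, worst)
```

Here partition 0 is 300 offsets behind the source and partition 1 is fully caught up.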
3437+
3438+
---
3439+
3440+
[[redpanda_shadow_link_shadow_topic_state]]
3441+
=== redpanda_shadow_link_shadow_topic_state
3442+
3443+
Number of shadow topics in the respective states. Monitor this metric to track the health and status distribution of shadow topics across your shadow links.
3444+
3445+
*Type*: gauge
3446+
3447+
*Labels*:
3448+
3449+
- `shadow_link_name` - Name of the shadow link
3450+
- `state` - Topic state (active, failed, paused, failing_over, failed_over, promoting, promoted)
3451+
3452+
---
3453+
3454+
[[redpanda_shadow_link_client_errors]]
3455+
=== redpanda_shadow_link_client_errors
3456+
3457+
Total number of errors encountered by the Kafka client during shadow link operations. Monitor this metric to identify connection issues, authentication failures, or other client-side problems affecting shadow link replication.
3458+
3459+
*Type*: counter
3460+
3461+
*Labels*:
3462+
3463+
- `shadow_link_name` - Name of the shadow link
3464+
3465+
---
3466+
3467+
[[redpanda_shadow_link_total_bytes_fetched]]
3468+
=== redpanda_shadow_link_total_bytes_fetched
3469+
3470+
Total number of bytes fetched by a sharded replicator (bytes received by the client). Use this metric to track data transfer volume from the source cluster.
3471+
3472+
*Type*: counter
3473+
3474+
*Labels*:
3475+
3476+
- `shadow_link_name` - Name of the shadow link
3477+
- `shard` - Shard identifier
3478+
3479+
---
3480+
3481+
[[redpanda_shadow_link_total_bytes_written]]
3482+
=== redpanda_shadow_link_total_bytes_written
3483+
3484+
Total number of bytes written by a sharded replicator (bytes written to the write_at_offset_stm). Use this metric to monitor data written to the shadow cluster.
3485+
3486+
*Type*: counter
3487+
3488+
*Labels*:
3489+
3490+
- `shadow_link_name` - Name of the shadow link
3491+
- `shard` - Shard identifier
3492+
3493+
---
3494+
3495+
[[redpanda_shadow_link_total_records_fetched]]
3496+
=== redpanda_shadow_link_total_records_fetched
3497+
3498+
Total number of records fetched by the sharded replicator (records received by the client). Monitor this metric to track message throughput from the source cluster.
3499+
3500+
*Type*: counter
3501+
3502+
*Labels*:
3503+
3504+
- `shadow_link_name` - Name of the shadow link
3505+
- `shard` - Shard identifier
3506+
3507+
---
3508+
3509+
[[redpanda_shadow_link_total_records_written]]
3510+
=== redpanda_shadow_link_total_records_written
3511+
3512+
Total number of records written by a sharded replicator (records written to the write_at_offset_stm). Use this metric to monitor message throughput to the shadow cluster.
3513+
3514+
*Type*: counter
3515+
3516+
*Labels*:
3517+
3518+
- `shadow_link_name` - Name of the shadow link
3519+
- `shard` - Shard identifier
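
To show how the names and labels listed in this section could be consumed, here is a minimal Python sketch that extracts shadow link samples from Prometheus text exposition output. The sample text is illustrative rather than captured from a cluster (real output may carry additional labels), the function name is hypothetical, and the parser is deliberately simplified: it does not handle escaped quotes in label values or trailing timestamps.

```python
import re

# Simplified patterns for lines such as: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_shadow_link_samples(text):
    """Return (name, labels, value) tuples for redpanda_shadow_link_* samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group("name").startswith("redpanda_shadow_link_"):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative input only.
TEXT = """
# TYPE redpanda_shadow_link_shadow_lag gauge
redpanda_shadow_link_shadow_lag{shadow_link_name="dr-link",topic="orders",partition="0"} 42
redpanda_shadow_link_client_errors{shadow_link_name="dr-link"} 3
some_other_metric 1
"""
samples = parse_shadow_link_samples(TEXT)
for name, labels, value in samples:
    print(name, labels, value)
```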
3520+
34213521
== Related topics
34223522

34233523
* xref:manage:monitoring.adoc[Learn how to monitor Redpanda]
