Commit 6f1a495

Add Shadowing docs for Kubernetes deployments (#1514)
1 parent e256220 commit 6f1a495

28 files changed: +2556 -371 lines changed

.github/workflows/update-property-docs.yaml

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ jobs:
 run: |
 set -euo pipefail
 TAG="${{ steps.tag.outputs.tag }}"
-CURRENT=$(grep 'latest-redpanda-tag:' antora.yml | awk '{print $2}' | tr -d '"')
+CURRENT=$(grep 'latest-redpanda-tag:' antora.yml | awk '{print $2}' | tr -d "\"'")

 echo "📄 Current latest-redpanda-tag in antora.yml: $CURRENT"
 echo "🔖 Incoming tag: $TAG"
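
The one-line change above widens the character set passed to `tr -d` so that both double and single quotes are stripped from the extracted tag. The following is a minimal Python sketch of the same extraction, not part of the commit. It assumes an `antora.yml` line of the form `latest-redpanda-tag: 'v25.3.1'`; the function name `extract_tag` and the sample tag value are hypothetical.

```python
def extract_tag(line: str, strip_chars: str = "\"'") -> str:
    """Mimic `awk '{print $2}' | tr -d <strip_chars>` on a single line."""
    fields = line.split()
    second = fields[1] if len(fields) > 1 else ""
    # Like `tr -d`, delete every occurrence of each listed character.
    return second.translate({ord(c): None for c in strip_chars})


# Hypothetical sample line; the real file content is not shown in this diff.
sample = "latest-redpanda-tag: 'v25.3.1'"

# Old behavior: only double quotes were deleted, so single quotes survived.
old = extract_tag(sample, strip_chars='"')
# New behavior: both quote styles are deleted.
new = extract_tag(sample)
print(repr(old), repr(new))
```

With the old character set, a single-quoted value keeps its quotes, so a later string comparison against the incoming tag would presumably not match even when the versions are identical.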

modules/ROOT/nav.adoc

Lines changed: 4 additions & 0 deletions
@@ -136,6 +136,10 @@
 **** xref:manage:kubernetes/security/k-audit-logging.adoc[Audit Logging]
 *** xref:manage:kubernetes/k-rack-awareness.adoc[Rack Awareness]
 *** xref:manage:kubernetes/k-remote-read-replicas.adoc[Remote Read Replicas]
+*** xref:manage:kubernetes/shadowing/index.adoc[Shadowing]
+**** xref:manage:kubernetes/shadowing/k-shadow-linking.adoc[Configure Shadowing]
+**** xref:manage:kubernetes/monitoring/k-monitor-shadowing.adoc[Monitor]
+**** xref:manage:kubernetes/shadowing/k-failover-runbook.adoc[Failover Runbook]
 *** xref:manage:kubernetes/k-manage-resources.adoc[Manage Pod Resources]
 *** xref:manage:kubernetes/k-scale-redpanda.adoc[Scale]
 *** xref:manage:kubernetes/k-nodewatcher.adoc[]

modules/get-started/pages/release-notes/helm-charts.adoc

Lines changed: 2 additions & 53 deletions
@@ -12,57 +12,6 @@ See also:
 * xref:upgrade:k-compatibility.adoc[]
 * xref:upgrade:k-rolling-upgrade.adoc[]

-== Redpanda chart v25.2.1
+== Redpanda chart v25.3.x

-link:https://github.com/redpanda-data/redpanda-operator/blob/release/v25.2.x/charts/redpanda/CHANGELOG.md[Changelog^].
-
-=== New chart-wide podTemplate field
-
-You can now use the chart-wide `podTemplate` field to control Pod attributes across all components. This field has lower precedence than `statefulset.podTemplate` and `post_install_job.podTemplate` but will be merged with them.
-
-Additionally, `podTemplate` fields now support template expressions within string fields, allowing you to use Helm templating for dynamic values:
-
-[,yaml]
-----
-podTemplate:
-  annotations:
-    "release-name": '{{ .Release.Name }}'
-----
-
-This compensates for functionality lost with the removal of fields like `extraVolumes`, while being more maintainable and less error prone.
-
-=== Improved config-watcher sidecar
-
-The config-watcher sidecar is now a dedicated Go binary that handles user management and simplifies cluster health checks. Health checks no longer fail when the sole issue is that other nodes in the cluster are unavailable.
-
-=== rpk debug bundle now works by default
-
-The chart now creates `Roles` and `RoleBindings` that satisfy the requirements for running `rpk debug bundle --namespace` from any Redpanda Pod. These permissions may be disabled by setting `rbac.rpkDebugBundle=false`.
-
-The Redpanda container now always has a Kubernetes ServiceAccount token mounted to ensure `rpk debug bundle` can be executed successfully.
-
-=== ServiceAccount creation now enabled by default
-
-The `serviceAccount.create` field now defaults to `true`. Previously, the chart used the `default` ServiceAccount and extended it with all bindings.
-
-=== Stricter schema validation
-
-Any unexpected values now result in a validation error. Previously, unexpected values would have been silently ignored.
-
-Ensure your Helm values only include valid fields before upgrading.
-
-=== Redpanda Console v3.1.0
-
-The Console dependency has been updated to v3.1.0. The Console integration (`console.enabled=true`) now uses the chart-managed bootstrap user rather than the first user from `auth.sasl.users`.
-
-=== Deprecated Helm values
-
-The following Helm values are deprecated and will be removed in a future release:
-
-* `statefulset.sidecars.controllers.image`: Use `statefulset.sidecars.image` instead
-* `statefulset.sideCars.controllers.createRBAC`: Use `rbac.enabled` or per-controller settings instead
-* `statefulset.sideCars.controllers.run`: Use individual controller enabled fields instead
-
-=== Removed Helm values
-
-Several fields have been removed in favor of using `podTemplate`. Before upgrading, review your configurations and migrate removed fields to their `podTemplate` equivalents. For the complete list of removed fields and their replacements, see the link:https://github.com/redpanda-data/redpanda-operator/blob/release/v25.2.x/charts/redpanda/CHANGELOG.md[changelog^].
+link:https://github.com/redpanda-data/redpanda-operator/blob/release/v25.3.x/charts/redpanda/CHANGELOG.md[Changelog^].

modules/get-started/pages/release-notes/operator.adoc

Lines changed: 8 additions & 48 deletions
@@ -10,56 +10,16 @@ See also:
 * xref:upgrade:k-rolling-upgrade.adoc[]


-== Redpanda Operator v25.2.x
+== Redpanda Operator v25.3.x

-link:https://github.com/redpanda-data/redpanda-operator/blob/release/v25.2.x/operator/CHANGELOG.md[Changelog^]
+link:https://github.com/redpanda-data/redpanda-operator/blob/release/v25.3.x/operator/CHANGELOG.md[Changelog^]

-=== Cluster scope by default
+=== ShadowLink resource for disaster recovery

-Starting in v25.2, the Redpanda Operator defaults to cluster scope instead of namespace scope. This change provides several benefits:
+Redpanda Operator v25.3.x introduces the ShadowLink custom resource for managing shadow links in Kubernetes. The ShadowLink resource allows you to declaratively configure and manage disaster recovery replication between Redpanda clusters.

-* **Simplified management**: A single operator instance can manage multiple Redpanda clusters across different namespaces.
-* **Reduced resource overhead**: No need to deploy separate operator instances for each namespace.
-* **Centralized upgrades**: Upgrade the operator once to benefit all managed Redpanda clusters.
-* **Cross-namespace management**: Deploy the operator in a dedicated namespace (such as `redpanda-system`) while managing clusters in application namespaces.
-* **Simplified RBAC for debug bundles**: The Redpanda Operator now provides all required permissions for `rpk` debug bundle collection by default. The `rbac.createRPKBundleCRs` flag is no longer needed.
+* **Declarative configuration**: Define shadow links as Kubernetes resources with full lifecycle management.
+* **Status monitoring**: View shadow link health and replication status directly from Kubernetes.
+* **Integrated failover**: Delete the ShadowLink resource to fail over all topics.

-==== Migration considerations
-
-If you're upgrading from a previous version that used namespace-scoped operators:
-
-* **No manual steps required**: The Redpanda Operator automatically reconciles existing Redpanda clusters across namespaces.
-* **New deployments default to cluster scope**: Regardless of which namespace you deploy the Redpanda Operator to (including `default`).
-* **Delete extra Redpanda Operator deployments**: After upgrading, ensure only one Redpanda Operator remains in the cluster (the one running in cluster scope). Use `helm uninstall` to remove any other Redpanda Operator deployments from previous namespace-scoped installations.
-
-To maintain namespace scope, use the `--set 'additionalCmdFlags=["--namespace=<namespace>"]'` flag when installing or upgrading the Redpanda Operator. The `--namespace` flag in the helm command only specifies which namespace to deploy the Redpanda Operator into and does not affect its operational scope.
-
-WARNING: Do not run multiple Redpanda Operators in different scopes (cluster and namespace scope) in the same cluster as this can cause resource conflicts.
-
-==== RBAC requirements
-
-Important RBAC considerations for v25.2+:
-
-* **ClusterRole permissions always required**: Regardless of whether you use cluster or namespace scope, the Redpanda Operator always needs ClusterRole permissions.
-* **Automatic configuration**: These permissions are automatically configured when you install the Redpanda Operator.
-
-=== Declarative role management
-
-Redpanda Operator v25.2.x now includes a RedpandaRole custom resource. The RedpandaRole resource allows you to declaratively manage Redpanda roles and permissions in Kubernetes, making it easier to control access and automate security policies for your Redpanda clusters. See the xref:manage:kubernetes/security/authorization/k-role-controller.adoc[RedpandaRole documentation] for details.
-
-=== Redpanda Console v3 support (Console CRD)
-
-Redpanda Operator v25.2.x introduces support for Redpanda Console v3 through the new Console resource. This allows you to deploy and manage Redpanda Console v3 instances directly from the Redpanda Operator.
-
-The `console` stanza in the Redpanda resource is deprecated and will be removed in a future release.
-
-Existing deployments that use the `console` stanza in the Redpanda resource will be automatically migrated to the Console resource. The migration happens automatically when you upgrade to v25.2.x.
-
-If you manage your resources in version control, you should:
-
-. Fetch and commit the migrated Console CR after the migration completes.
-. Remove the `console` stanza from your Redpanda resource after the Console CR is committed to avoid configuration conflicts. Removing the stanza will not affect the migrated Console CR.
-
-The Redpanda Operator handles the migration process from version 2 of Redpanda Console to version 3. If any configurations cannot be migrated, the Redpanda Operator displays warnings in the `warnings` field of the Console resource. If you need to manually migrate any configurations, refer to the xref:migrate:console-v3.adoc[migration guide].
-
-All configuration and management of Redpanda Console should be done through the Console CR. See xref:console:config/configure-console.adoc[].
+See xref:manage:kubernetes/shadowing/k-shadow-linking.adoc[Shadow Linking in Kubernetes] for setup and xref:manage:kubernetes/monitoring/k-monitor-shadowing.adoc[monitoring] documentation.

modules/manage/examples/kubernetes/shadow-links.feature

Lines changed: 2 additions & 0 deletions
@@ -29,6 +29,7 @@ Feature: ShadowLink CRDs
 When I apply Kubernetes manifest:
 """
 ---
+# tag::basic-shadowlink-example[]
 apiVersion: cluster.redpanda.com/v1alpha2
 kind: ShadowLink
 metadata:
@@ -46,6 +47,7 @@ Feature: ShadowLink CRDs
 - name: topic1
 filterType: include
 patternType: literal
+# end::basic-shadowlink-example[]
 """
 And shadow link "link" is successfully synced
 Then I should find topic "topic1" in cluster "sasl"

modules/manage/pages/disaster-recovery/shadowing/failover-runbook.adoc

Lines changed: 10 additions & 35 deletions
@@ -14,11 +14,16 @@ include::shared:partial$enterprise-license.adoc[]
 endif::[]

 This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.
+
+ifndef::env-cloud[]
+NOTE: If you're running Redpanda in Kubernetes, see xref:manage:kubernetes/shadowing/k-failover-runbook.adoc[] for Kubernetes-specific emergency procedures.
+endif::[]
+
 // TODO: All command output examples in this guide need verification by running actual commands in test environment

 [IMPORTANT]
 ====
-This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the xref:manage:disaster-recovery/shadowing/overview.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
+This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:manage:disaster-recovery/shadowing/failover.adoc[]. Ensure you have completed the xref:manage:disaster-recovery/shadowing/overview.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
 ====

 ifdef::env-cloud[]
@@ -54,19 +59,7 @@ rpk cluster info --brokers shadow-cluster-1.example.com:9092,shadow-cluster-2.ex

 **Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.

-**Examples that require full failover:**
-
-* Primary cluster is completely unreachable (network partition, regional outage)
-* Multiple broker failures preventing writes to critical topics
-* Data center failure affecting majority of brokers
-* Persistent authentication or authorization failures across the cluster
-
-**Examples that may NOT require failover:**
-
-* Single broker failure with sufficient replicas remaining
-* Temporary network connectivity issues affecting some clients
-* High latency or performance degradation (but cluster still functional)
-* Non-critical topic or partition unavailability
+include::manage:partial$shadowing/failover-decision-examples.adoc[]

 [[verify-shadow-status]]
 === Verify shadow cluster status
@@ -144,9 +137,7 @@ Verify that the following conditions exist before proceeding with failover:

 Use xref:reference:rpk/rpk-shadow/rpk-shadow-status.adoc[`rpk shadow status`] or the link:/api/doc/cloud-dataplane/operation/operation-shadowlinkservice_listshadowlinktopics[Data Plane API] to check lag, which shows the message count difference between source and shadow partitions:

-* **Acceptable lag examples**: 0-1000 messages for low-throughput topics, 0-10000 messages for high-throughput topics
-* **Concerning lag examples**: Growing lag over 50,000 messages, or lag that continuously increases without recovering
-* **Critical lag examples**: Lag exceeding your data loss tolerance (for example, if you can only afford to lose 1 minute of data, lag should represent less than 1 minute of typical message volume)
+include::manage:partial$shadowing/replication-lag-guidelines.adoc[]

 [[document-state]]
 === Document current state
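
The lag guidance moved into the partial in the hunk above compares lag, measured in messages, against a data loss tolerance expressed as time. A minimal Python sketch of that conversion follows. It is illustrative only: the function names and the throughput figure are hypothetical, and it assumes you already know a topic's typical produce rate in messages per second.

```python
def lag_seconds(lag_messages: int, messages_per_second: float) -> float:
    """Convert a message-count lag into the seconds of data it represents."""
    if messages_per_second <= 0:
        raise ValueError("messages_per_second must be positive")
    return lag_messages / messages_per_second


def within_tolerance(lag_messages: int, messages_per_second: float,
                     tolerance_seconds: float) -> bool:
    """True when the lag represents less data than the stated loss tolerance."""
    return lag_seconds(lag_messages, messages_per_second) < tolerance_seconds


# Hypothetical topic producing about 2,000 messages per second,
# with a data loss tolerance of 1 minute.
print(lag_seconds(50_000, 2_000))           # seconds of data behind
print(within_tolerance(50_000, 2_000, 60))  # compared against 60 seconds
```

The same message count can be acceptable for a high-throughput topic and critical for a low-throughput one, which is why the comparison is made in time rather than in raw messages.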
@@ -241,7 +232,7 @@ ifdef::env-cloud[high_watermark]

 [IMPORTANT]
 ====
-Note the replication lag to estimate potential data loss during failover. The `Tasks` section shows the health of shadow link replication tasks. For details about what each task does, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
+Note the replication lag to estimate potential data loss during failover. The `Tasks` section shows the health of shadow link replication tasks. For details about what each task does, see xref:manage:disaster-recovery/shadowing/overview.adoc#shadow-link-tasks[Shadow link tasks].
 ====

 [[initiate-failover]]
@@ -574,22 +565,6 @@ Force deleting a shadow link immediately fails over all topics in the link. This

 **Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions. See link:https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda[How to manage consumer group offsets in Redpanda^] for detailed reset procedures.

-== Next steps
-
-After successful failover, focus on recovery planning and process improvement. Begin by assessing the source cluster failure and determining whether to restore the original cluster or permanently promote the shadow cluster as your new primary.
-
-**Immediate recovery planning:**
-
-1. **Assess source cluster**: Determine root cause of the outage
-2. **Plan recovery**: Decide whether to restore source cluster or promote shadow cluster permanently
-3. **Data synchronization**: Plan how to synchronize any data produced during failover
-4. **Fail forward**: Create a new shadow link with the failed over shadow cluster as source to maintain a DR cluster
-
-**Process improvement:**
-
-1. **Document the incident**: Record timeline, impact, and lessons learned
-2. **Update runbooks**: Improve procedures based on what you learned
-3. **Test regularly**: Schedule regular disaster recovery drills
-4. **Review monitoring**: Ensure monitoring caught the issue appropriately
+include::manage:partial$shadowing/failover-next-steps.adoc[]

 // end::single-source[]

modules/manage/pages/disaster-recovery/shadowing/failover.adoc

Lines changed: 9 additions & 36 deletions
@@ -18,26 +18,18 @@ You can failover a shadow link using the Redpanda Cloud UI, `rpk`, or the Data P
 endif::[]

 ifndef::env-cloud[]
-You can failover a shadow link using Redpanda Console, `rpk`, or the Admin API.
+You can failover a shadow link using Redpanda Console, `rpk`, or the Admin API.
+
+NOTE: If you are using Kubernetes, you can also use the Redpanda Operator's `ShadowLink` resource to manage failover. See xref:manage:kubernetes/shadowing/k-failover-runbook.adoc[Kubernetes Shadow Link Failover] for details.
 endif::[]

-include::shared:partial$emergency-shadowing-callout.adoc[]
+include::manage:partial$shadowing/emergency-shadowing-callout.adoc[]

 ifdef::env-cloud[]
-NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
+NOTE: Shadowing is supported on BYOC and Dedicated clusters running Redpanda version 25.3 and later.
 endif::[]

-== Failover behavior
-
-When you initiate failover, Redpanda performs the following operations:
-
-1. **Stops replication**: Halts all data fetching from the source cluster for the specified topics or entire shadow link
-2. **Failover topics**: Converts read-only shadow topics into regular, writable topics
-3. **Updates topic state**: Changes topic status from `ACTIVE` to `FAILING_OVER`, then `FAILED_OVER`
-
-Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.
-
-NOTE: To avoid a split-brain scenario after failover, ensure that all clients are reconfigured to point to the shadow cluster before resuming write activity.
+include::manage:partial$shadowing/failover-behavior.adoc[]

 == Failover commands

@@ -206,26 +198,7 @@ endif::[]
 Force deleting a shadow link is irreversible and immediately fails over all topics in the link, bypassing the normal failover state transitions. This action should only be used as a last resort when topics are stuck in transitional states and you need immediate access to all replicated data.
 ====

-== Failover states
-
-=== Shadow link states
-
-The shadow link itself has a simple state model:
-
-* **`ACTIVE`**: Shadow link is operating normally, replicating data
-* **`PAUSED`**: Shadow link replication is temporarily halted by user action
-
-Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.
-
-=== Shadow topic states
-
-Individual shadow topics progress through specific states during failover:
-
-* **`ACTIVE`**: Normal replication state before failover
-* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
-* **`FAILING_OVER`**: Failover initiated, replication stopping
-* **`FAILED_OVER`**: Failover completed successfully, topic fully writable
-* **`PAUSED`**: Replication temporarily halted by user action
+include::manage:partial$shadowing/failover-states.adoc[]

 == Monitor failover progress

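
The shadow topic states listed in the hunk above suggest a simple way to reason about failover progress: a link has finished failing over once every one of its topics reports `FAILED_OVER`. The following Python sketch models only that check. The state names come from the removed text, while the helper names, the topic names, and the idea of holding states in a dictionary are assumptions made for illustration, not a Redpanda API.

```python
from enum import Enum


class ShadowTopicState(Enum):
    ACTIVE = "ACTIVE"
    FAULTED = "FAULTED"
    FAILING_OVER = "FAILING_OVER"
    FAILED_OVER = "FAILED_OVER"
    PAUSED = "PAUSED"


def pending_topics(topic_states: dict) -> list:
    """Return the sorted names of topics that have not reached FAILED_OVER."""
    return sorted(
        name for name, state in topic_states.items()
        if state is not ShadowTopicState.FAILED_OVER
    )


def failover_complete(topic_states: dict) -> bool:
    """True only when at least one topic exists and all are FAILED_OVER."""
    return bool(topic_states) and not pending_topics(topic_states)


# Hypothetical snapshot of per-topic states during a failover.
snapshot = {
    "orders": ShadowTopicState.FAILED_OVER,
    "payments": ShadowTopicState.FAILING_OVER,
}
print(failover_complete(snapshot), pending_topics(snapshot))
```

Treating an empty snapshot as incomplete is a deliberate choice in this sketch, so that a failed status query is not mistaken for a finished failover.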
@@ -277,7 +250,7 @@ Task states during monitoring:
 * **`NOT_RUNNING`**: Task is not currently executing
 * **`LINK_UNAVAILABLE`**: Task cannot communicate with the source cluster

-For detailed information about shadow link tasks and their roles, see xref:manage:disaster-recovery/shadowing/setup.adoc#shadow-link-tasks[Shadow link tasks].
+For detailed information about shadow link tasks and their roles, see xref:manage:disaster-recovery/shadowing/overview.adoc#shadow-link-tasks[Shadow link tasks].


 == Post-failover cluster behavior
@@ -333,6 +306,6 @@ After completing failover:
 * Verify that applications can produce and consume messages normally
 * Consider deleting the shadow link if failover was successful and permanent

-For emergency situations, see xref:./failover-runbook.adoc[Failover Runbook].
+For emergency situations, see xref:manage:disaster-recovery/shadowing/failover-runbook.adoc[Failover Runbook].

 // end::single-source[]
