Merged
9 changes: 9 additions & 0 deletions local-antora-playbook.yml
@@ -49,6 +49,15 @@ antora:
      filter: docker-compose
      env_type: Docker
      attribute_name: docker-labs-index
    - require: '@sntke/antora-mermaid-extension'
      mermaid_library_url: https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs
      script_stem: mermaid-scripts
      mermaid_initialize_options:
        start_on_load: true
        theme: base
        theme_variables:
          line_color: '#e2401b'
          font_family: Inter, sans-serif
    - require: '@redpanda-data/docs-extensions-and-macros/extensions/collect-bloblang-samples'
    - require: '@redpanda-data/docs-extensions-and-macros/extensions/generate-rp-connect-categories'
    - require: '@redpanda-data/docs-extensions-and-macros/extensions/modify-redirects'
1 change: 1 addition & 0 deletions modules/ROOT/nav.adoc
@@ -133,6 +133,7 @@
*** xref:manage:kubernetes/k-remote-read-replicas.adoc[Remote Read Replicas]
*** xref:manage:kubernetes/k-manage-resources.adoc[Manage Pod Resources]
*** xref:manage:kubernetes/k-scale-redpanda.adoc[Scale]
*** xref:manage:kubernetes/k-nodewatcher.adoc[]
*** xref:manage:kubernetes/k-decommission-brokers.adoc[Decommission Brokers]
*** xref:manage:kubernetes/k-recovery-mode.adoc[Recovery Mode]
*** xref:manage:kubernetes/monitoring/index.adoc[Monitor]
@@ -77,7 +77,7 @@ Rack awareness is just one aspect of availability. Check out xref:deploy:deploym

=== Cost

Infrastructure costs increase with each broker, so adding a broker means an additional instance to pay for. In this example we deploy to GKE on seven https://gcloud-compute.com/n2-standard-8.html[n2-standard-8^] GCP instances. This means that the instance cost of the cluster is around $1.9K per month. Dropping down to 5 brokers would save over $500 per month, and dropping down to 3 brokers would save around $1100 per month. Of course, there are other costs to consider, but they won't be as impacted by changing the broker count.
Infrastructure costs increase with each broker because each broker requires a dedicated node (instance), so adding a broker means an additional instance cost. For example, if the instance cost is $1925 per month in a cluster with seven brokers, the instance cost for each broker is $275. Reducing the number of brokers from seven to five would save $550 per month ($275 x 2), and reducing it to three brokers would save $1100 per month. You must also consider other costs, but they won't be as impacted by changing the broker count.
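
The arithmetic in this example can be sketched as a small shell calculation. The function name `broker_savings` is hypothetical, and the sketch assumes one dedicated instance per broker at a flat monthly price, using the example figures above.

```bash
# Hypothetical helper: monthly savings from reducing the broker count,
# assuming each broker runs on one instance at the same monthly price.
broker_savings() {
  local cost_per_broker=$1 current=$2 target=$3
  echo $(( cost_per_broker * (current - target) ))
}

cluster_cost=1925                          # example monthly cost for seven brokers
brokers=7
per_broker=$(( cluster_cost / brokers ))   # 1925 / 7 = 275

printf 'Per broker: $%d\n' "$per_broker"
printf 'Seven to five brokers: $%d\n' "$(broker_savings "$per_broker" 7 5)"    # 550
printf 'Seven to three brokers: $%d\n' "$(broker_savings "$per_broker" 7 3)"   # 1100
```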

=== Data retention

333 changes: 219 additions & 114 deletions modules/manage/pages/kubernetes/k-decommission-brokers.adoc

Large diffs are not rendered by default.

207 changes: 207 additions & 0 deletions modules/manage/pages/kubernetes/k-nodewatcher.adoc
@@ -0,0 +1,207 @@
= Install the Nodewatcher Controller
:page-categories: Management
:env-kubernetes: true
:description: pass:q[The Nodewatcher controller is an emergency backstop for Redpanda clusters that use PersistentVolumes (PVs) for the Redpanda data directory. When a node running a Redpanda Pod suddenly goes offline, Nodewatcher detects the lost node, retains the associated PV, and removes the corresponding PersistentVolumeClaim (PVC). This workflow allows the Redpanda Pod to be rescheduled on a new node without losing critical data.]

{description}

:warning-caption: Emergency use only

[WARNING]
====
The Nodewatcher controller is intended only for emergency scenarios (for example, node hardware or infrastructure failures). *Never use the Nodewatcher controller as a routine method for removing brokers.* If you want to remove brokers, see xref:manage:kubernetes/k-decommission-brokers.adoc[Decommission brokers] for the correct procedure.
====

:warning-caption: Warning

== Why use Nodewatcher?

If a worker node hosting a Redpanda Pod suddenly fails or disappears, Kubernetes might leave the associated PV and PVC in an _attached_ or _in-use_ state. Without Nodewatcher (or manual intervention), the Redpanda Pod cannot safely reschedule to another node because the volume is still recognized as occupied. Also, the default reclaim policy might delete the volume, risking data loss. Nodewatcher automates the steps needed to retain the volume and remove the stale PVC, so Redpanda Pods can move to healthy nodes without losing the data in the original PV.

== How Nodewatcher works

When the controller detects events that indicate a Node resource is no longer available, it does the following:

- For each Redpanda Pod on that Node, it identifies the PVC (if any) the Pod was using for its storage.
- It sets the reclaim policy of the affected PersistentVolume (PV) to `Retain`.
- It deletes the associated PersistentVolumeClaim (PVC) to allow the Redpanda broker Pod to reschedule onto a new, operational node.

[mermaid]
....
flowchart TB
%% Define classes
classDef systemAction fill:#F6FBF6,stroke:#25855a,stroke-width:2px,color:#20293c,rx:5,ry:5

A[Node fails] --> B{Is Node<br>running Redpanda?}:::systemAction
B -- Yes --> C[Identify Redpanda Pod PVC]:::systemAction
C --> D[Set PV reclaim policy to 'Retain']:::systemAction
D --> E[Delete PVC]:::systemAction
E --> F[Redpanda Pod<br>is rescheduled]:::systemAction
B -- No --> G[Ignore event]:::systemAction
....
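
For illustration, the same sequence can be approximated by hand with `kubectl`. This is a minimal sketch, not the controller's implementation: the function name `retain_and_release` is hypothetical, and it assumes you already know the namespace and the name of the PVC that belonged to the Pod on the failed node.

```bash
# Hypothetical manual equivalent of the steps above (not the controller's code).
# Usage: retain_and_release <namespace> <pvc-name>
retain_and_release() {
  local namespace=$1 pvc=$2 pv

  # Find the PV that is bound to the PVC.
  pv=$(kubectl get pvc "$pvc" --namespace "$namespace" \
    -o jsonpath='{.spec.volumeName}') || return 1
  if [ -z "$pv" ]; then
    echo "PVC $pvc is not bound to a PV" >&2
    return 1
  fi

  # Keep the volume and its data when the claim is deleted.
  kubectl patch pv "$pv" \
    -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' || return 1

  # Delete the stale claim so that the Pod can be rescheduled with a new one.
  # --wait=false returns immediately instead of blocking on finalizers.
  kubectl delete pvc "$pvc" --namespace "$namespace" --wait=false
}
```

The order matters in this sketch: the reclaim policy is changed before the claim is deleted, so that deleting the claim cannot trigger deletion of the volume.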


== Install Nodewatcher

[tabs]
======
Helm + Operator::
+
--

You can install the Nodewatcher controller as part of the Redpanda Operator or as a sidecar on each Pod that runs a Redpanda broker. When you install the controller as part of the Redpanda Operator, the controller monitors all Redpanda clusters running in the same namespace as the Redpanda Operator. If you want the controller to manage only a single Redpanda cluster, install it as a sidecar on each Pod that runs a Redpanda broker, using the Redpanda resource.

To install the Nodewatcher controller as part of the Redpanda Operator:

. Deploy the Redpanda Operator with the Nodewatcher controller:
+
[,bash,subs="attributes+",lines=7+8]
----
helm repo add redpanda https://charts.redpanda.com
helm repo update
helm upgrade --install redpanda-controller redpanda/operator \
  --namespace <namespace> \
  --set image.tag={latest-operator-version} \
  --create-namespace \
  --set additionalCmdFlags={--additional-controllers="nodeWatcher"} \
  --set rbac.createAdditionalControllerCRs=true
----
+
- `--additional-controllers="nodeWatcher"`: Enables the Nodewatcher controller.
- `rbac.createAdditionalControllerCRs=true`: Creates the RBAC rules that the Redpanda Operator requires to monitor Node resources and update PVCs and PVs.

. Deploy a Redpanda resource:
+
.`redpanda-cluster.yaml`
[,yaml]
----
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec: {}
+
```bash
kubectl apply -f redpanda-cluster.yaml --namespace <namespace>
```

To install the Nodewatcher controller as a sidecar:

.`redpanda-cluster.yaml`
[,yaml,lines=11+13+15]
----
apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
metadata:
  name: redpanda
spec:
  chartRef: {}
  clusterSpec:
    statefulset:
      sideCars:
        controllers:
          enabled: true
          run:
            - "nodeWatcher"
    rbac:
      enabled: true

- `statefulset.sideCars.controllers.enabled`: Enables the controllers sidecar.
- `statefulset.sideCars.controllers.run`: Enables the Nodewatcher controller.
- `rbac.enabled`: Creates the required RBAC rules for the controller to monitor the Node resources and update PVCs and PVs.

--
Helm::
+
--
[tabs]
====
--values::
+
.`nodewatcher-controller.yaml`
[,yaml,lines=4+6+8]
----
statefulset:
  sideCars:
    controllers:
      enabled: true
      run:
        - "nodeWatcher"
rbac:
  enabled: true
+
- `statefulset.sideCars.controllers.enabled`: Enables the controllers sidecar.
- `statefulset.sideCars.controllers.run`: Enables the Nodewatcher controller.
- `rbac.enabled`: Creates the required RBAC rules for the controller to monitor the Node resources and update PVCs and PVs.

--set::
+
[,bash,lines=4-6]
----
helm upgrade --install redpanda redpanda/redpanda \
  --namespace <namespace> \
  --create-namespace \
  --set statefulset.sideCars.controllers.enabled=true \
  --set statefulset.sideCars.controllers.run={"nodeWatcher"} \
  --set rbac.enabled=true
----
+
- `statefulset.sideCars.controllers.enabled`: Enables the controllers sidecar.
- `statefulset.sideCars.controllers.run`: Enables the Nodewatcher controller.
- `rbac.enabled`: Creates the required RBAC rules for the controller to monitor the Node resources and update PVCs and PVs.

====
--
======
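
If you installed the controller as a sidecar, one way to confirm that the sidecar container is present is to list the container names of a broker Pod. This is a minimal sketch: the function name `has_controllers_sidecar` is hypothetical, and the container name `redpanda-controllers` is taken from the log command used later on this page.

```bash
# Hypothetical check: does the Pod include the redpanda-controllers sidecar?
# Usage: has_controllers_sidecar <namespace> <pod-name>
has_controllers_sidecar() {
  local namespace=$1 pod=$2 names

  # jsonpath prints the container names separated by spaces.
  names=$(kubectl get pod "$pod" --namespace "$namespace" \
    -o jsonpath='{.spec.containers[*].name}') || return 1

  # Pad with spaces so that only whole names match.
  case " $names " in
    *" redpanda-controllers "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

The function returns a zero exit status only when the sidecar is listed, so it can be used in a condition such as `has_controllers_sidecar <namespace> <pod-name> && echo "sidecar found"`.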

== Test the Nodewatcher controller

. Test the Nodewatcher controller by deleting a Node resource:
+
[,bash]
----
kubectl delete node <node-name>
----
+
NOTE: This step is for testing purposes only.

. Monitor the logs of the Nodewatcher controller:
+
--
- If you're running the Nodewatcher controller as part of the Redpanda Operator:
+
[,bash]
----
kubectl logs -l app.kubernetes.io/name=operator -c manager --namespace <namespace>
----

- If you're running the Nodewatcher controller as a sidecar:
+
[,bash]
----
kubectl logs <pod-name> --namespace <namespace> -c redpanda-controllers
----
--
+
The logs should show that the controller deleted the PVC of the Pod that was running on the deleted Node resource. To confirm, list the PVCs in the namespace:
+
[,bash]
----
kubectl get persistentvolumeclaim --namespace <namespace>
----

. Verify that the reclaim policy of the PV is set to `Retain`, which preserves the volume so that you can recover the data, if necessary. PVs are cluster-scoped, so no namespace is required:
+
[,bash]
----
kubectl get persistentvolume
----
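
To make the reclaim policy easier to spot, you can print only the relevant columns or check a single PV. This is a minimal sketch: the function names `list_pv_policies` and `pv_is_retained` are hypothetical, and the field paths are standard PersistentVolume fields.

```bash
# Hypothetical helper: show each PV with its reclaim policy, phase, and claim.
list_pv_policies() {
  kubectl get persistentvolume \
    -o custom-columns='NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy,STATUS:.status.phase,CLAIM:.spec.claimRef.name'
}

# Hypothetical helper: succeed only if the named PV is set to Retain.
# Usage: pv_is_retained <pv-name>
pv_is_retained() {
  local policy
  policy=$(kubectl get persistentvolume "$1" \
    -o jsonpath='{.spec.persistentVolumeReclaimPolicy}') || return 1
  [ "$policy" = "Retain" ]
}
```

A retained PV whose claim has been deleted typically reports the `Released` phase in the `STATUS` column.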

After the Nodewatcher controller has finished, xref:manage:kubernetes/k-decommission-brokers.adoc[decommission the broker] that was running on the failed node. This is necessary to prevent a potential loss of quorum and ensure cluster stability.

NOTE: Make sure to use the `--force` flag when decommissioning the broker with xref:reference:rpk/rpk-redpanda/rpk-redpanda-admin-brokers-decommission.adoc[`rpk redpanda admin brokers decommission`]. This flag is required when the broker is no longer running.
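
As a sketch of that final step, the functions below list the brokers so that you can find the ID of the lost broker, and then force its decommission. The function names, the Pod name `redpanda-0`, and the container name `redpanda` are assumptions about a default deployment. Clusters with TLS or SASL enabled need additional `rpk` flags that are not shown here.

```bash
# Hypothetical wrappers around rpk, run inside a healthy broker Pod.
# Assumes a Pod named redpanda-0 with a container named redpanda.

# Usage: list_brokers <namespace>
list_brokers() {
  local namespace=$1
  kubectl exec redpanda-0 --namespace "$namespace" -c redpanda -- \
    rpk redpanda admin brokers list
}

# Usage: force_decommission <namespace> <broker-id>
force_decommission() {
  local namespace=$1 broker_id=$2
  kubectl exec redpanda-0 --namespace "$namespace" -c redpanda -- \
    rpk redpanda admin brokers decommission "$broker_id" --force
}
```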