
Commit b264be4

DOC-875 Add section on disabling automatic node maintenance and OS upgrades (#941)
Co-authored-by: Joyce Fee <[email protected]>
1 parent 97a0e3a commit b264be4

2 files changed: 49 additions and 15 deletions
modules/deploy/pages/deployment-option/self-hosted/kubernetes/k-deployment-overview.adoc

Lines changed: 9 additions & 6 deletions
@@ -96,6 +96,8 @@ Managed Kubernetes services, such as Google Kubernetes Engine (GKE) and Amazon E

 You remain responsible for deploying and maintaining Redpanda instances on worker nodes.

+IMPORTANT: Deploy Kubernetes clusters with *unmanaged (manual) node updates*. Managed (automatic) updates during cluster deployment can lead to service downtime, data loss, or quorum instability. Transitioning from managed updates to unmanaged updates after deployment may require downtime. To avoid these disruptions, plan for unmanaged node updates from the start. See xref:deploy:deployment-option/self-hosted/kubernetes/k-requirements.adoc#node-updates[Kubernetes Cluster Requirements and Recommendations].
+
 === Bare-metal Kubernetes environments

 Bare-metal Kubernetes environments give you complete control over both the control plane and the worker nodes, which can be advantageous when you want the following:
@@ -113,14 +115,15 @@ This documentation follows conventions to help users easily identify Kubernetes

 == Next steps

-Whether you're deploying locally or in the cloud, choose one of the following guides to get started:
+- Get started
+** xref:./local-guide.adoc[Local Deployment Guide] (kind and minikube)
+** xref:./aks-guide.adoc[Azure Kubernetes Service Guide] (AKS)
+** xref:./eks-guide.adoc[Elastic Kubernetes Service Guide] (EKS)
+** xref:./gke-guide.adoc[Google Kubernetes Engine Guide] (GKE)

-* xref:./local-guide.adoc[Local Deployment Guide] (kind and minikube)
-* xref:./aks-guide.adoc[Azure Kubernetes Service Guide] (AKS)
-* xref:./eks-guide.adoc[Elastic Kubernetes Service Guide] (EKS)
-* xref:./gke-guide.adoc[Google Kubernetes Engine Guide] (GKE)
+- xref:deploy:deployment-option/self-hosted/kubernetes/k-requirements.adoc[Kubernetes Cluster Requirements and Recommendations]

-Or, explore our xref:./k-production-workflow.adoc[production workflow] to learn about requirements and best practices.
+- xref:./k-production-workflow.adoc[Production deployment workflow]

 include::shared:partial$suggested-reading.adoc[]

modules/deploy/partials/requirements.adoc

Lines changed: 40 additions & 9 deletions
@@ -31,17 +31,17 @@ https://helm.sh/docs/intro/install/[Install Helm^].
 endif::[]

 [[number-of-workers]]
-== Number of {node}s
+== Number of nodes

 Provision one physical node or virtual machine (VM) for each Redpanda broker that you plan to deploy in your Redpanda cluster.
-Each Redpanda broker requires its own dedicated {node} for the following reasons:
+Each Redpanda broker requires its own dedicated node for the following reasons:

-- *Resource isolation*: Redpanda brokers are designed to make full use of available system resources, including CPU and memory. By dedicating a {node} to each broker, you ensure that these resources aren't shared with other applications or processes, avoiding potential performance bottlenecks or contention.
-- *External networking*: External clients should connect directly to the broker that owns the partition they're interested in. This means that each broker must be individually addressable. As clients must connect to the specific broker that is the leader of the partition, they need a mechanism to directly address each broker in the cluster. Assigning each broker to its own dedicated {node} makes this direct addressing feasible, since each {node} will have a unique address. See <<External networking>>.
+- *Resource isolation*: Redpanda brokers are designed to make full use of available system resources, including CPU and memory. By dedicating a node to each broker, you ensure that these resources aren't shared with other applications or processes, avoiding potential performance bottlenecks or contention.
+- *External networking*: External clients should connect directly to the broker that owns the partition they're interested in. This means that each broker must be individually addressable. As clients must connect to the specific broker that is the leader of the partition, they need a mechanism to directly address each broker in the cluster. Assigning each broker to its own dedicated node makes this direct addressing feasible, since each node will have a unique address. See <<External networking>>.
 - *Fault tolerance*: Ensuring each broker operates on a separate node enhances fault tolerance. If one node experiences issues, it won't directly impact the other brokers.

 ifdef::env-kubernetes[]
-NOTE: The Redpanda Helm chart configures xref:reference:k-redpanda-helm-spec.adoc#statefulset-podantiaffinity[`podAntiAffinity` rules] to make sure that each Redpanda broker runs on its own {node}.
+NOTE: The Redpanda Helm chart configures xref:reference:k-redpanda-helm-spec.adoc#statefulset-podantiaffinity[`podAntiAffinity` rules] to make sure that each Redpanda broker runs on its own node.


 *Recommendations*: xref:./kubernetes-deploy.adoc#pod-replicas[Deploy at least three Pod replicas].
@@ -51,11 +51,42 @@ ifndef::env-kubernetes[]
 *Recommendations*: Deploy at least three Redpanda brokers.
 endif::[]

+[[node-updates]]
+== Node maintenance and operating system upgrades
+
+Ensure that node and operating system (OS) upgrades are manually managed when running Redpanda in production. Manual control avoids unplanned reboots or replacements that disrupt Redpanda brokers, causing service downtime, data loss, or quorum instability.
+
+=== Limitations of automatic updates
+
+Redpanda is stateful. Redpanda brokers manage partition data and leadership, making them sensitive to disruptions. Proper handling during maintenance is required to:
+
+- Avoid data loss, especially for nodes with ephemeral or local storage.
+- Ensure smooth leadership transitions by decommissioning brokers before removing a node.
+- Minimize service downtime by upgrading nodes one at a time during planned maintenance windows.
+
+However, automatic update mechanisms provided by cloud platforms may not meet Redpanda's stateful requirements. Common issues include:
+
+- Hard timeouts for graceful shutdowns that may not allow Redpanda brokers enough time to complete decommissioning or leadership transitions.
+- Replacements or reboots without ensuring data has been safely migrated or replicated, risking data loss.
+- Parallel upgrades across multiple nodes, which can disrupt quorum or reduce cluster availability.
+
+*Recommendations*:
+
+- Disable automatic node maintenance or upgrades.
+ifdef::env-kubernetes[]
+To prevent managed Kubernetes services from automatically rebooting or upgrading nodes:
+** **Azure AKS**: Set the OS upgrade channel to `None`. https://learn.microsoft.com/en-us/azure/aks/auto-upgrade-node-os-image[Azure Documentation^].
+** **Google GKE**: Disable GKE auto-upgrades for node pools. https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-upgrades[GCP Documentation^].
+** **Amazon EKS**: Avoid enabling EKS node auto-upgrades. https://docs.aws.amazon.com/eks/latest/userguide/worker.html[AWS Documentation^].
+- xref:upgrade:k-upgrade-kubernetes.adoc[Manually manage node upgrades].
+endif::[]
+
+
 == CPU and memory

 *Requirements*:

-- Two physical, not virtual, cores for each {node}.
+- Two physical, not virtual, cores for each node.

 - x86_64 (Westmere or newer) and AWS Graviton family processors are supported.

@@ -65,7 +96,7 @@ endif::[]

 *Recommendations*:

-- Four physical cores for each {node} are strongly recommended.
+- Four physical cores for each node are strongly recommended.

 ifdef::env-kubernetes[]
 - xref:./kubernetes-deploy.adoc#resources[Set resource requests and limits for memory and CPU].
@@ -106,7 +137,7 @@ endif::[]

 == External networking

-- For external access, each {node} in your cluster must have a static, externally accessible IP address.
+- For external access, each node in your cluster must have a static, externally accessible IP address.

 - Minimum 10 GigE (10 Gigabit Ethernet) connection to ensure:

@@ -120,7 +151,7 @@ endif::[]

 == Tuning

-Before deploying Redpanda to production, each {node} that runs Redpanda must be tuned to optimize the Linux kernel for Redpanda processes.
+Before deploying Redpanda to production, each node that runs Redpanda must be tuned to optimize the Linux kernel for Redpanda processes.

 ifdef::env-kubernetes[]
 See xref:deploy:deployment-option/self-hosted/kubernetes/k-tune-workers.adoc[].
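
The `node-updates` section added in this commit names a setting per managed service but does not show a command. The following is a minimal sketch of how the AKS and GKE settings might be applied from the command line, assuming the Azure CLI (`az`) and Google Cloud CLI (`gcloud`) are installed and authenticated. It is not part of the commit: the helper function names, the `DRY_RUN` variable, and every resource group, cluster, node pool, and zone name are hypothetical placeholders. EKS is omitted because the added text only advises against enabling node auto-upgrades there.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not part of the commit.
# Prints (dry run) or executes the CLI calls that turn off automatic node upgrades.
set -euo pipefail

# Assumption: default to a dry run so that nothing changes unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# AKS: set the node OS upgrade channel to None.
# Arguments (hypothetical names): resource group, cluster name.
disable_aks_node_os_upgrades() {
  local resource_group="$1" cluster="$2"
  run az aks update --resource-group "$resource_group" --name "$cluster" \
    --node-os-upgrade-channel None
}

# GKE: disable auto-upgrade for a single node pool.
# Arguments (hypothetical names): cluster name, node pool name, zone.
disable_gke_node_auto_upgrade() {
  local cluster="$1" pool="$2" zone="$3"
  run gcloud container node-pools update "$pool" --cluster "$cluster" \
    --zone "$zone" --no-enable-autoupgrade
}
```

With the default `DRY_RUN=1`, calling `disable_aks_node_os_upgrades my-rg my-cluster` only prints the `az` command for review; setting `DRY_RUN=0` would execute it against the named cluster. Check the linked provider documentation before running either command, because release channels or cluster modes may restrict these settings.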
