
Commit 7890a5e

Add a doc about maintaining a node (#2910)
1 parent a678898 commit 7890a5e

6 files changed: +307 −9 lines changed

en/TOC.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
 - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
 - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- [Maintain Kubernetes Nodes](maintain-a-kubernetes-node.md)
 - Troubleshoot
   - [Troubleshooting Tips](tips.md)
   - [Deployment Failures](deploy-failures.md)

en/maintain-a-kubernetes-node.md

Lines changed: 138 additions & 0 deletions
---
title: Maintain Kubernetes Nodes That Hold the TiDB Cluster
summary: Learn how to maintain Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes That Hold the TiDB Cluster

TiDB is a highly available database that can run smoothly when some of the database nodes go offline. Therefore, you can safely shut down and maintain the Kubernetes nodes that host TiDB clusters.

This document describes how to perform maintenance operations on Kubernetes nodes based on maintenance duration and storage type.

## Prerequisites

- Install [`kubectl`](https://kubernetes.io/docs/tasks/tools/).

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are enough for running the TiDB cluster.

## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable to prevent new Pods from being scheduled to this node:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

    - If the node has TiDB cluster component Pods, follow the subsequent steps in this document to migrate these Pods.
    - If the node does not have any TiDB cluster component Pods, there is no need to migrate Pods, and you can proceed directly with node maintenance.
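The check in step 2 can also be scripted. The following is a minimal sketch (the helper name `node_is_clear` is illustrative, not part of TiDB Operator) that succeeds only when no Operator-managed Pods remain on the node:

```shell
#!/bin/bash
# Succeed (exit 0) only if no TiDB Operator-managed Pods remain on the
# given node. Uses the same label selector as the check above.
node_is_clear() {
  local node_name="$1"
  local remaining
  remaining=$(kubectl get pod --all-namespaces -o wide \
    -l pingcap.com/managed-by=tidb-operator | grep -c "${node_name}")
  [ "${remaining}" -eq 0 ]
}
```

You could run this helper before and after migration; a non-zero count means some component Pods still need to be migrated.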
38+
39+
### Step 2: Migrate TiDB cluster component Pods
40+
41+
Based on the storage type of the Kubernetes node, choose the corresponding Pod migration strategy:
42+
43+
- **Automatically migratable storage**: use [Method 1: Reschedule Pods](#method-1-reschedule-pods-for-automatically-migratable-storage).
44+
- **Non-automatically migratable storage**: use [Method 2: Recreate instances](#method-2-recreate-instances-for-local-storage).
45+
46+
#### Method 1: Reschedule Pods (for automatically migratable storage)
47+
48+
If you use storage that supports automatic migration (such as [Amazon EBS](https://aws.amazon.com/ebs/)), you can reschedule component Pods by following [Perform a graceful restart of a single Pod in a component](restart-a-tidb-cluster.md#perform-a-graceful-restart-of-a-single-pod-in-a-component). The following instructions take rescheduling PD Pods as an example:
49+
50+
1. Check the PD Pod on the node to be maintained:
51+
52+
```shell
53+
kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
54+
```
55+
56+
2. Get the instance name of the PD Pod:
57+
58+
```shell
59+
kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
60+
```
61+
62+
3. Add a new label to the PD instance to trigger rescheduling:
63+
64+
```shell
65+
kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
66+
```
67+
68+
4. Confirm that the PD Pod is successfully scheduled to another node:
69+
70+
```shell
71+
watch kubectl -n ${namespace} get pod -o wide
72+
```
73+
74+
5. Follow the same steps to migrate Pods of other components such as TiKV and TiDB until all TiDB cluster component Pods on the node are migrated.
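The per-Pod loop in the steps above can be sketched as a single helper. This is an illustrative script, not part of the official docs: it assumes every Operator-managed Pod carries `pingcap.com/component` and `pingcap.com/instance` labels, as the commands above suggest:

```shell
#!/bin/bash
# For each TiDB Operator-managed Pod on the node, read the component kind
# and instance name from the Pod's labels, then apply a restart label to
# the instance so that TiDB Operator reschedules the Pod elsewhere.
relabel_instances_on_node() {
  local node_name="$1" ts="$2"
  kubectl get pod --all-namespaces -l pingcap.com/managed-by=tidb-operator \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.metadata.namespace}{" "}{.metadata.labels.pingcap\.com/component}{" "}{.metadata.labels.pingcap\.com/instance}{"\n"}{end}' |
  while read -r node ns component instance; do
    [ "${node}" = "${node_name}" ] || continue
    kubectl label "${component}" -n "${ns}" "${instance}" "pingcap.com/restartedAt=${ts}"
  done
}
```

For example: `relabel_instances_on_node ${node_name} 2025-06-30T12:00`. After running it, still confirm each Pod lands on another node before maintaining the machine.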
#### Method 2: Recreate instances (for local storage)

If the node uses storage that cannot be automatically migrated (such as local storage), you need to recreate instances.

> **Warning:**
>
> Recreating instances causes data loss. For stateful components such as TiKV, ensure that the cluster has sufficient replicas to guarantee data safety.

The following instructions take recreating a TiKV instance as an example:

1. Delete the CR of the TiKV instance. TiDB Operator automatically deletes the associated PVC and ConfigMap resources, and creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait for the status of the newly created TiKV instance to become `Ready`:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After you confirm that the TiDB cluster status is normal and data synchronization is completed, continue to maintain other components.
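The wait in step 2 can be automated with a simple polling loop. The sketch below uses illustrative names, and the position of the readiness column in `kubectl get tikv` output is an assumption you should verify against your cluster:

```shell
#!/bin/bash
# Poll a TiKV instance CR until it reports Ready, or give up after the
# timeout (seconds). The awk field assumes the second column is the
# readiness status; verify against your `kubectl get tikv` output.
wait_tikv_ready() {
  local ns="$1" name="$2" timeout="${3:-300}" elapsed=0
  while [ "${elapsed}" -lt "${timeout}" ]; do
    if kubectl get -n "${ns}" tikv "${name}" --no-headers 2>/dev/null \
        | awk '{print $2}' | grep -q 'True'; then
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  return 1
}
```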
### Step 3: Confirm migration completion

After you complete Pod migration, only the Pods managed by DaemonSets (such as network plugins and monitoring agents) should be running on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting it, updating the operating system, or performing hardware maintenance.

### Step 5: Recover after maintenance (for temporary maintenance only)

If you plan to perform long-term maintenance or permanently take the node offline, skip this step.

For temporary maintenance, perform the following recovery operations after the node maintenance is completed:

1. Check the node health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    When the node status becomes `Ready`, continue to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction on the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods are running normally:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    When all Pods are running normally, the maintenance operation is completed.
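The recovery steps above can be combined into one helper. A minimal sketch (the function name is illustrative); it assumes the default `STATUS` column of `kubectl get node`:

```shell
#!/bin/bash
# Wait until the node reports Ready, then lift the scheduling restriction.
recover_node() {
  local node_name="$1"
  until kubectl get node "${node_name}" --no-headers \
      | awk '{print $2}' | grep -q '^Ready$'; do
    sleep 10
  done
  kubectl uncordon "${node_name}"
}
```

After it returns, still check that the Pods on the node come back to the `Running` state, as step 3 describes.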

en/restart-a-tidb-cluster.md

Lines changed: 14 additions & 4 deletions
@@ -40,8 +40,18 @@ For a TiKV Pod, specify the `--grace-period` option when deleting the Pod to pro
     kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
     ```
 
-For other component Pods, you can delete them directly, because TiDB Operator will automatically handle a graceful restart:
+For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding Instance CR. The following uses the PD component as an example:
 
-```shell
-kubectl -n ${namespace} delete pod ${pod_name}
-```
+1. Query the PD Instance CR from the Pod:
+
+    ```shell
+    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
+    ```
+
+2. Add a new label to the PD instance to trigger a restart. For example:
+
+    ```shell
+    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
+    ```
+
+3. If this PD instance is the leader, TiDB Operator first transfers the leader role to another PD instance and then restarts the Pod.
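Instead of hard-coding the timestamp, you can generate the label value from the current time. A sketch (the helper name is illustrative, not from the docs); note that Kubernetes label values only allow alphanumerics, `-`, `_`, and `.`, so this sketch uses a colon-free time format:

```shell
#!/bin/bash
# Trigger a graceful restart of a PD instance by applying a restartedAt
# label derived from the current UTC time (colon-free, to satisfy the
# Kubernetes label-value character restrictions).
restart_pd_instance() {
  local ns="$1" instance="$2"
  local ts
  ts=$(date -u +%Y-%m-%dT%H-%M)
  kubectl label pd -n "${ns}" "${instance}" "pingcap.com/restartedAt=${ts}"
}
```

Usage: `restart_pd_instance ${namespace} ${pd_instance_name}`.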

zh/TOC.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
 - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
 - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- [Maintain Kubernetes Nodes That Hold the TiDB Cluster](maintain-a-kubernetes-node.md)
 - Troubleshoot
   - [Troubleshooting Tips](tips.md)
   - [Deployment Failures](deploy-failures.md)

zh/maintain-a-kubernetes-node.md

Lines changed: 138 additions & 0 deletions
---
title: Maintain Kubernetes Nodes That Hold the TiDB Cluster
summary: Describes how to maintain the Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes That Hold the TiDB Cluster

TiDB is a highly available database: the cluster keeps running normally even when some of its nodes go offline. Therefore, you can safely shut down and maintain the Kubernetes nodes that host TiDB clusters.

This document describes how to safely maintain Kubernetes nodes under different storage types and maintenance durations.

## Prerequisites

- Install [`kubectl`](https://kubernetes.io/zh-cn/docs/tasks/tools/)

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are enough for the TiDB cluster to run normally.

## Node maintenance steps

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable, preventing new Pods from being scheduled to it:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

    - If the node has TiDB cluster component Pods, follow the subsequent steps to migrate these Pods.
    - If the node does not have any TiDB cluster component Pods, there is no need to migrate Pods, and you can proceed directly with node maintenance.

### Step 2: Migrate TiDB cluster component Pods

Based on the storage type of the Kubernetes node, choose the corresponding Pod migration strategy:

- **Automatically migratable storage**: use [Method 1: Reschedule Pods](#method-1-reschedule-pods-for-automatically-migratable-storage)
- **Non-automatically migratable storage**: use [Method 2: Recreate instances](#method-2-recreate-instances-for-local-storage)

#### Method 1: Reschedule Pods (for automatically migratable storage)

If the storage used by the Kubernetes node supports automatic migration (such as [Amazon EBS](https://aws.amazon.com/cn/ebs/)), you can reschedule each component Pod by following [Perform a graceful restart of a single Pod in a component](restart-a-tidb-cluster.md#优雅重启某个组件的单个-pod). The following takes the PD component as an example:

1. Check the PD Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
    ```

2. Get the instance name corresponding to the PD Pod:

    ```shell
    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
    ```

3. Add a new label to the PD instance to trigger rescheduling:

    ```shell
    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
    ```

4. Confirm that the PD Pod has been successfully scheduled to another node:

    ```shell
    watch kubectl -n ${namespace} get pod -o wide
    ```

5. Follow the same steps to migrate other component Pods such as TiKV and TiDB, until all TiDB cluster component Pods on the node to be maintained are migrated.

#### Method 2: Recreate instances (for local storage)

If the storage used by the Kubernetes node does not support automatic migration (such as local storage), you need to recreate instances.

> **Warning:**
>
> Recreating instances causes data loss. For stateful components such as TiKV, ensure that the cluster has sufficient replicas to guarantee data safety.

The following takes recreating a TiKV instance as an example:

1. Delete the CR of the TiKV instance. TiDB Operator automatically deletes the associated resources such as PVCs and ConfigMaps, and creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait for the status of the newly created TiKV instance to become `Ready`:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After you confirm that the TiDB cluster status is normal and data synchronization is completed, continue to maintain other components.

### Step 3: Confirm migration completion

After Pod migration is completed, only Pods managed by DaemonSets (such as network plugins and monitoring agents) should be running on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting it, updating the operating system, or performing hardware maintenance.

### Step 5: Recover after maintenance (for temporary maintenance only)

If you plan to perform long-term maintenance or permanently take the node offline, skip this step.

For temporary maintenance, perform the following recovery operations after the node maintenance is completed:

1. Check the node health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    When the node status becomes `Ready`, continue to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction on the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods have returned to normal operation:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    When all Pods are running normally, the maintenance operation is completed.

zh/restart-a-tidb-cluster.md

Lines changed: 15 additions & 5 deletions
@@ -34,14 +34,24 @@ spec:
 
 You can restart a specific Pod in the TiDB cluster individually. The operations differ slightly for Pods of different components.
 
-For a TiKV Pod, to ensure there is enough time to evict Region leaders, specify the `--grace-period` option when deleting the Pod; otherwise, the operation might fail. The following example sets a 60-second grace period for a TiKV Pod:
+For a TiKV Pod, to ensure there is enough time to evict Region Leaders, specify the `--grace-period` option when deleting the Pod; otherwise, the operation might fail. The following example sets a 60-second grace period for a TiKV Pod:
 
 ```shell
 kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
 ```
 
-Pods of other components can be deleted directly; TiDB Operator gracefully restarts them automatically:
+For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding instance (Instance CR). The following takes PD as an example:
 
-```shell
-kubectl -n ${namespace} delete pod ${pod_name}
-```
+1. Query the PD Instance CR corresponding to the Pod:
+
+    ```shell
+    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
+    ```
+
+2. Add a new label to the PD instance to trigger a restart. For example:
+
+    ```shell
+    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
+    ```
+
+3. If this PD instance is the Leader, TiDB Operator first transfers the Leader role to another PD instance and then restarts the Pod.
