Commit bf2445e (1 parent: 2d6b3a4)

committed: add a doc about maintaining a node

File tree: 6 files changed, +296 −9 lines changed

en/TOC.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
 - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
 - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- [Maintain Kubernetes Nodes](maintain-a-kubernetes-node.md)
 - Reference
 - Architecture
 - [TiDB Operator](architecture.md)

en/maintain-a-kubernetes-node.md

Lines changed: 132 additions & 0 deletions
---
title: Maintain Kubernetes Nodes that Hold the TiDB Cluster
summary: Learn how to maintain Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes that Hold the TiDB Cluster

TiDB is a highly available database that keeps running even when some of its database nodes go offline. For this reason, you can safely shut down and maintain the underlying Kubernetes nodes without disrupting TiDB's service.

This document describes in detail how to perform maintenance operations on Kubernetes nodes. Different operation strategies are provided based on the maintenance duration and the storage type.

## Prerequisites

- [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are sufficient to run the TiDB cluster.
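One way to gauge the remaining headroom before taking a node offline is to compare per-node resource requests against allocatable capacity. The following is only a sketch; the `kubectl top` command additionally assumes the metrics-server addon is installed in your cluster:

```shell
# Show allocatable resources and current requests/limits per node.
kubectl describe nodes | grep -A 8 "Allocated resources" || true

# Live CPU/memory usage per node (requires the metrics-server addon).
kubectl top nodes || true
```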
## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable, which prevents new Pods from being scheduled to the node:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether there are any TiDB cluster component Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

### Step 2: Migrate TiDB cluster component Pods

Choose the appropriate Pod migration strategy based on your storage type:

#### Option A: Reschedule Pods (for automatically migratable storage)

If the node storage can be automatically migrated (such as [Amazon EBS](https://aws.amazon.com/ebs/)), you can refer to [Gracefully restart a single Pod of a component](restart-a-tidb-cluster.md) to reschedule the component Pods. Using the PD component as an example:

1. Check the PD Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
    ```

2. Check the instance name corresponding to the PD Pod:

    ```shell
    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
    ```

3. Add a new label to the PD instance to trigger rescheduling:

    ```shell
    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12_00
    ```

4. Confirm that the PD Pod has been successfully scheduled to another node:

    ```shell
    watch kubectl -n ${namespace} get pod -o wide
    ```

5. Repeat the preceding steps for the other components (TiKV, TiDB, and so on) until all TiDB cluster component Pods on the node to be maintained have been migrated.
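The repetition in step 5 can be sketched as a small loop. This is only a sketch: the `namespace` value is hypothetical, and it assumes the `tikv` and `tidb` CR kinds accept the same `pingcap.com/restartedAt` label as `pd`. Note that Kubernetes label values may not contain `:`, so the timestamp below avoids it:

```shell
# Trigger rescheduling for the remaining component instances.
# Assumptions: the "tikv" and "tidb" CR kinds take the same restart label
# as "pd", and "tidb-cluster" is a hypothetical namespace.
namespace="tidb-cluster"
restarted_at="$(date +%Y-%m-%dT%H-%M)"   # label values may not contain ":"

for kind in tikv tidb; do
  # List instances of this kind; the loop body is skipped if none are found.
  for name in $(kubectl get "$kind" -n "$namespace" -o name 2>/dev/null); do
    kubectl label -n "$namespace" "$name" "pingcap.com/restartedAt=$restarted_at"
  done
done

echo "requested restart with pingcap.com/restartedAt=$restarted_at"
```

After the loop completes, re-run the Pod listing from Step 1 to confirm that no TiDB cluster component Pods remain on the node.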
#### Option B: Recreate instances (for local storage)

If the node storage cannot be automatically migrated (such as local storage), you need to recreate the instances:

> **Warning:**
>
> Recreating an instance causes its data on the node to be lost. For stateful components such as TiKV, make sure that the cluster has enough replicas to guarantee data safety.

Using recreating a TiKV instance as an example:

1. Delete the TiKV instance CR. TiDB Operator deletes the associated PVC, ConfigMap, and other resources, and automatically creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait until the newly created TiKV instance becomes Ready:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After confirming that the TiDB cluster status is normal and data replication is complete, you can continue to maintain the other components.
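Instead of polling `kubectl get` manually in step 2, you can block on the instance's status with `kubectl wait`. This is a sketch: the namespace and instance names are hypothetical, and it assumes the `tikv` CR publishes a `Ready` status condition:

```shell
# Hypothetical names; replace with your own namespace and instance.
namespace="tidb-cluster"
tikv_instance_name="basic-tikv-0"

# Block for up to 10 minutes until the recreated instance reports Ready.
kubectl wait -n "$namespace" "tikv/$tikv_instance_name" \
  --for=condition=Ready --timeout=10m \
  || echo "wait did not succeed (no cluster access, or the instance is not Ready)"
```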
### Step 3: Confirm that the migration is complete

At this point, only Pods managed by DaemonSets (such as network plugins and monitoring agents) should remain on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### Step 4: Perform node maintenance

You can now safely perform node maintenance operations, such as restarting the node, updating the system, or servicing hardware.
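As a concrete illustration of temporary maintenance, the following sketch applies OS updates and reboots the node; the node name, SSH access, and Debian-style package manager are all assumptions about your environment:

```shell
# Hypothetical maintenance run on a Linux node; adapt to your environment.
node="worker-1"

ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" \
  "sudo apt-get update && sudo apt-get -y upgrade && sudo reboot" \
  || echo "could not reach $node"
```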
### Step 5: Post-maintenance recovery (only for temporary maintenance)

If the maintenance is temporary, restore the node after the maintenance is complete:

1. Confirm the node's health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    After the node enters the `Ready` state, proceed to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction from the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods have returned to normal operation:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    After the Pods return to normal operation, the maintenance operation is complete.

For long-term maintenance or permanent node removal, this step is not required.

en/restart-a-tidb-cluster.md

Lines changed: 14 additions & 4 deletions
@@ -40,8 +40,18 @@ For a TiKV Pod, specify the `--grace-period` option when deleting the Pod to pro
     kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
     ```

-For other component Pods, you can delete them directly, because TiDB Operator will automatically handle a graceful restart:
+For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding Instance CR. Taking PD as an example:

-```shell
-kubectl -n ${namespace} delete pod ${pod_name}
-```
+1. First, query the corresponding PD Instance CR through the Pod:
+
+    ```shell
+    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
+    ```
+
+2. Add a new label to the PD instance, for example:
+
+    ```shell
+    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12_00
+    ```
+
+3. If the PD is the leader, TiDB Operator will migrate the leader to another PD before restarting the PD Pod.

zh/TOC.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
 - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
 - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- [Maintain the Kubernetes Nodes that Hold the TiDB Cluster](maintain-a-kubernetes-node.md)
 - Reference
 - Architecture
 - [TiDB Operator Architecture](architecture.md)

zh/maintain-a-kubernetes-node.md

Lines changed: 133 additions & 0 deletions
---
title: Maintain Kubernetes Nodes that Hold the TiDB Cluster
summary: Learn how to maintain the Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes that Hold the TiDB Cluster

TiDB is a highly available database that keeps running even when some of its database nodes go offline, so you can safely shut down and maintain the underlying Kubernetes nodes.

This document describes in detail how to perform maintenance operations on Kubernetes nodes. Different operation strategies are provided based on the maintenance duration and the storage type.

## Prerequisites

- [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are sufficient to run the TiDB cluster.

## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable, which prevents new Pods from being scheduled to it:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether there are any TiDB cluster component Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

### Step 2: Migrate TiDB cluster component Pods

Choose the appropriate Pod migration strategy based on your storage type:

#### Option A: Reschedule Pods (for automatically migratable storage)

If you use storage that can be automatically migrated (such as [Amazon EBS](https://aws.amazon.com/cn/ebs/)), you can refer to [Gracefully restart a single Pod of a component](restart-a-tidb-cluster.md) to reschedule the component Pods. Using the PD component as an example:

1. Check the PD Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
    ```

2. Check the instance name corresponding to the PD Pod:

    ```shell
    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
    ```

3. Add a new label to the PD instance to trigger rescheduling:

    ```shell
    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12_00
    ```

4. Confirm that the PD Pod has been successfully scheduled to another node:

    ```shell
    watch kubectl -n ${namespace} get pod -o wide
    ```

5. Repeat the preceding steps for the other components (TiKV, TiDB, and so on) until all TiDB cluster component Pods on the node to be maintained have been migrated.

#### Option B: Recreate instances (for local storage)

If the node storage cannot be automatically migrated (for example, local storage), you need to recreate the instances:

> **Warning:**
>
> Recreating an instance causes its data on the node to be lost. For stateful components such as TiKV, make sure that the cluster has enough replicas to guarantee data safety.

Using recreating a TiKV instance as an example:

1. Delete the TiKV instance CR. TiDB Operator deletes the associated PVC, ConfigMap, and other resources, and automatically creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait until the newly created TiKV instance becomes Ready:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After confirming that the TiDB cluster status is normal and data replication is complete, you can continue to maintain the other components.

### Step 3: Confirm that the migration is complete

At this point, only Pods managed by DaemonSets (such as network plugins and monitoring agents) should remain on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### Step 4: Perform node maintenance

You can now safely perform node maintenance operations, such as restarting the node, updating the system, or servicing hardware.

### Step 5: Post-maintenance recovery (only for temporary maintenance)

If the maintenance is temporary, restore the node after the maintenance is complete:

1. Confirm the node's health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    After the node enters the `Ready` state, proceed to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction from the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods have returned to normal operation:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    After the Pods return to normal operation, the maintenance operation is complete.

For long-term maintenance or permanent node removal, this step is not required.

zh/restart-a-tidb-cluster.md

Lines changed: 15 additions & 5 deletions
@@ -34,14 +34,24 @@ spec:

 You can restart a specific Pod in the TiDB cluster individually. The operation differs slightly between components.

-For a TiKV Pod, to ensure there is enough time to evict the Region leaders, specify the `--grace-period` option when deleting the Pod; otherwise the operation may fail. The following example sets a 60-second grace period for a TiKV Pod:
+For a TiKV Pod, to ensure there is enough time to evict the Region Leaders, specify the `--grace-period` option when deleting the Pod; otherwise the operation may fail. The following example sets a 60-second grace period for a TiKV Pod:

     ```shell
     kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
     ```

-Pods of other components can be deleted directly; TiDB Operator automatically restarts them gracefully:
+For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding instance (Instance CR). Taking PD as an example:

-```shell
-kubectl -n ${namespace} delete pod ${pod_name}
-```
+1. First, find the PD Instance CR that corresponds to the Pod:
+
+    ```shell
+    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
+    ```
+
+2. Add a new label to the PD instance, for example:
+
+    ```shell
+    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12_00
+    ```
+
+3. If the PD is the Leader, TiDB Operator migrates the Leader to another PD before restarting the PD Pod.

0 commit comments
