
Commit 7890a5e

Add a doc about maintaining a node (#2910)
1 parent a678898 commit 7890a5e

6 files changed: +307 −9 lines changed

en/TOC.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
 - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
 - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- [Maintain Kubernetes Nodes](maintain-a-kubernetes-node.md)
 - Troubleshoot
   - [Troubleshooting Tips](tips.md)
   - [Deployment Failures](deploy-failures.md)

en/maintain-a-kubernetes-node.md

Lines changed: 138 additions & 0 deletions
---
title: Maintain Kubernetes Nodes That Hold the TiDB Cluster
summary: Learn how to maintain Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes That Hold the TiDB Cluster

TiDB is a highly available database that can run smoothly when some of the database nodes go offline. Therefore, you can safely shut down and maintain the Kubernetes nodes that host TiDB clusters.

This document describes how to perform maintenance operations on Kubernetes nodes based on maintenance duration and storage type.

## Prerequisites

- Install [`kubectl`](https://kubernetes.io/docs/tasks/tools/).

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are enough for running the TiDB cluster.

## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable to prevent new Pods from being scheduled to this node:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

    - If the node has TiDB cluster component Pods, follow the subsequent steps in this document to migrate these Pods.
    - If the node does not have any TiDB cluster component Pods, there is no need to migrate Pods, and you can proceed directly with node maintenance.
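The check in step 2 can also be scripted. The following is a minimal sketch (the helper name `node_is_clear` is illustrative, not part of TiDB Operator) that succeeds only when no Operator-managed Pods remain on the node:

```shell
#!/bin/bash
# Succeed (exit 0) only if no TiDB Operator-managed Pods remain on the
# given node. Uses the same label selector as the check above.
node_is_clear() {
  local node_name="$1"
  local remaining
  remaining=$(kubectl get pod --all-namespaces -o wide \
    -l pingcap.com/managed-by=tidb-operator | grep -c "${node_name}")
  [ "${remaining}" -eq 0 ]
}
```

You could run this helper before and after migration; a non-zero count means some component Pods still need to be migrated.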
38+
39+
### Step 2: Migrate TiDB cluster component Pods
40+
41+
Based on the storage type of the Kubernetes node, choose the corresponding Pod migration strategy:
42+
43+
- **Automatically migratable storage**: use [Method 1: Reschedule Pods](#method-1-reschedule-pods-for-automatically-migratable-storage).
44+
- **Non-automatically migratable storage**: use [Method 2: Recreate instances](#method-2-recreate-instances-for-local-storage).
45+
46+
#### Method 1: Reschedule Pods (for automatically migratable storage)
47+
48+
If you use storage that supports automatic migration (such as [Amazon EBS](https://aws.amazon.com/ebs/)), you can reschedule component Pods by following [Perform a graceful restart of a single Pod in a component](restart-a-tidb-cluster.md#perform-a-graceful-restart-of-a-single-pod-in-a-component). The following instructions take rescheduling PD Pods as an example:
49+
50+
1. Check the PD Pod on the node to be maintained:
51+
52+
```shell
53+
kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
54+
```
55+
56+
2. Get the instance name of the PD Pod:
57+
58+
```shell
59+
kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
60+
```
61+
62+
3. Add a new label to the PD instance to trigger rescheduling:
63+
64+
```shell
65+
kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
66+
```
67+
68+
4. Confirm that the PD Pod is successfully scheduled to another node:
69+
70+
```shell
71+
watch kubectl -n ${namespace} get pod -o wide
72+
```
73+
74+
5. Follow the same steps to migrate Pods of other components such as TiKV and TiDB until all TiDB cluster component Pods on the node are migrated.
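The per-Pod loop in the steps above can be sketched as a single helper. This is an illustrative script, not part of the official docs: it assumes every Operator-managed Pod carries `pingcap.com/component` and `pingcap.com/instance` labels, as the commands above suggest:

```shell
#!/bin/bash
# For each TiDB Operator-managed Pod on the node, read the component kind
# and instance name from the Pod's labels, then apply a restart label to
# the instance so that TiDB Operator reschedules the Pod elsewhere.
relabel_instances_on_node() {
  local node_name="$1" ts="$2"
  kubectl get pod --all-namespaces -l pingcap.com/managed-by=tidb-operator \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.metadata.namespace}{" "}{.metadata.labels.pingcap\.com/component}{" "}{.metadata.labels.pingcap\.com/instance}{"\n"}{end}' |
  while read -r node ns component instance; do
    [ "${node}" = "${node_name}" ] || continue
    kubectl label "${component}" -n "${ns}" "${instance}" "pingcap.com/restartedAt=${ts}"
  done
}
```

For example: `relabel_instances_on_node ${node_name} 2025-06-30T12:00`. After running it, still confirm each Pod lands on another node before maintaining the machine.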
#### Method 2: Recreate instances (for local storage)

If the node uses storage that cannot be automatically migrated (such as local storage), you need to recreate instances.

> **Warning:**
>
> Recreating instances causes data loss. For stateful components such as TiKV, ensure that the cluster has sufficient replicas to guarantee data safety.

The following instructions take recreating a TiKV instance as an example:

1. Delete the CR of the TiKV instance. TiDB Operator automatically deletes the associated PVC and ConfigMap resources, and creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait for the status of the newly created TiKV instance to become `Ready`:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After you confirm that the TiDB cluster status is normal and data synchronization is completed, continue to maintain other components.
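The wait in step 2 can be automated with a simple polling loop. The sketch below uses illustrative names, and the position of the readiness column in `kubectl get tikv` output is an assumption you should verify against your cluster:

```shell
#!/bin/bash
# Poll a TiKV instance CR until it reports Ready, or give up after the
# timeout (seconds). The awk field assumes the second column is the
# readiness status; verify against your `kubectl get tikv` output.
wait_tikv_ready() {
  local ns="$1" name="$2" timeout="${3:-300}" elapsed=0
  while [ "${elapsed}" -lt "${timeout}" ]; do
    if kubectl get -n "${ns}" tikv "${name}" --no-headers 2>/dev/null \
        | awk '{print $2}' | grep -q 'True'; then
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  return 1
}
```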
### Step 3: Confirm migration completion

After you complete Pod migration, only the Pods managed by DaemonSets (such as network plugins and monitoring agents) should be running on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting it, updating the operating system, or performing hardware maintenance.

### Step 5: Recover after maintenance (for temporary maintenance only)

If you plan to perform long-term maintenance or permanently take the node offline, skip this step.

For temporary maintenance, perform the following recovery operations after the node maintenance is completed:

1. Check the node health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    When the node status becomes `Ready`, continue to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction on the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods are running normally:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    When all Pods are running normally, the maintenance operation is completed.
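The recovery steps above can be combined into one helper. A minimal sketch (the function name is illustrative); it assumes the default `STATUS` column of `kubectl get node`:

```shell
#!/bin/bash
# Wait until the node reports Ready, then lift the scheduling restriction.
recover_node() {
  local node_name="$1"
  until kubectl get node "${node_name}" --no-headers \
      | awk '{print $2}' | grep -q '^Ready$'; do
    sleep 10
  done
  kubectl uncordon "${node_name}"
}
```

After it returns, still check that the Pods on the node come back to the `Running` state, as step 3 describes.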

en/restart-a-tidb-cluster.md

Lines changed: 14 additions & 4 deletions
@@ -40,8 +40,18 @@ For a TiKV Pod, specify the `--grace-period` option when deleting the Pod to pro
     kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
     ```
 
-For other component Pods, you can delete them directly, because TiDB Operator will automatically handle a graceful restart:
+For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding Instance CR. The following uses the PD component as an example:
 
-```shell
-kubectl -n ${namespace} delete pod ${pod_name}
-```
+1. Query the PD Instance CR from the Pod:
+
+    ```shell
+    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
+    ```
+
+2. Add a new label to the PD instance to trigger a restart. For example:
+
+    ```shell
+    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
+    ```
+
+3. If this PD instance is the leader, TiDB Operator first transfers the leader role to another PD instance and then restarts the Pod.
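Instead of hard-coding the timestamp, you can generate the label value from the current time. A sketch (the helper name is illustrative, not from the docs); note that Kubernetes label values only allow alphanumerics, `-`, `_`, and `.`, so this sketch uses a colon-free time format:

```shell
#!/bin/bash
# Trigger a graceful restart of a PD instance by applying a restartedAt
# label derived from the current UTC time (colon-free, to satisfy the
# Kubernetes label-value character restrictions).
restart_pd_instance() {
  local ns="$1" instance="$2"
  local ts
  ts=$(date -u +%Y-%m-%dT%H-%M)
  kubectl label pd -n "${ns}" "${instance}" "pingcap.com/restartedAt=${ts}"
}
```

Usage: `restart_pd_instance ${namespace} ${pd_instance_name}`.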

zh/TOC.md

Lines changed: 1 addition & 0 deletions
@@ -45,6 +45,7 @@
 - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
 - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
 - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- [Maintain Kubernetes Nodes That Hold the TiDB Cluster](maintain-a-kubernetes-node.md)
 - Troubleshoot
   - [Troubleshooting Tips](tips.md)
   - [Deployment Failures](deploy-failures.md)

zh/maintain-a-kubernetes-node.md

Lines changed: 138 additions & 0 deletions
---
title: Maintain Kubernetes Nodes That Hold the TiDB Cluster
summary: Describes how to maintain the Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes That Hold the TiDB Cluster

TiDB is a highly available database: the cluster keeps running normally even when some of its nodes go offline. Therefore, you can safely shut down and maintain the Kubernetes nodes that host TiDB clusters.

This document describes how to safely maintain Kubernetes nodes under different storage types and maintenance durations.

## Prerequisites

- Install [`kubectl`](https://kubernetes.io/zh-cn/docs/tasks/tools/)

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are enough for the TiDB cluster to run normally.

## Node maintenance steps

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable, preventing new Pods from being scheduled to it:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

    - If the node has TiDB cluster component Pods, follow the subsequent steps to migrate these Pods.
    - If the node does not have any TiDB cluster component Pods, there is no need to migrate Pods, and you can proceed directly with node maintenance.

### Step 2: Migrate TiDB cluster component Pods

Based on the storage type of the Kubernetes node, choose the corresponding Pod migration strategy:

- **Automatically migratable storage**: use [Method 1: Reschedule Pods](#method-1-reschedule-pods-for-automatically-migratable-storage)
- **Non-automatically migratable storage**: use [Method 2: Recreate instances](#method-2-recreate-instances-for-local-storage)

#### Method 1: Reschedule Pods (for automatically migratable storage)

If the storage used by the Kubernetes node supports automatic migration (such as [Amazon EBS](https://aws.amazon.com/cn/ebs/)), you can reschedule each component Pod by following [Perform a graceful restart of a single Pod in a component](restart-a-tidb-cluster.md#优雅重启某个组件的单个-pod). The following takes the PD component as an example:

1. Check the PD Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
    ```

2. Get the instance name corresponding to the PD Pod:

    ```shell
    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
    ```

3. Add a new label to the PD instance to trigger rescheduling:

    ```shell
    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
    ```

4. Confirm that the PD Pod has been successfully scheduled to another node:

    ```shell
    watch kubectl -n ${namespace} get pod -o wide
    ```

5. Follow the same steps to migrate other component Pods such as TiKV and TiDB, until all TiDB cluster component Pods on the node to be maintained are migrated.

#### Method 2: Recreate instances (for local storage)

If the storage used by the Kubernetes node does not support automatic migration (such as local storage), you need to recreate instances.

> **Warning:**
>
> Recreating instances causes data loss. For stateful components such as TiKV, ensure that the cluster has sufficient replicas to guarantee data safety.

The following takes recreating a TiKV instance as an example:

1. Delete the CR of the TiKV instance. TiDB Operator automatically deletes the associated resources such as PVCs and ConfigMaps, and creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait for the status of the newly created TiKV instance to become `Ready`:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After you confirm that the TiDB cluster status is normal and data synchronization is completed, continue to maintain other components.

### Step 3: Confirm migration completion

After Pod migration is completed, only Pods managed by DaemonSets (such as network plugins and monitoring agents) should be running on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting it, updating the operating system, or performing hardware maintenance.

### Step 5: Recover after maintenance (for temporary maintenance only)

If you plan to perform long-term maintenance or permanently take the node offline, skip this step.

For temporary maintenance, perform the following recovery operations after the node maintenance is completed:

1. Check the node health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    When the node status becomes `Ready`, continue to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction on the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods have returned to normal operation:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    When all Pods are running normally, the maintenance operation is completed.

zh/restart-a-tidb-cluster.md

Lines changed: 15 additions & 5 deletions
@@ -34,14 +34,24 @@ spec:
 
 You can restart a specific Pod in the TiDB cluster individually. The operations differ slightly for Pods of different components.
 
-For a TiKV Pod, to ensure there is enough time to evict Region leaders, specify the `--grace-period` option when deleting the Pod; otherwise, the operation might fail. The following example sets a 60-second grace period for a TiKV Pod:
+For a TiKV Pod, to ensure there is enough time to evict Region Leaders, specify the `--grace-period` option when deleting the Pod; otherwise, the operation might fail. The following example sets a 60-second grace period for a TiKV Pod:
 
 ```shell
 kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
 ```
 
-Pods of other components can be deleted directly; TiDB Operator gracefully restarts them automatically:
+For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding instance (Instance CR). The following takes PD as an example:
 
-```shell
-kubectl -n ${namespace} delete pod ${pod_name}
-```
+1. Query the PD Instance CR corresponding to the Pod:
+
+    ```shell
+    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
+    ```
+
+2. Add a new label to the PD instance to trigger a restart. For example:
+
+    ```shell
+    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
+    ```
+
+3. If this PD instance is the Leader, TiDB Operator first transfers the Leader role to another PD instance and then restarts the Pod.
