1 change: 1 addition & 0 deletions en/TOC.md
@@ -45,6 +45,7 @@
- [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
- [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
- [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
- [Maintain Kubernetes Nodes](maintain-a-kubernetes-node.md)
- Troubleshoot
- [Deployment Failures](deploy-failures.md)
- [Cluster Exceptions](exceptions.md)
138 changes: 138 additions & 0 deletions en/maintain-a-kubernetes-node.md
@@ -0,0 +1,138 @@
---
title: Maintain Kubernetes Nodes That Hold the TiDB Cluster
summary: Learn how to maintain Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes That Hold the TiDB Cluster

TiDB is a highly available database that can run smoothly when some of the database nodes go offline. Therefore, you can safely shut down and maintain the Kubernetes nodes that host TiDB clusters.

This document describes how to perform maintenance operations on Kubernetes nodes based on maintenance duration and storage type.

## Prerequisites

- Install [`kubectl`](https://kubernetes.io/docs/tasks/tools/).

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are enough for running the TiDB cluster.

## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable to prevent new Pods from being scheduled to this node:

```shell
kubectl cordon ${node_name}
```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

```shell
kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
```

- If the node has TiDB cluster component Pods, follow the subsequent steps in this document to migrate these Pods.
- If the node does not have any TiDB cluster component Pods, there is no need to migrate Pods, and you can proceed directly with node maintenance.
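
Filtering with `grep` also matches any node whose name merely contains `${node_name}` (for example, `node-1` matches `node-10`). As an alternative, the following sketch selects Pods by node on the server side with a field selector. The function name `tidb_pods_on_node` is illustrative and not part of TiDB Operator; the label selector is the one used in the preceding command.

```shell
# List TiDB Operator managed Pods scheduled on exactly one node.
# Usage: tidb_pods_on_node <node_name>
# Prints one line per Pod; prints nothing on stdout if the node has none.
tidb_pods_on_node() {
    kubectl get pod --all-namespaces -o wide --no-headers \
        -l pingcap.com/managed-by=tidb-operator \
        --field-selector "spec.nodeName=$1"
}
```

If `tidb_pods_on_node "${node_name}"` prints nothing, the node has no TiDB cluster component Pods to migrate.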

### Step 2: Migrate TiDB cluster component Pods

Based on the storage type of the Kubernetes node, choose the corresponding Pod migration strategy:

- **Automatically migratable storage**: use [Method 1: Reschedule Pods](#method-1-reschedule-pods-for-automatically-migratable-storage).
- **Non-automatically migratable storage**: use [Method 2: Recreate instances](#method-2-recreate-instances-for-local-storage).

#### Method 1: Reschedule Pods (for automatically migratable storage)

If you use storage that supports automatic migration (such as [Amazon EBS](https://aws.amazon.com/ebs/)), you can reschedule component Pods by following [Perform a graceful restart of a single Pod in a component](restart-a-tidb-cluster.md#perform-a-graceful-restart-of-a-single-pod-in-a-component). The following instructions take rescheduling PD Pods as an example:

1. Check the PD Pod on the node to be maintained:

```shell
kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
```

2. Get the instance name of the PD Pod:

```shell
kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
```

3. Add a new label to the PD instance to trigger rescheduling:

```shell
kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12-00
```

4. Confirm that the PD Pod is successfully scheduled to another node:

```shell
watch kubectl -n ${namespace} get pod -o wide
```

5. Follow the same steps to migrate Pods of other components such as TiKV and TiDB until all TiDB cluster component Pods on the node are migrated.
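
Kubernetes label values can contain only alphanumeric characters, `-`, `_`, and `.`, so a timestamp used as a label value cannot include colons. The following is a minimal sketch that generates such a timestamp and applies it to a PD instance. The function names `restart_stamp` and `reschedule_pd` are hypothetical, and `--overwrite` is added on the assumption that the label might already exist from an earlier restart.

```shell
# Print a UTC timestamp that is a valid Kubernetes label value
# (hyphens instead of colons), for example 2025-06-30T12-00-00Z.
restart_stamp() {
    date -u +%Y-%m-%dT%H-%M-%SZ
}

# Label one PD instance CR to trigger rescheduling.
# Usage: reschedule_pd <namespace> <pd_instance_name>
# The parentheses run the body in a subshell so variables stay local.
reschedule_pd() (
    namespace=$1
    pd_instance_name=$2
    kubectl label pd -n "${namespace}" "${pd_instance_name}" \
        "pingcap.com/restartedAt=$(restart_stamp)" --overwrite
)
```

Generating the value at call time means that each run changes the label, even when you reschedule the same instance more than once.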

#### Method 2: Recreate instances (for local storage)

If the node uses storage that cannot be automatically migrated (such as local storage), you need to recreate instances.

> **Warning:**
>
> Recreating an instance deletes the data stored on that instance, because TiDB Operator removes the associated PVC. For stateful components such as TiKV, make sure that the cluster has enough replicas to keep the data safe before you proceed.

The following instructions take recreating a TiKV instance as an example:

1. Delete the CR of the TiKV instance. TiDB Operator automatically deletes the associated PVC and ConfigMap resources, and creates a new instance:

```shell
kubectl delete -n ${namespace} tikv ${tikv_instance_name}
```

2. Wait for the status of the newly created TiKV instance to become `Ready`:

```shell
kubectl get -n ${namespace} tikv ${tikv_instance_name}
```

3. After you confirm that the TiDB cluster status is normal and data synchronization is completed, continue to maintain other components.
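
Instead of re-running the `kubectl get` command by hand, you can poll until the new instance is ready. The following is a sketch only: `wait_until` and `tikv_is_ready` are hypothetical helper names, and the readiness check assumes that the TiKV instance CR reports a `Ready` condition under `.status.conditions`. Verify this against the output of `kubectl get tikv -o yaml` in your cluster before relying on it.

```shell
# Re-run a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <interval_seconds> <command> [args...]
wait_until() (
    attempts=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            exit 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    exit 1
)

# Assumption: the TiKV instance CR exposes a "Ready" condition.
# Usage: tikv_is_ready <namespace> <tikv_instance_name>
tikv_is_ready() (
    status=$(kubectl get -n "$1" tikv "$2" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ]
)
```

For example, `wait_until 60 5 tikv_is_ready "${namespace}" "${tikv_instance_name}"` checks every 5 seconds for roughly five minutes and returns a non-zero exit code if the instance is still not ready.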

### Step 3: Confirm migration completion

After you complete Pod migration, only the Pods managed by DaemonSet (such as network plugins and monitoring agents) should be running on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```
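
To make leftover Pods easier to spot, you can print each Pod's owner kind and hide the Pods owned by a DaemonSet. The following is a minimal sketch; the function name `non_daemonset_pods_on_node` is illustrative, and it inspects only the first owner reference of each Pod.

```shell
# List Pods on a node that are not owned by a DaemonSet.
# Usage: non_daemonset_pods_on_node <node_name>
# Output columns: namespace, Pod name, owner kind (<none> if unowned).
non_daemonset_pods_on_node() {
    kubectl get pod --all-namespaces --no-headers \
        --field-selector "spec.nodeName=$1" \
        -o 'custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' |
        awk '$3 != "DaemonSet"'
}
```

Any TiDB cluster component Pod that appears in the output still needs to be migrated before you start maintenance.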

### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting, updating the operating system, or performing hardware maintenance.

### Step 5: Recover after maintenance (for temporary maintenance only)

If you plan to perform long-term maintenance or permanently take the node offline, skip this step.

For temporary maintenance, perform the following recovery operations after the node maintenance is completed:

1. Check the node health status:

```shell
watch kubectl get node ${node_name}
```

When the node status becomes `Ready`, continue to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction on the node:

```shell
kubectl uncordon ${node_name}
```

3. Check whether all Pods are running normally:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

When all Pods are running normally, the maintenance operation is completed.
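
The first two recovery steps can also be combined so that the node is uncordoned only after it reports `Ready`. The following sketch uses `kubectl wait` instead of `watch`; the function name `recover_node` and the default timeout of 10 minutes are assumptions chosen for illustration.

```shell
# Wait for a node to become Ready, then make it schedulable again.
# Usage: recover_node <node_name> [timeout]
# If the wait times out, the node stays cordoned and the function fails.
recover_node() (
    node_name=$1
    timeout=${2:-10m}
    kubectl wait --for=condition=Ready "node/${node_name}" \
        --timeout="${timeout}" &&
        kubectl uncordon "${node_name}"
)
```

After `recover_node "${node_name}"` succeeds, check the Pods on the node as described in step 3.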
18 changes: 14 additions & 4 deletions en/restart-a-tidb-cluster.md
@@ -40,8 +40,18 @@ For a TiKV Pod, specify the `--grace-period` option when deleting the Pod to pro
kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
```

For other component Pods, you can delete them directly, because TiDB Operator will automatically handle a graceful restart:
For Pods of other components, you can perform a graceful restart by adding a label or annotation to the corresponding Instance CR. The following uses the PD component as an example:

```shell
kubectl -n ${namespace} delete pod ${pod_name}
```
1. Query the PD Instance CR from the Pod:

```shell
kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
```

2. Add a new label to the PD instance to trigger a restart. For example:

```shell
kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12-00
```

3. If this PD instance is the leader, TiDB Operator first transfers the leader role to another PD instance and then restarts the Pod.
1 change: 1 addition & 0 deletions zh/TOC.md
@@ -45,6 +45,7 @@
- [挂起和恢复 TiDB 集群](suspend-tidb-cluster.md)
- [重启 TiDB 集群](restart-a-tidb-cluster.md)
- [销毁 TiDB 集群](destroy-a-tidb-cluster.md)
- [维护 TiDB 集群所在的 Kubernetes 节点](maintain-a-kubernetes-node.md)
- 故障诊断
- [部署错误](deploy-failures.md)
- [集群异常](exceptions.md)
138 changes: 138 additions & 0 deletions zh/maintain-a-kubernetes-node.md
@@ -0,0 +1,138 @@
---
title: 维护 TiDB 集群所在的 Kubernetes 节点
summary: 介绍如何维护 TiDB 集群所在的 Kubernetes 节点。
---

# 维护 TiDB 集群所在的 Kubernetes 节点

TiDB 是高可用数据库,即使部分节点下线,集群也能正常运行。因此,你可以安全地对 TiDB 集群所在的 Kubernetes 节点执行停机维护操作。

本文介绍在不同存储类型和维护时长下,如何安全地维护 Kubernetes 节点。

## 前提条件

- 安装 [`kubectl`](https://kubernetes.io/zh-cn/docs/tasks/tools/)

> **注意:**
>
> 维护节点前,请确保 Kubernetes 集群的剩余资源足以支撑 TiDB 集群的正常运行。

## 维护节点步骤

### 第 1 步:准备工作

1. 使用 `kubectl cordon` 命令将待维护节点标记为不可调度,防止新的 Pod 被调度到该节点:

```shell
kubectl cordon ${node_name}
```

2. 检查待维护节点上是否运行 TiDB 集群组件的 Pod:

```shell
kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
```

- 如果节点上存在 TiDB 集群组件的 Pod,请按照后续步骤迁移这些 Pod。
- 如果节点上没有 TiDB 集群组件的 Pod,则无需迁移 Pod,可直接进行节点维护。

### 第 2 步:迁移 TiDB 集群组件 Pod

根据 Kubernetes 节点的存储类型,选择相应的 Pod 迁移策略:

- **可自动迁移存储**:使用[方法 1:重调度 Pod](#方法-1重调度-pod适用于可自动迁移的存储)
- **不可自动迁移存储**:使用[方法 2:重建实例](#方法-2重建实例适用于本地存储)

#### 方法 1:重调度 Pod(适用于可自动迁移的存储)

如果 Kubernetes 节点使用的存储支持自动迁移(如 [Amazon EBS](https://aws.amazon.com/cn/ebs/)),可以通过[优雅重启某个组件的单个 Pod](restart-a-tidb-cluster.md#优雅重启某个组件的单个-pod) 的方式重调度各个组件 Pod。以 PD 组件为例:

1. 查看待维护节点上的 PD Pod:

```shell
kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
```

2. 获取该 PD Pod 对应的实例名称:

```shell
kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
```

3. 为该 PD 实例添加一个新标签以触发重调度:

```shell
kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12-00
```

4. 确认该 PD Pod 已成功调度到其他节点:

```shell
watch kubectl -n ${namespace} get pod -o wide
```

5. 按相同步骤迁移 TiKV、TiDB 等其他组件 Pod,直至该维护节点上的所有 TiDB 集群组件 Pod 都迁移完成。

#### 方法 2:重建实例(适用于本地存储)

如果 Kubernetes 节点使用的存储不支持自动迁移(如本地存储),你需要重建实例。

> **警告:**
>
> 重建实例会导致数据丢失。对于 TiKV 等有状态组件,请确保集群副本数充足,以保障数据安全。

以重建 TiKV 实例为例:

1. 删除 TiKV 实例的 CR。TiDB Operator 会自动删除其关联的 PVC 和 ConfigMap 等资源,并创建新实例:

```shell
kubectl delete -n ${namespace} tikv ${tikv_instance_name}
```

2. 等待新创建的 TiKV 实例状态变为 `Ready`:

```shell
kubectl get -n ${namespace} tikv ${tikv_instance_name}
```

3. 确认 TiDB 集群状态正常且数据同步完成后,再继续维护其他组件。

### 第 3 步:确认迁移完成

完成 Pod 迁移后,该节点上应仅运行由 DaemonSet 管理的 Pod(如网络插件、监控代理等):

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

### 第 4 步:执行节点维护

现在,你可以安全地对节点执行维护操作,例如重启、更新操作系统或进行硬件维护。

### 第 5 步:维护后恢复(仅适用于临时维护)

如果计划长期维护或永久下线节点,请跳过此步骤。

对于临时维护,节点维护完成后需要执行以下恢复操作:

1. 确认节点健康状态:

```shell
watch kubectl get node ${node_name}
```

当节点状态变为 `Ready` 后,继续下一步。

2. 使用 `kubectl uncordon` 命令解除节点的调度限制:

```shell
kubectl uncordon ${node_name}
```

3. 观察 Pod 是否全部恢复正常运行:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```

当所有 Pod 正常运行后,维护操作完成。
20 changes: 15 additions & 5 deletions zh/restart-a-tidb-cluster.md
@@ -34,14 +34,24 @@ spec:

你可以单独重启 TiDB 集群中的特定 Pod。不同组件的 Pod,操作略有不同。

对于 TiKV Pod,为确保有足够时间驱逐 Region leader,在删除 Pod 时需要指定 `--grace-period` 选项,否则操作可能失败。以下示例为 TiKV Pod 设置了 60 秒的宽限期:
对于 TiKV Pod,为确保有足够时间驱逐 Region Leader,在删除 Pod 时需要指定 `--grace-period` 选项,否则操作可能失败。以下示例为 TiKV Pod 设置了 60 秒的宽限期:

```shell
kubectl -n ${namespace} delete pod ${pod_name} --grace-period=60
```

其他组件的 Pod 可以直接删除,TiDB Operator 会自动优雅重启这些 Pod
对于其他组件的 Pod,可以通过给 Pod 对应的实例 (Instance CR) 添加标签或注解的方式实现优雅重启。以 PD 为例

```shell
kubectl -n ${namespace} delete pod ${pod_name}
```
1. 根据 Pod 查询对应的 PD Instance CR:

```shell
kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
```

2. 给该 PD 实例添加新标签以触发重启,例如:

```shell
kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12-00
```

3. 如果该 PD 为 Leader,TiDB Operator 会先将 Leader 迁移到其他 PD,再重启该 Pod。