Add a doc about maintaining a node #2910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
---
title: Maintain Kubernetes Nodes That Hold the TiDB Cluster
summary: Learn how to maintain Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes That Hold the TiDB Cluster

TiDB is a highly available database that can run smoothly when some of the database nodes go offline. Therefore, you can safely shut down and maintain the Kubernetes nodes that host TiDB clusters.

This document describes how to perform maintenance operations on Kubernetes nodes based on the maintenance duration and storage type.
## Prerequisites

- Install [`kubectl`](https://kubernetes.io/docs/tasks/tools/).

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are enough for running the TiDB cluster.
## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable, which prevents new Pods from being scheduled to this node:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

    - If the node has TiDB cluster component Pods, follow the subsequent steps in this document to migrate these Pods.
    - If the node does not have any TiDB cluster component Pods, there is no need to migrate Pods, and you can proceed directly with node maintenance.
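The decision in step 2 can also be scripted. The following self-contained sketch uses hard-coded sample output in place of the live `kubectl` query; all namespace, Pod, and node names in it are made up for illustration:

```shell
# Decide whether Pod migration is needed, based on whether any TiDB component
# Pods run on the node. sample_pods stands in for the output of the kubectl
# command above; all names below are illustrative.
node_name="node-1"
sample_pods='tidb-ns   basic-pd-0    1/1  Running  0  5d  10.0.0.1  node-1
tidb-ns   basic-tikv-0  1/1  Running  0  5d  10.0.0.2  node-2'

# grep -w avoids matching node-1 inside longer names; -c counts matching lines.
count=$(printf '%s\n' "$sample_pods" | grep -cw "$node_name")
if [ "$count" -gt 0 ]; then
  echo "migrate $count Pod(s) before maintenance"
else
  echo "no TiDB component Pods on $node_name; proceed with maintenance"
fi
```

With the sample data above, one Pod (`basic-pd-0`) is found on `node-1`, so migration is required.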
### Step 2: Migrate TiDB cluster component Pods

Based on the storage type of the Kubernetes node, choose the corresponding Pod migration strategy:

- **Automatically migratable storage**: use [Method 1: Reschedule Pods](#method-1-reschedule-pods-for-automatically-migratable-storage).
- **Non-automatically migratable storage**: use [Method 2: Recreate instances](#method-2-recreate-instances-for-local-storage).
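Whether a volume can follow its Pod to another node is determined by the StorageClass provisioner behind it. As a hedged illustration, the decision can be sketched as follows; the provisioner names are common real-world examples, not an exhaustive or authoritative list:

```shell
# Map a StorageClass provisioner name to a migration method. The provisioner
# names here are illustrative examples of network-attached vs. local storage.
classify_storage() {
  case "$1" in
    ebs.csi.aws.com|pd.csi.storage.gke.io|disk.csi.azure.com)
      echo "Method 1: reschedule Pods (automatically migratable)";;
    *local*|kubernetes.io/no-provisioner)
      echo "Method 2: recreate instances (local storage)";;
    *)
      echo "unknown: check whether the volume can follow the Pod";;
  esac
}

classify_storage "ebs.csi.aws.com"
classify_storage "kubernetes.io/no-provisioner"
```

You can find the provisioner of a volume by checking its StorageClass, for example with `kubectl get storageclass`.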
#### Method 1: Reschedule Pods (for automatically migratable storage)

If you use storage that supports automatic migration (such as [Amazon EBS](https://aws.amazon.com/ebs/)), you can reschedule component Pods by following [Perform a graceful restart of a single Pod in a component](restart-a-tidb-cluster.md#perform-a-graceful-restart-of-a-single-pod-in-a-component). The following instructions take rescheduling PD Pods as an example:

1. Check the PD Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
    ```

2. Get the instance name of the PD Pod:

    ```shell
    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
    ```

3. Add a new label to the PD instance to trigger rescheduling:

    ```shell
    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
    ```

4. Confirm that the PD Pod is successfully scheduled to another node:

    ```shell
    watch kubectl -n ${namespace} get pod -o wide
    ```

5. Follow the same steps to migrate Pods of other components such as TiKV and TiDB until all TiDB cluster component Pods on the node are migrated.
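The label value in step 3 is only a timestamp marker. Instead of hardcoding it, you can generate the current UTC time in the same `YYYY-MM-DDTHH:MM` format; this is a sketch, with the `pingcap.com/restartedAt` key taken from the example above:

```shell
# Build the restart label value from the current UTC time, in the same
# YYYY-MM-DDTHH:MM format used in the example above.
restarted_at=$(date -u +%Y-%m-%dT%H:%M)
echo "pingcap.com/restartedAt=${restarted_at}"
```

Applying a fresh value each time ensures the label change is always seen as an update, which is what triggers the rescheduling.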
#### Method 2: Recreate instances (for local storage)

If the node uses storage that cannot be automatically migrated (such as local storage), you need to recreate instances.

> **Warning:**
>
> Recreating instances causes data loss. For stateful components such as TiKV, ensure that the cluster has sufficient replicas to guarantee data safety.

The following instructions take recreating a TiKV instance as an example:

1. Delete the CR of the TiKV instance. TiDB Operator automatically deletes the associated PVC and ConfigMap resources, and creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait for the status of the newly created TiKV instance to become `Ready`:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After you confirm that the TiDB cluster status is normal and data synchronization is completed, continue to maintain other components.
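TiKV replicates data through Raft, so the replica warning above reduces to quorum arithmetic: a Raft group with replication factor `r` tolerates `floor((r - 1) / 2)` simultaneous instance losses. A minimal sketch:

```shell
# Quorum arithmetic behind the replica warning: with replication factor
# `replicas`, the cluster tolerates floor((replicas - 1) / 2) losses.
replicas=3
tolerated=$(( (replicas - 1) / 2 ))
echo "with ${replicas} replicas, the cluster tolerates ${tolerated} instance loss"
```

With the default of 3 replicas, losing a single TiKV instance still leaves a majority, which is why recreating one instance at a time is safe.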
### Step 3: Confirm migration completion

After you complete Pod migration, only the Pods managed by DaemonSet (such as network plugins and monitoring agents) should be running on the node:

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```
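One way to make this check stricter is to compare each remaining Pod's owner kind against `DaemonSet`. The following is a self-contained sketch using hard-coded sample data; the Pod names and owner kinds are illustrative, standing in for what a Pod's `ownerReferences` would report:

```shell
# After migration, every Pod left on the node should be DaemonSet-managed.
# Each sample line pairs a Pod name with its owner kind; all names and kinds
# below are illustrative.
sample_owners='kube-proxy-abc12 DaemonSet
node-exporter-xyz9 DaemonSet
basic-tikv-0 StatefulSet'

# Print the names of any Pods whose owner is not a DaemonSet.
leftovers=$(printf '%s\n' "$sample_owners" | awk '$2 != "DaemonSet" {print $1}')
if [ -n "$leftovers" ]; then
  echo "still to migrate: $leftovers"
else
  echo "only DaemonSet-managed Pods remain; safe to maintain"
fi
```

In the sample above, `basic-tikv-0` is flagged as still needing migration.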
### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting, updating the operating system, or performing hardware maintenance.
### Step 5: Recover after maintenance (for temporary maintenance only)

If you plan to perform long-term maintenance or permanently take the node offline, skip this step.

For temporary maintenance, perform the following recovery operations after the node maintenance is completed:

1. Check the node health status:

    ```shell
    watch kubectl get node ${node_name}
    ```

    When the node status becomes `Ready`, continue to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction on the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods are running normally:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    When all Pods are running normally, the maintenance operation is completed.
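The `watch` commands in the recovery steps require manual observation. If you want to script the wait for the node to become `Ready`, the loop can be sketched as follows; the `node_ready` function here is a stub standing in for the real `kubectl get node` check:

```shell
# Wait for the node to become Ready without `watch`. node_ready is stubbed so
# the loop logic is self-contained; in practice it would run:
#   kubectl get node "${node_name}" | grep -qw Ready
node_ready() {
  [ "${attempt}" -ge 3 ]   # stub: report Ready on the third check
}

attempt=0
until node_ready; do
  attempt=$((attempt + 1))
  sleep 1   # in practice, poll every few seconds
done
echo "node is Ready after ${attempt} checks"
```

With the stub above, the loop exits on the third check; against a real cluster it exits as soon as the node reports `Ready`.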