---
title: Maintain Kubernetes Nodes that Hold the TiDB Cluster
summary: Learn how to maintain Kubernetes nodes that hold the TiDB cluster.
---

# Maintain Kubernetes Nodes that Hold the TiDB Cluster

TiDB is a highly available database that keeps running when some of its nodes go offline. Because of this, you can safely shut down and maintain the underlying Kubernetes nodes without interrupting TiDB services.

This document describes how to perform maintenance operations on Kubernetes nodes that host a TiDB cluster. Different strategies are provided based on the maintenance duration and the storage type.
## Prerequisites

- [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/)

> **Note:**
>
> Before you maintain a node, make sure that the remaining resources in the Kubernetes cluster are sufficient to run the TiDB cluster.

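To estimate whether the remaining nodes can absorb the migrated Pods, you can compare each node's allocatable resources with its current usage. The following is a minimal sketch; `kubectl top` assumes that the metrics-server add-on is installed in the cluster:

```shell
# List allocatable CPU and memory per node (capacity minus system reservations).
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory

# Show current resource usage per node (requires the metrics-server add-on).
kubectl top nodes
```
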
## Maintain a node

### Step 1: Preparation

1. Use the `kubectl cordon` command to mark the node to be maintained as unschedulable, which prevents new Pods from being scheduled to it:

    ```shell
    kubectl cordon ${node_name}
    ```

2. Check whether any TiDB cluster component Pods are running on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/managed-by=tidb-operator | grep ${node_name}
    ```

### Step 2: Migrate TiDB cluster component Pods

Choose the appropriate Pod migration strategy based on your storage type:

#### Option A: Reschedule Pods (for automatically migratable storage)

If the node storage can be automatically migrated (such as [Amazon EBS](https://aws.amazon.com/ebs/)), you can refer to [Gracefully restart a single Pod of a component](restart-a-tidb-cluster.md) to reschedule component Pods. The following steps use the PD component as an example:

1. Check the PD Pods on the node to be maintained:

    ```shell
    kubectl get pod --all-namespaces -o wide -l pingcap.com/component=pd | grep ${node_name}
    ```

2. Get the instance name of the PD Pod:

    ```shell
    kubectl get pod -n ${namespace} ${pod_name} -o jsonpath='{.metadata.labels.pingcap\.com/instance}'
    ```

3. Add a new label to the PD instance to trigger rescheduling:

    ```shell
    kubectl label pd -n ${namespace} ${pd_instance_name} pingcap.com/restartedAt=2025-06-30T12:00
    ```

4. Confirm that the PD Pod has been rescheduled to another node:

    ```shell
    watch kubectl -n ${namespace} get pod -o wide
    ```

5. Repeat the preceding steps for other components (such as TiKV and TiDB) until all TiDB cluster component Pods on the node to be maintained have been migrated.
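The rescheduling steps above can also be scripted. The following sketch generates the `restartedAt` label value from the current UTC time instead of hardcoding it; the namespace and instance name are hypothetical placeholders for the values you found in the previous steps:

```shell
namespace="tidb-cluster"        # placeholder: your cluster namespace
pd_instance_name="basic-pd-0"   # placeholder: instance name found in step 2

# Use the current UTC time so that repeated runs always change the label value.
restarted_at=$(date -u +%Y-%m-%dT%H:%M)

# --overwrite lets the command succeed even if the label was set by an earlier run.
kubectl label pd -n "${namespace}" "${pd_instance_name}" \
  --overwrite "pingcap.com/restartedAt=${restarted_at}"
```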

#### Option B: Recreate instances (for local storage)

If the node storage cannot be automatically migrated (such as local storage), you need to recreate the instances on the node:

> **Warning:**
>
> Recreating an instance deletes the data stored on it. For stateful components such as TiKV, make sure that the cluster has sufficient replicas to guarantee data safety before you proceed.

The following steps use recreating a TiKV instance as an example:

1. Delete the TiKV instance CR. TiDB Operator deletes its associated PVC, ConfigMap, and other resources, and automatically creates a new instance:

    ```shell
    kubectl delete -n ${namespace} tikv ${tikv_instance_name}
    ```

2. Wait for the status of the newly created TiKV instance to become ready:

    ```shell
    kubectl get -n ${namespace} tikv ${tikv_instance_name}
    ```

3. After confirming that the TiDB cluster is healthy and data replication is complete, you can continue to migrate other components.
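Instead of polling `kubectl get` manually, you can block until the new instance becomes ready with `kubectl wait`. This is a sketch: it assumes the `tikv` custom resource reports a `Ready` condition in `.status.conditions`, so verify the exact condition name for your TiDB Operator version before relying on it:

```shell
namespace="tidb-cluster"          # placeholder: your cluster namespace
tikv_instance_name="basic-tikv-0" # placeholder: the recreated instance

# Block for up to 10 minutes until the instance reports the assumed Ready condition.
kubectl wait -n "${namespace}" "tikv/${tikv_instance_name}" \
  --for=condition=Ready --timeout=10m
```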

### Step 3: Confirm migration completion

After the migration, the only Pods remaining on the node should be those managed by DaemonSets (such as network plugins and monitoring agents):

```shell
kubectl get pod --all-namespaces -o wide | grep ${node_name}
```
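As an alternative to filtering with `grep`, you can let the API server filter by node name and print each Pod's owning controller, which makes it easy to confirm that only DaemonSet-managed Pods remain. The node name below is a placeholder:

```shell
node_name="worker-1"  # placeholder: the node under maintenance

# List only Pods scheduled on the node, with the kind of their owning controller.
kubectl get pod --all-namespaces \
  --field-selector spec.nodeName="${node_name}" \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind
```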

### Step 4: Perform node maintenance

You can now safely perform maintenance operations on the node, such as restarting it, updating the system, or servicing hardware.

### Step 5: Post-maintenance recovery (only for temporary maintenance)

If the maintenance is temporary, restore the node after the maintenance is complete:

1. Confirm that the node is healthy:

    ```shell
    watch kubectl get node ${node_name}
    ```

    After the node enters the `Ready` state, proceed to the next step.

2. Use the `kubectl uncordon` command to remove the scheduling restriction from the node:

    ```shell
    kubectl uncordon ${node_name}
    ```

3. Check whether all Pods return to normal operation:

    ```shell
    kubectl get pod --all-namespaces -o wide | grep ${node_name}
    ```

    After the Pods return to normal operation, the maintenance is complete.

If the maintenance is long-term or the node is removed permanently, this step is not required.