pingcap · ti-chi-bot · Sep 9, 2025 · Aug 22, 2025 · Aug 28, 2025 · Aug 28, 2025
diff --git a/en/TOC.md b/en/TOC.md
@@ -45,6 +45,8 @@
     - [Suspend and Resume a TiDB Cluster](suspend-tidb-cluster.md)
     - [Restart a TiDB Cluster](restart-a-tidb-cluster.md)
     - [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
+- Troubleshoot
+  - [Deployment Failures](deploy-failures.md)
 - Reference
   - Architecture
     - [TiDB Operator](architecture.md)

diff --git a/en/deploy-failures.md b/en/deploy-failures.md
@@ -5,4 +5,89 @@ summary: Learn the common deployment failures of TiDB on Kubernetes and their so
 
 # Common Deployment Failures of TiDB on Kubernetes
 
+This document describes the common deployment failures of TiDB on Kubernetes and their solutions.
+
+## The Pod is not created normally
+
+After creating a backup/restore task, if the Pod is not created, you can perform a diagnostic operation by executing the following commands:
+
+```shell
+kubectl get backups -n ${namespace}
+kubectl get jobs -n ${namespace}
+kubectl describe backups -n ${namespace} ${backup_name}
+kubectl describe jobs -n ${namespace} ${backupjob_name}
+kubectl describe restores -n ${namespace} ${restore_name}
+```
+
 ## The Pod is in the Pending state
+
+The Pending state of a Pod is usually caused by conditions of insufficient resources, for example:
+
+- The `StorageClass` of the PVC used by PD, TiKV, TiFlash, Backup, and Restore Pods does not exist or the PV is insufficient.
+- No nodes in the Kubernetes cluster can satisfy the CPU or memory resources requested by the Pod.
+- The certificates used by TiDB or TiProxy components are not configured.
+
+You can check the specific reason for Pending by using the `kubectl describe pod` command:
+
+```shell
+kubectl describe po -n ${namespace} ${pod_name}
+```
+
+### CPU or memory resources are insufficient
+
+If the CPU or memory resources are insufficient, you can lower the CPU or memory resources requested by the corresponding component for scheduling, or add a new Kubernetes node.
+
+### StorageClass of the PVC does not exist
+
+If the `StorageClass` of the PVC cannot be found, take the following steps:
+
+1. Get the available `StorageClass` in the cluster:
+
+    ```shell
+    kubectl get storageclass
+    ```
+
+2. Change `storageClassName` to the name of the `StorageClass` available in the cluster.
+
+3. Update the configuration file:
+
+    If you want to run a backup/restore task, first execute `kubectl delete bk ${backup_name} -n ${namespace}` to delete the old backup/restore task, and then execute `kubectl apply -f backup.yaml` to create a new backup/restore task.
+
+4. Delete the corresponding PVCs:
+
+    ```shell
+    kubectl delete pvc -n ${namespace} ${pvc_name}
+    ```
+
+### Insufficient available PVs
+
+If a `StorageClass` exists in the cluster but the available PVs are insufficient, you need to add PV resources correspondingly.
+
+## The Pod is in the `CrashLoopBackOff` state
+
+A Pod in the `CrashLoopBackOff` state means that the container in the Pod repeatedly aborts (in the loop of abort - restart by `kubelet` - abort). There are many potential causes of `CrashLoopBackOff`.
+
+### View the log of the current container
+
+```shell
+kubectl -n ${namespace} logs -f ${pod_name}
+```
+
+### View the log when the container was last restarted
+
+```shell
+kubectl -n ${namespace} logs -p ${pod_name}
+```
+
+After checking the error messages in the log, you can refer to [Cannot start `tidb-server`](https://docs.pingcap.com/tidb/stable/troubleshoot-tidb-cluster#cannot-start-tidb-server), [Cannot start `tikv-server`](https://docs.pingcap.com/tidb/stable/troubleshoot-tidb-cluster#cannot-start-tikv-server), and [Cannot start `pd-server`](https://docs.pingcap.com/tidb/stable/troubleshoot-tidb-cluster#cannot-start-pd-server) for further troubleshooting.
+
+### `ulimit` is not large enough
+
+TiKV might fail to start when `ulimit` is not large enough. In this case, you can modify the `/etc/security/limits.conf` file of the Kubernetes node to increase the `ulimit`:
+
+```
+root soft nofile 1000000
+root hard nofile 1000000
+root soft core unlimited
+root soft stack 10240
+```
diff --git a/zh/TOC.md b/zh/TOC.md
@@ -45,6 +45,8 @@
     - [挂起和恢复 TiDB 集群](suspend-tidb-cluster.md)
     - [重启 TiDB 集群](restart-a-tidb-cluster.md)
     - [销毁 TiDB 集群](destroy-a-tidb-cluster.md)
+- 故障诊断
+  - [部署错误](deploy-failures.md)
 - 参考
   - 架构
     - [TiDB Operator 架构](architecture.md)

diff --git a/zh/deploy-failures.md b/zh/deploy-failures.md
@@ -5,4 +5,89 @@ summary: 介绍 Kubernetes 上 TiDB 部署的常见错误以及处理办法。
 
 # Kubernetes 上的 TiDB 常见部署错误
 
+本文介绍了 Kubernetes 上 TiDB 常见部署错误以及处理办法。
+
+## Pod 未正常创建
+
+创建备份恢复任务后，如果 Pod 没有创建，则可以通过以下方式进行诊断：
+
+```shell
+kubectl get backups -n ${namespace}
+kubectl get jobs -n ${namespace}
+kubectl describe backups -n ${namespace} ${backup_name}
+kubectl describe jobs -n ${namespace} ${backupjob_name}
+kubectl describe restores -n ${namespace} ${restore_name}
+```
+
 ## Pod 处于 Pending 状态
+
+Pod 处于 Pending 状态，通常都是资源不满足导致的，比如：
+
+* 使用持久化存储的 PD、TiKV、TiFlash、Backup、Restore Pod 使用的 PVC 的 StorageClass 不存在或 PV 不足
+* Kubernetes 集群中没有节点能满足 Pod 申请的 CPU 或内存
+* TiDB、TiProxy 等组件使用的证书没有配置
+
+此时，可以通过 `kubectl describe pod` 命令查看 Pending 的具体原因：
+
+```shell
+kubectl describe po -n ${namespace} ${pod_name}
+```
+
+### CPU 或内存资源不足
+
+如果是 CPU 或内存资源不足，可以通过降低对应组件的 CPU 或内存资源申请，使其能够得到调度，或是增加新的 Kubernetes 节点。
+
+### PVC 的 StorageClass 不存在
+
+如果是 PVC 的 StorageClass 找不到，可采取以下步骤：
+
+1. 通过以下命令获取集群中可用的 StorageClass：
+
+    ```shell
+    kubectl get storageclass
+    ```
+
+2. 将 `storageClassName` 修改为集群中可用的 StorageClass 名字。
+
+3. 使用下述方式更新配置文件：
+
+    如果是运行 backup/restore 的备份/恢复任务，首先需要运行 `kubectl delete bk ${backup_name} -n ${namespace}` 删掉老的备份/恢复任务，再运行 `kubectl apply -f backup.yaml` 重新创建新的备份/恢复任务。
+
+4. 删除对应的 PVC：
+
+    ```shell
+    kubectl delete pvc -n ${namespace} ${pvc_name}
+    ```
+
+### 可用 PV 不足
+
+如果集群中有 StorageClass，但可用的 PV 不足，则需要添加对应的 PV 资源。
+
+## Pod 处于 CrashLoopBackOff 状态
+
+Pod 处于 CrashLoopBackOff 状态意味着 Pod 内的容器重复地异常退出（异常退出后，容器被 Kubelet 重启，重启后又异常退出，如此往复）。定位方法有很多种。
+
+### 查看 Pod 内当前容器的日志
+
+```shell
+kubectl -n ${namespace} logs -f ${pod_name}
+```
+
+### 查看 Pod 内容器上次启动时的日志信息
+
+```shell
+kubectl -n ${namespace} logs -p ${pod_name}
+```
+
+确认日志中的错误信息后，可以根据 [tidb-server 启动报错](https://docs.pingcap.com/zh/tidb/stable/troubleshoot-tidb-cluster/#tidb-server-启动报错 )、[tikv-server 启动报错](https://docs.pingcap.com/zh/tidb/stable/troubleshoot-tidb-cluster/#tikv-server-启动报错)、[pd-server 启动报错](https://docs.pingcap.com/zh/tidb/stable/troubleshoot-tidb-cluster/#pd-server-启动报错)中的指引信息进行进一步排查解决。
+
+### ulimit 不足
+
+另外，TiKV 在 ulimit 不足时也会发生启动失败的状况，对于这种情况，可以修改 Kubernetes 节点的 `/etc/security/limits.conf` 调大 ulimit：
+
+```
+root        soft        nofile        1000000
+root        hard        nofile        1000000
+root        soft        core          unlimited
+root        soft        stack         10240
+```