Merged
1 change: 1 addition & 0 deletions en/TOC.md
@@ -47,6 +47,7 @@
- [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
- Troubleshoot
- [Deployment Failures](deploy-failures.md)
- [Cluster Exceptions](exceptions.md)
- Reference
- Architecture
- [TiDB Operator](architecture.md)
52 changes: 51 additions & 1 deletion en/exceptions.md
@@ -5,4 +5,54 @@ summary: Learn the common exceptions during the operation of TiDB clusters on Ku

# Common Cluster Exceptions of TiDB on Kubernetes

## TiKV Store is in `Tombstone` status abnormally
This document describes common exceptions during the operation of TiDB clusters on Kubernetes and their solutions.

## Persistent connections are abnormally terminated in TiDB

Load balancers often set an idle connection timeout. If no data is sent over a connection for a certain period of time, the load balancer closes the connection.

- If a persistent connection is terminated while you are using TiDB, check the middleware program between the client and the TiDB server.
- If the idle timeout is too short for your query, increase the timeout value. If you cannot modify it, enable the `tcp-keep-alive` option in TiDB.
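To enable keepalive on the TiDB side, a minimal sketch is shown below. It assumes that the TiDBGroup `config` field accepts the TiDB configuration file as inline TOML and that the option lives under the `[performance]` section; verify both against your TiDB Operator and TiDB versions:

```yaml
apiVersion: core.pingcap.com/v1alpha1
kind: TiDBGroup
spec:
  template:
    spec:
      # Inline TiDB configuration (TOML). The `performance.tcp-keep-alive`
      # option enables TCP keepalive on TiDB client connections.
      config: |
        [performance]
        tcp-keep-alive = true
```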

On Linux, the kernel waits 7,200 seconds by default before sending the first keepalive probe packet. To shorten this interval, configure `sysctls` via the `podSecurityContext` field.

- If `--allowed-unsafe-sysctls=net.*` can be configured for [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) in the Kubernetes cluster, configure TiDBGroup using the [Overlay](overlay.md) feature as follows:

    ```yaml
    apiVersion: core.pingcap.com/v1alpha1
    kind: TiDBGroup
    spec:
      template:
        spec:
          overlay:
            pod:
              spec:
                securityContext:
                  sysctls:
                  - name: net.ipv4.tcp_keepalive_time
                    value: "300"
    ```

- If `--allowed-unsafe-sysctls=net.*` cannot be configured for [kubelet](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/) in the Kubernetes cluster, configure TiDBGroup using the [Overlay](overlay.md) feature as follows:

    ```yaml
    apiVersion: core.pingcap.com/v1alpha1
    kind: TiDBGroup
    spec:
      template:
        spec:
          overlay:
            pod:
              spec:
                initContainers:
                - name: init
                  image: busybox
                  command:
                  - "sh"
                  - "-c"
                  - "sysctl -w net.ipv4.tcp_keepalive_time=300"
                  securityContext:
                    privileged: true
    ```
2 changes: 1 addition & 1 deletion en/scale-a-tidb-cluster.md
@@ -46,7 +46,7 @@ To scale a TiDB cluster horizontally, use `kubectl` to modify the `spec.replicas
>
> - When the TiKV component scales in, TiDB Operator calls the PD interface to mark the corresponding TiKV instance as offline and then migrates its data to other TiKV nodes. During the data migration, the TiKV Pod remains in the `Running` state; the Pod is deleted only after the migration is completed. The time that scaling in takes depends on the amount of data on the TiKV instance to be scaled in. You can check whether TiKV is in the `Removing` state by running `kubectl get -n ${namespace} tikv`.
> - When the number of TiKV stores in the `Serving` state is less than or equal to the value of the `MaxReplicas` parameter in the PD configuration, the TiKV component cannot be scaled in.
> - The TiKV component does not support scaling out while a scale-in operation is in progress. Forcing a scale-out operation might cause anomalies in the cluster. If an anomaly already happens, refer to [TiKV Store is in Tombstone status abnormally](exceptions.md#tikv-store-is-in-tombstone-status-abnormally) to fix it.
> - The TiKV component does not support scaling out while a scale-in operation is in progress. Forcing a scale-out operation might cause anomalies in the cluster.
> - The TiFlash component has the same scale-in logic as TiKV.
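
Horizontal scaling is driven by editing `spec.replicas`, as described above. The following is a minimal sketch; it assumes a `TiKVGroup` resource analogous to the `TiDBGroup` shown in other examples, and the group name `basic-tikv` and namespace `tidb-cluster` are hypothetical:

```yaml
apiVersion: core.pingcap.com/v1alpha1
kind: TiKVGroup
metadata:
  name: basic-tikv
  namespace: tidb-cluster
spec:
  # Desired number of TiKV instances; decreasing this value triggers
  # the scale-in flow described in the notes above.
  replicas: 5
```

You can apply such a change with `kubectl edit` or by updating the manifest and running `kubectl apply -f`.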

## Vertical scaling
1 change: 1 addition & 0 deletions zh/TOC.md
@@ -47,6 +47,7 @@
- [Destroy a TiDB Cluster](destroy-a-tidb-cluster.md)
- Troubleshoot
- [Deployment Failures](deploy-failures.md)
- [Cluster Exceptions](exceptions.md)
- Reference
- Architecture
- [TiDB Operator Architecture](architecture.md)
49 changes: 48 additions & 1 deletion zh/exceptions.md
@@ -5,4 +5,51 @@ summary: Learn the common exceptions during the operation of TiDB clusters and their solutions.

# Common Cluster Exceptions of TiDB on Kubernetes

## TiKV Store is in `Tombstone` status abnormally
This document describes common exceptions during the operation of TiDB clusters on Kubernetes and their solutions.

## Persistent connections are abnormally terminated in TiDB

Load balancers often set an idle connection timeout. If no data is transferred over a connection for longer than the configured value, the load balancer proactively closes the connection. If you find that long-running queries are abnormally interrupted when using TiDB, check the middleware program between the client and the TiDB server. If its idle connection timeout is short, try to increase it. If you cannot modify it, enable the TiDB `tcp-keep-alive` option to turn on the TCP keepalive feature.

By default, Linux waits 7,200 seconds before sending a keepalive probe packet. To shorten this interval, configure `sysctls` via the `podSecurityContext` field.

- If `--allowed-unsafe-sysctls=net.*` can be configured for [kubelet](https://kubernetes.io/zh-cn/docs/reference/command-line-tools-reference/kubelet/) in the Kubernetes cluster, configure TiDBGroup using the [Overlay](overlay.md) feature as follows:

    ```yaml
    apiVersion: core.pingcap.com/v1alpha1
    kind: TiDBGroup
    spec:
      template:
        spec:
          overlay:
            pod:
              spec:
                securityContext:
                  sysctls:
                  - name: net.ipv4.tcp_keepalive_time
                    value: "300"
    ```

- If `--allowed-unsafe-sysctls=net.*` cannot be configured for [kubelet](https://kubernetes.io/zh-cn/docs/reference/command-line-tools-reference/kubelet/) in the Kubernetes cluster, configure TiDBGroup using the [Overlay](overlay.md) feature as follows:

    ```yaml
    apiVersion: core.pingcap.com/v1alpha1
    kind: TiDBGroup
    spec:
      template:
        spec:
          overlay:
            pod:
              spec:
                initContainers:
                - name: init
                  image: busybox
                  command:
                  - "sh"
                  - "-c"
                  - "sysctl -w net.ipv4.tcp_keepalive_time=300"
                  securityContext:
                    privileged: true
    ```
2 changes: 1 addition & 1 deletion zh/scale-a-tidb-cluster.md
@@ -46,7 +46,7 @@ summary: Learn how to manually scale a TiDB cluster on Kubernetes horizontally and ver
>
> - When the TiKV component scales in, TiDB Operator calls the PD interface to mark the corresponding TiKV as offline and then migrates its data to other TiKV nodes. During the data migration, the TiKV Pod remains in the `Running` state; the Pod is deleted only after the migration is completed. The time that scaling in takes depends on the amount of data on the TiKV to be scaled in. You can check whether TiKV is in the offline `Removing` state by running `kubectl get -n ${namespace} tikv`.
> - When the number of TiKV stores in the `Serving` state is less than or equal to the value of `MaxReplicas` in the PD configuration, the TiKV component cannot be scaled in.
> - The TiKV component does not support scaling out while a scale-in operation is in progress. Forcing such an operation might cause an abnormal cluster state. If the anomaly has already occurred, refer to [TiKV Store is in `Tombstone` status abnormally](exceptions.md#tikv-store-异常进入-tombstone-状态) to fix it.
> - The TiKV component does not support scaling out while a scale-in operation is in progress. Forcing such an operation might cause an abnormal cluster state.
> - The TiFlash component has the same scale-in logic as TiKV.

## Vertical scaling