# Proposal: VIP-Management

- Author(s): [djshow832](https://github.com/djshow832)
- Tracking Issue: https://github.com/pingcap/tiproxy/issues/583

## Abstract

This document proposes a design for managing a VIP within a TiProxy cluster to achieve high availability of TiProxy without deploying third-party tools.

## Terms

- VIP: Virtual IP
- NIC: Network Interface Card
- ARP: Address Resolution Protocol
- VRRP: Virtual Router Redundancy Protocol
- MMM: Multi-Master Replication Manager for MySQL
- MHA: Master High Availability

## Background

In a self-hosted TiDB cluster with TiProxy, TiProxy is typically the endpoint for clients. To achieve high availability, users may deploy multiple TiProxy instances, of which only one serves requests, so that clients need to configure only a single database address. When the active TiProxy goes down, the cluster should elect another TiProxy automatically and the clients shouldn't need to update the database address.

Thus we need a VIP solution: the VIP is always bound to an available TiProxy node, and when the active node goes down, the VIP floats to another node.

<img src="./imgs/vip-arch.png" alt="vip architecture" width="600">

Currently, typical solutions include:

- Deploy Keepalived together with TiProxy. Keepalived is capable of both health checks and VIP management.
- Deploy a crontab job that checks the health of TiProxy and sets the VIP.

Neither approach is easy to use. This design proposes a solution that enables the TiProxy cluster to manage the VIP by itself.

## Goals

- Bind a VIP to an available TiProxy node and switch the VIP when the node becomes unavailable
- Support VIP management on self-hosted TiDB clusters that run on bare metal with Linux

## Non-Goals

- Support configuring weights of TiProxy nodes
- Support configuring multiple VIPs for a TiProxy cluster
- Support VIP management on Docker, Kubernetes, or cloud
- Support VIP management on non-Linux operating systems

## Proposal

### Active Node Election

Firstly, the TiProxy cluster needs to elect an available instance to be the active node. Etcd is built into PD and is capable of leader election, so we can simply use the Etcd election. The first instance that boots becomes the leader in the first election round.

When an instance is chosen to be active, it binds the VIP to itself. It unbinds the VIP when (see the sketch after this list):

- It finds that it's no longer the leader, perhaps because its network is unstable and Etcd evicts it
- It's shutting down, for example during a rolling upgrade of the TiProxy instances
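
A minimal sketch of this loop, assuming the Etcd `clientv3/concurrency` API; the key prefix, addresses, and the `bindVIP`/`unbindVIP` helpers are illustrative, not TiProxy's actual names:

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

// bindVIP and unbindVIP stand in for the binding logic described in
// "Adding and Deleting VIP" below.
func bindVIP() error   { return nil }
func unbindVIP() error { return nil }

// electAndHold campaigns for leadership and holds the VIP while this
// instance remains the leader.
func electAndHold(ctx context.Context, cli *clientv3.Client, instanceAddr string) error {
	// The session TTL bounds how long a dead leader is still treated as alive.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(3))
	if err != nil {
		return err
	}
	defer sess.Close()

	elec := concurrency.NewElection(sess, "/tiproxy/vip-owner")
	// Campaign blocks until this instance becomes the leader or ctx is canceled.
	if err := elec.Campaign(ctx, instanceAddr); err != nil {
		return err
	}
	if err := bindVIP(); err != nil {
		return err
	}

	select {
	case <-sess.Done():
		// The Etcd lease expired, e.g. the network is unstable and Etcd
		// evicted this instance: it is no longer the leader, so release the VIP.
		log.Println("lost leadership, unbinding VIP")
		return unbindVIP()
	case <-ctx.Done():
		// Shutting down, e.g. during a rolling upgrade; the graceful-shutdown
		// ordering in the Failover section handles resign and unbind.
		return ctx.Err()
	}
}
```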

### Failover

The Etcd session TTL determines the RTO: a longer TTL lengthens the RTO, while a shorter TTL makes the leader switch frequently on an unstable network. We set it to 3 seconds, so the RTO should be close to 3 seconds.

During the shutdown of the active node, the active node's unbinding and the standby node's binding happen concurrently. If the unbinding comes first, clients may fail to connect during the gap. To ensure the binding comes first, the active node resigns leadership before the graceful wait and only unbinds the VIP after the graceful wait, as sketched below.
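
A sketch of that ordering, continuing the election sketch above (it additionally needs the `time` import; the graceful-wait duration is passed in by the caller):

```go
// gracefulShutdown resigns leadership first so that a standby node can win the
// election and bind the VIP while this node is still draining connections, and
// only unbinds the VIP after the graceful wait.
func gracefulShutdown(ctx context.Context, elec *concurrency.Election, gracefulWait time.Duration) error {
	// Step 1: resign so the standby can start campaigning and binding immediately.
	if err := elec.Resign(ctx); err != nil {
		log.Printf("resign failed: %v", err)
	}
	// Step 2: keep serving existing connections during the graceful wait.
	time.Sleep(gracefulWait)
	// Step 3: release the VIP last, after the standby has had time to bind it.
	return unbindVIP()
}
```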

When the PD leader is down and before a new PD leader is elected, no TiProxy node can connect to the Etcd server. If the owner unbound the VIP in this state, clients couldn't reach TiProxy through the VIP. Thus, the owner doesn't unbind the VIP until the next active node is elected.

### Adding and Deleting VIP

Once a node is chosen to be active, it binds the VIP to itself in two steps:

1. Attach a secondary IP to the specified NIC through netlink
2. Send a gratuitous ARP to notify the whole subnet of the IP and MAC address so that clients update their ARP caches

There may be a period during which the previous active node hasn't unbound the VIP yet while the new active node has already bound it. The second step ensures that clients connect to the new node because their ARP caches are updated. Existing connections to the previous node continue until the clients disconnect.

These steps are equivalent to the following Linux commands:

```shell
ip addr add 192.168.148.100/24 dev eth0
arping -q -c 1 -U -I eth0 192.168.148.100
```
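
In-process, TiProxy can perform the same two steps with a netlink library instead of shelling out. A minimal sketch, assuming the `github.com/vishvananda/netlink` and `github.com/j-keck/arping` packages (the VIP, prefix length, and interface name are the example values above):

```go
package main

import (
	"github.com/j-keck/arping"
	"github.com/vishvananda/netlink"
)

// bindVIP attaches the VIP as a secondary address on the NIC and broadcasts
// a gratuitous ARP so that clients in the subnet update their ARP caches.
func bindVIP(vip, iface string) error {
	link, err := netlink.LinkByName(iface) // e.g. "eth0"
	if err != nil {
		return err
	}
	addr, err := netlink.ParseAddr(vip) // e.g. "192.168.148.100/24"
	if err != nil {
		return err
	}
	// Equivalent to `ip addr add 192.168.148.100/24 dev eth0`.
	if err := netlink.AddrAdd(link, addr); err != nil {
		return err
	}
	// Equivalent to `arping -q -c 1 -U -I eth0 192.168.148.100`.
	return arping.GratuitousArpOverIfaceByName(addr.IP, iface)
}

// unbindVIP removes the secondary address, equivalent to `ip addr del`.
func unbindVIP(vip, iface string) error {
	link, err := netlink.LinkByName(iface)
	if err != nil {
		return err
	}
	addr, err := netlink.ParseAddr(vip)
	if err != nil {
		return err
	}
	return netlink.AddrDel(link, addr)
}
```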

The secondary-IP approach is also used in MySQL HA tools such as MMM and MHA. The limitation is that the VIP must be reserved in the subnet and it only works within a single subnet.

The TiProxy user must be privileged to run `ip addr add`, `ip addr del`, and `arping`, which normally means running as `root`. However, TiProxy is typically deployed by TiUP, and TiUP only requires the `sudo` permission, so TiProxy should retry with `sudo` if permission is denied. This in turn requires `ip` and `arping` to be installed, as shown in the sketch below.
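
A possible shape of that fallback, building on the `bindVIP` sketch above; the helper name, error detection, and command construction are assumptions, and the `ip`/`arping` binaries must be on the PATH:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// addVIPWithFallback tries the in-process netlink call first and falls back
// to `sudo ip addr add` plus `sudo arping` when the process lacks privilege.
func addVIPWithFallback(vip, iface string) error {
	err := bindVIP(vip, iface) // the netlink-based sketch above
	if err == nil || !errors.Is(err, os.ErrPermission) {
		return err
	}
	// Fall back to external commands with sudo; this requires `ip` and
	// `arping` to be installed and sudo privilege for the TiProxy user.
	if out, cmdErr := exec.Command("sudo", "ip", "addr", "add", vip, "dev", iface).CombinedOutput(); cmdErr != nil {
		return fmt.Errorf("sudo ip addr add failed: %v, output: %s", cmdErr, out)
	}
	ip := strings.SplitN(vip, "/", 2)[0] // arping takes the bare IP without the prefix length
	if out, cmdErr := exec.Command("sudo", "arping", "-q", "-c", "1", "-U", "-I", iface, ip).CombinedOutput(); cmdErr != nil {
		return fmt.Errorf("sudo arping failed: %v, output: %s", cmdErr, out)
	}
	return nil
}
```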

## Configuration

All TiProxy instances share the same configuration:

```toml
[ha]
  vip="192.168.148.100"
  interface="eth0"
```

`vip` declares the VIP and `interface` declares the network interface (NIC device) to which the VIP is bound. If either of them is not configured, the instance won't preempt the VIP.
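
As an illustration, the `[ha]` section could map to a struct like the following; the names and tags are assumptions for this sketch, not TiProxy's actual code:

```go
// HA is a hypothetical mapping of the [ha] configuration section.
type HA struct {
	VIP       string `toml:"vip"`
	Interface string `toml:"interface"`
}

// VIPEnabled reports whether this instance should take part in VIP preemption:
// both the VIP and the interface must be configured.
func (c HA) VIPEnabled() bool {
	return c.VIP != "" && c.Interface != ""
}
```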

It's possible to support updating these configurations online, but it's unnecessary: clients would need to update the database address anyway, which interrupts the business, so we don't support updating the configurations online.

## Observability

Besides logs, we can show the current active node on Grafana.

## Alternatives

### Consensus Algorithms

Some products use consensus algorithms such as Raft and Paxos to elect the active node. It's straightforward but has some disadvantages:

- Consensus algorithms require at least 3 nodes, while users usually deploy only 2 TiProxy instances.
- In the case of a network partition, the active node must be able to connect to the PD leader, but the node elected by a consensus algorithm may end up in a different partition from the PD leader. If so, it would route to TiDB instances that cannot connect to the PD leader either.

### VRRP

VRRP is another VIP solution; it is implemented by Keepalived, a tool widely deployed alongside proxies such as HAProxy.

The problem is that VRRP is too complicated to troubleshoot.

## Future works

### Weight Configuration

Node weights may be useful when users have preferences among the nodes. If the node with the highest weight is available, it holds the VIP until it goes down. On top of this, some products also have a preempt mode: when the preferred node recovers, it takes back the VIP even if the current active node is still available.

Although some products support configuring node weights, it's not so straightforward to implement on Etcd and may not be necessary. We'll consider it in the future if users require it. Currently, all nodes have the same probability of becoming active.

### Multiple VIPs

Some MySQL clusters use one VIP for write nodes and multiple VIPs for read nodes. Similarly, TiProxy can have multiple VIPs to expose multiple endpoints for resource isolation. This requires partitioning TiProxy and TiDB instances into node groups, with each TiProxy routing only to the TiDB instances in the same group.

It changes the election procedure and TiProxy configuration. We'll consider it if users require it.