K8SPS-280: Improve full cluster crash recovery (#404)
* K8SPS-280: Improve full cluster crash recovery
Before these changes, we were rebooting the cluster from a complete outage
using pod-0, without checking which member had the latest transactions.
Therefore our full cluster crash recovery implementation was prone to data
loss.
Now we're using mysql-shell's built-in checks to detect the member to
reboot from. For this, mysql-shell requires every member to be reachable
so that it can connect to each one and compare GTIDs. That means that in
case of a full cluster crash we need to start every pod and ensure they're
all reachable.
We're bringing back the `/var/lib/mysql/full-cluster-crash` file to address
this requirement. Pods create this file if they detect a full cluster crash
and restart themselves. After the restart, they start the mysqld process but
ensure the server comes up as read-only. Once all pods are up and running
(ready), the operator runs `dba.rebootClusterFromCompleteOutage()` in one of
the MySQL pods. Which pod we run this in is not important, since mysql-shell
connects to each pod and selects the suitable one to reboot the cluster from.
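Once every pod is Ready, the reboot itself is a single mysql-shell call executed inside any one of the MySQL pods. A hedged sketch; the `PodExecutor` interface, the connection URI, and the exact mysqlsh flags are illustrative, not the operator's actual code:
```go
package mysql

import (
	"bytes"
	"context"
	"fmt"
)

// PodExecutor abstracts "kubectl exec"-style command execution inside a pod
// (illustrative interface; the operator has its own exec helpers).
type PodExecutor interface {
	Exec(ctx context.Context, cmd []string, stdout, stderr *bytes.Buffer) error
}

// rebootFromCompleteOutage runs dba.rebootClusterFromCompleteOutage() through
// mysqlsh in an arbitrary MySQL pod. Which pod it runs in does not matter:
// mysql-shell connects to every member and picks the suitable one to reboot
// the cluster from.
func rebootFromCompleteOutage(ctx context.Context, pod PodExecutor, clusterName, user, pass string) error {
	cmd := []string{
		"mysqlsh",
		"--no-wizard",
		"--uri", fmt.Sprintf("%s:%s@localhost", user, pass),
		"-e", fmt.Sprintf("dba.rebootClusterFromCompleteOutage('%s')", clusterName),
	}

	var stdout, stderr bytes.Buffer
	if err := pod.Exec(ctx, cmd, &stdout, &stderr); err != nil {
		return fmt.Errorf("reboot cluster from complete outage: %w: %s", err, stderr.String())
	}
	return nil
}
```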
*Events*
This commit also introduces the event recorder and two events:
1. FullClusterCrashDetected
2. FullClusterCrashRecovered
Users will be able to see these events on the `PerconaServerMySQL` object.
For example:
```
$ kubectl describe ps cluster1
...
Events:
  Type     Reason                     Age                 From           Message
  ----     ------                     ----                ----           -------
  Warning  FullClusterCrashDetected   19m (x10 over 20m)  ps-controller  Full cluster crash detected
  Normal   FullClusterCrashRecovered  17m                 ps-controller  Cluster recovered from full cluster crash
```
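The events come from a standard Kubernetes event recorder attached to the reconciler, e.g. via `mgr.GetEventRecorderFor("ps-controller")`. A minimal sketch with reasons and messages matching the output above (the API import path is assumed):
```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// assumed import path for the PerconaServerMySQL API types
	apiv1alpha1 "github.com/percona/percona-server-mysql-operator/api/v1alpha1"
)

// PerconaServerMySQLReconciler carries the event recorder obtained from the
// controller manager, e.g. mgr.GetEventRecorderFor("ps-controller").
type PerconaServerMySQLReconciler struct {
	client.Client
	Recorder record.EventRecorder
}

// recordFullClusterCrash emits a Warning event while the cluster is down; the
// reconcile loop runs repeatedly, which is why the event shows up as
// "(x10 over 20m)" in `kubectl describe`.
func (r *PerconaServerMySQLReconciler) recordFullClusterCrash(cr *apiv1alpha1.PerconaServerMySQL) {
	r.Recorder.Event(cr, corev1.EventTypeWarning, "FullClusterCrashDetected", "Full cluster crash detected")
}

// recordFullClusterCrashRecovered emits a Normal event once the reboot succeeds.
func (r *PerconaServerMySQLReconciler) recordFullClusterCrashRecovered(cr *apiv1alpha1.PerconaServerMySQL) {
	r.Recorder.Event(cr, corev1.EventTypeNormal, "FullClusterCrashRecovered", "Cluster recovered from full cluster crash")
}
```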
*Probe timeouts*
Kubernetes had some problems with timeouts in exec probes, which were fixed
in recent releases, but we still see problematic behavior. For example, even
though Kubernetes successfully detects a timeout in an exec probe, it doesn't
count the timeout as a failure, so the container is not restarted even if its
liveness probe times out a million times. With this commit we're handling
timeouts ourselves with contexts.
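A minimal sketch of that pattern, assuming the exec probe is a small Go health-check binary; the DSN and timeout are placeholders. If the check exceeds its own deadline, the binary exits non-zero, so the kubelet counts it as a probe failure:
```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// runProbe enforces its own timeout instead of relying on the kubelet's
// exec-probe timeout accounting. If the check doesn't finish in time, the
// context is cancelled and the probe fails explicitly.
func runProbe(check func(ctx context.Context) error, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	errCh := make(chan error, 1)
	go func() { errCh <- check(ctx) }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		// Fires even if the check itself ignores the context.
		return fmt.Errorf("probe timed out after %s: %w", timeout, ctx.Err())
	}
}

func main() {
	// Placeholder health check: ping the local mysqld. Credentials are illustrative.
	mysqlPing := func(ctx context.Context) error {
		db, err := sql.Open("mysql", "monitor:monitor-password@tcp(127.0.0.1:3306)/")
		if err != nil {
			return err
		}
		defer db.Close()
		return db.PingContext(ctx)
	}

	if err := runProbe(mysqlPing, 10*time.Second); err != nil {
		log.Fatalf("healthcheck failed: %v", err) // non-zero exit => probe failure
	}
}
```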
* fix limits test
* simplify exec commands
* add autoRecovery field
* don't reboot cluster more than necessary
* fix unit tests
* improve logs