mysql-k8s cluster unable to recover quorum after full power cycle #141
Description
Steps to reproduce
Deploy a mysql-k8s cluster (3 units) as part of a Sunbeam OpenStack deployment.
Ensure the cluster is healthy and running normally.
Perform a full power cycle of the lab (power off all servers and power them on again).
Wait for the Kubernetes cluster and Juju controller to come back.
Check the model status with: juju status
The mysql units appear offline and the cluster does not recover quorum automatically.
Example observed state:
Model Controller Cloud/Region Version SLA Timestamp
openstack sunbeam-controller handy-horse-k8s/localhost 3.6.14 unsupported 08:18:42Z
SAAS Status Store URL
cinder-volume active local admin/openstack-machines.cinder-volume
microceph active local admin/openstack-machines.microceph
App Version Status Scale Charm Channel Rev Address Exposed Message
mysql 8.0.44-0ubuntu0.22.04.1 waiting 3 mysql-k8s 8.0/stable 343 10.152.183.128 no installing agent
Unit Workload Agent Address Ports Message
mysql/0 maintenance idle 10.1.0.238 offline
mysql/1 maintenance idle 10.1.2.205 offline
mysql/2* maintenance idle 10.1.1.50 offline
Offer Application Charm Rev Connected Endpoint Interface Role
cert-distributor keystone keystone-k8s 319 2/2 send-ca-cert certificate_transfer provider
certificate-authority certificate-authority self-signed-certificates 317 1/1 certificates tls-certificates provider
cinder-volume-mysql-router cinder-volume-mysql-router mysql-router-k8s 871 1/1 database mysql_client provider
keystone-credentials keystone keystone-k8s 319 2/2 identity-credentials keystone-credentials provider
keystone-endpoints keystone keystone-k8s 319 1/1 identity-service keystone provider
keystone-ops keystone keystone-k8s 319 0/0 identity-ops keystone-resources provider
nova nova nova-k8s 208 1/1 nova-service nova provider
ovn-relay ovn-relay ovn-relay-k8s 179 1/1 ovsdb-cms-relay ovsdb-cms provider
rabbitmq rabbitmq rabbitmq-k8s 54 2/2 amqp rabbitmq provider
traefik-rgw traefik-rgw traefik-k8s 266 1/1 traefik-route traefik_route provider
Expected behavior
After the infrastructure and Kubernetes cluster come back online, the mysql cluster should automatically recover quorum, or at minimum there should be a documented recovery (or power-cycle) procedure.
Ideally, the cluster should either:
- Automatically elect a primary and rejoin members, or
- Provide a supported action/workflow to recover from a full outage.
In the older mysql-innodb-cluster charm there was an action for this scenario: https://charmhub.io/mysql-innodb-cluster/actions#reboot-cluster-from-complete-outage
Actual behavior
After the power cycle, all units remain in maintenance/offline state and the cluster does not recover on its own. Manually restarting the mysqld process in each container does not always resolve the issue:
juju ssh --container mysql mysql/{0,1,2} pebble restart mysqld
The only workaround so far has been redeploying the application.
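As an interim step, the outage can sometimes be recovered manually with MySQL Shell's AdminAPI, which is what the old charm action wrapped. This is a hedged sketch, not a supported charm workflow: the admin user name is a placeholder (the real credentials live in the charm's secrets/peer data), and rebooting the group this way behind the charm's back may leave the charm's own state out of sync.

```shell
# Open a shell in the mysql container of one unit (any surviving unit works;
# MySQL Shell will pick the member with the most recent GTID set as the seed).
juju ssh --container mysql mysql/0 bash

# Inside the container, connect with MySQL Shell as an admin user.
# ADMIN_USER is a placeholder -- obtain the real credentials from the charm.
mysqlsh --uri 'ADMIN_USER@localhost:3306'

# At the mysqlsh prompt, reboot the InnoDB Cluster from the complete outage
# and then verify member states:
#   JS> dba.rebootClusterFromCompleteOutage()
#   JS> dba.getCluster().status()
```

Note that the unit reporting the relay-log corruption in the logs below may still need to be re-provisioned (or have its relay logs discarded) before it can rejoin.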
Versions
juju: 3.6.14
mysql-k8s rev 343
channel: 8.0/stable
Canonical Kubernetes 1.32/stable
Logs
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:41.904675Z 0 [System] [MY-014010] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been started.'
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.034692Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.034746Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.035294Z 0 [Warning] [MY-013595] [Server] Failed to initialize TLS for channel: mysql_admin. See below for the description of exact issue.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.035307Z 0 [Warning] [MY-010069] [Server] Failed to set up SSL because of the following SSL library error: SSL context is not usable without certificate and private key
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.035315Z 0 [System] [MY-013603] [Server] No TLS configuration was given for channel mysql_admin; re-using TLS configuration of channel mysql_main.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.077610Z 0 [Warning] [MY-010604] [Repl] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a replica and has his hostname changed!! Please use '--relay-log=mysql-2-relay-bin' to avoid this problem.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.107702Z 0 [Warning] [MY-010818] [Server] Error reading GTIDs from relaylog: -1
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.117759Z 9 [Warning] [MY-013360] [Server] '@@binlog_transaction_dependency_tracking' is deprecated and will be removed in a future release.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.122207Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.44-0ubuntu0.22.04.1' socket: '/var/run/mysqld/mysqld.sock' port: 3306 (Ubuntu).
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.122244Z 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: 'mysql-2.mysql-endpoints.openstack.svc.cluster.local.' port: 33062
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.122254Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '0.0.0.0' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.124263Z 4 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.131971Z 12 [System] [MY-010597] [Repl] 'CHANGE REPLICATION SOURCE TO FOR CHANNEL 'group_replication_applier' executed'. Previous state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''. New state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.137543Z 13 [System] [MY-014081] [Repl] Plugin group_replication reported: 'The Group Replication certifier broadcast thread (THD_certifier_broadcast) started.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676616Z 14 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel 'group_replication_applier': corrupted data in log event
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676706Z 14 [ERROR] [MY-013121] [Repl] Replica SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676729Z 14 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676868Z 12 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676957Z 12 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676984Z 12 [ERROR] [MY-011644] [Repl] Plugin group_replication reported: 'Unable to confirm whether the server has left the group or not. Check performance_schema.replication_group_members to check group membership information.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676997Z 12 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.677102Z 12 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.677264Z 14 [ERROR] [MY-010586] [Repl] Error running query, replica SQL thread aborted. Fix the problem, and restart the replica SQL thread with "START REPLICA". We stopped at log 'FIRST' position 0
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.064562Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node mysql-0.mysql-endpoints.openstack.svc.cluster.local.:3306. Please check the connection status to this member'
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.185065Z 0 [ERROR] [MY-011502] [Repl] Plugin group_replication reported: 'There was a previous plugin error while the member joined the group. The member will now exit the group.'
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.185215Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to mysql-2.mysql-endpoints.openstack.svc.cluster.local.:3306 on view 17709708250262575:9.'
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.185330Z 0 [ERROR] [MY-011486] [Repl] Plugin group_replication reported: 'Message received while the plugin is not ready, message discarded.'
2026-02-13T10:15:51.197Z [mysql] 2026-02-13T10:15:50.361468Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2026-02-13T10:15:51.197Z [mysql] 2026-02-13T10:15:50.362863Z 0 [System] [MY-014082] [Repl] Plugin group_replication reported: 'The Group Replication certifier broadcast thread (THD_certifier_broadcast) stopped.'
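The log above itself suggests checking `performance_schema.replication_group_members` for group membership. A quick way to compare what each unit believes, run from the Juju client (ADMIN_USER is a placeholder for the charm-managed admin credentials):

```shell
# Query Group Replication membership as seen from each of the three units.
# A healthy cluster shows three ONLINE members with one PRIMARY; after the
# outage each member typically reports only itself, in OFFLINE or ERROR state.
for i in 0 1 2; do
  echo "== mysql/$i =="
  juju ssh --container mysql mysql/$i -- mysql -uADMIN_USER -p \
    -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
        FROM performance_schema.replication_group_members;"
done
```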
Additional context
This scenario is easy to reproduce in lab environments where the entire infrastructure is powered off and on again.
It would be helpful to know:
- Whether this recovery scenario is expected to work automatically.
- Whether there is a recommended recovery procedure for mysql-k8s clusters after a full outage.