
mysql-k8s cluster unable to recover quorum after full power cycle #141

@mateofloreza

Description

Steps to reproduce

1. Deploy a mysql-k8s cluster (3 units) as part of a Sunbeam OpenStack deployment.
2. Ensure the cluster is healthy and running normally.
3. Perform a full power cycle of the lab (power off all servers and power them on again).
4. Wait for the Kubernetes cluster and the Juju controller to come back.
5. Check the model status with: juju status

The mysql units appear offline and the cluster does not recover quorum automatically.

Example observed state:

Model      Controller          Cloud/Region               Version  SLA          Timestamp
openstack  sunbeam-controller  handy-horse-k8s/localhost  3.6.14   unsupported  08:18:42Z

SAAS           Status  Store  URL
cinder-volume  active  local  admin/openstack-machines.cinder-volume
microceph      active  local  admin/openstack-machines.microceph

App    Version                  Status   Scale  Charm      Channel     Rev  Address         Exposed  Message
mysql  8.0.44-0ubuntu0.22.04.1  waiting      3  mysql-k8s  8.0/stable  343  10.152.183.128  no       installing agent

Unit      Workload     Agent  Address     Ports  Message
mysql/0   maintenance  idle   10.1.0.238         offline
mysql/1   maintenance  idle   10.1.2.205         offline
mysql/2*  maintenance  idle   10.1.1.50          offline

Offer                       Application                 Charm                     Rev  Connected  Endpoint              Interface             Role
cert-distributor            keystone                    keystone-k8s              319  2/2        send-ca-cert          certificate_transfer  provider
certificate-authority       certificate-authority       self-signed-certificates  317  1/1        certificates          tls-certificates      provider
cinder-volume-mysql-router  cinder-volume-mysql-router  mysql-router-k8s          871  1/1        database              mysql_client          provider
keystone-credentials        keystone                    keystone-k8s              319  2/2        identity-credentials  keystone-credentials  provider
keystone-endpoints          keystone                    keystone-k8s              319  1/1        identity-service      keystone              provider
keystone-ops                keystone                    keystone-k8s              319  0/0        identity-ops          keystone-resources    provider
nova                        nova                        nova-k8s                  208  1/1        nova-service          nova                  provider
ovn-relay                   ovn-relay                   ovn-relay-k8s             179  1/1        ovsdb-cms-relay       ovsdb-cms             provider
rabbitmq                    rabbitmq                    rabbitmq-k8s              54   2/2        amqp                  rabbitmq              provider
traefik-rgw                 traefik-rgw                 traefik-k8s               266  1/1        traefik-route         traefik_route         provider

Expected behavior

After the infrastructure and the Kubernetes cluster come back online, the mysql cluster should automatically recover quorum, or at minimum there should be a documented recovery (or power-cycle) procedure.

Ideally, the cluster should either:

  • Automatically elect a primary and rejoin members, or
  • Provide a supported action/workflow to recover from a full outage.

In the older mysql-innodb-cluster charm there was an action for this scenario: https://charmhub.io/mysql-innodb-cluster/actions#reboot-cluster-from-complete-outage
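For comparison, the manual AdminAPI equivalent of that action can be sketched as follows. This is only an outline, not a verified procedure for this charm: the unit name, the clusteradmin user, and the way credentials are obtained are all illustrative, and the charm-managed cluster configuration should be confirmed before attempting it.

```shell
# Open a shell in the mysql container of one surviving unit (unit name illustrative)
juju ssh --container mysql mysql/0 bash

# From inside the container, connect with MySQL Shell as a cluster admin user
# (user and credentials illustrative; the charm manages its own internal users)
mysqlsh clusteradmin@localhost

# At the mysqlsh prompt, rebuild the group from the member with the most
# recent GTID set -- the AdminAPI analogue of reboot-cluster-from-complete-outage:
#   dba.rebootClusterFromCompleteOutage()
```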

Actual behavior

After the power cycle, all units remain in maintenance/offline and the cluster does not recover. Manually restarting the mysqld process in the container does not reliably resolve the issue:

juju ssh --container mysql mysql/{0,1,2} pebble restart mysqld

The only workaround so far has been redeploying the application.
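To gather more state while reproducing, the group membership table can be queried on each unit. The get-password action name is taken from the mysql-k8s charm's documented actions; verify it with juju actions mysql before relying on it.

```shell
# Fetch the root password from the charm (action name per mysql-k8s docs)
juju run mysql/leader get-password username=root

# On each unit, inspect group membership: a healthy 3-node group shows three
# rows with MEMBER_STATE=ONLINE, while stuck members show OFFLINE or ERROR
juju ssh --container mysql mysql/0 \
  mysql -uroot -p'<password>' \
  -e "SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;"
```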

Versions

juju: 3.6.14
mysql-k8s: rev 343, channel 8.0/stable
Canonical Kubernetes: 1.32/stable

Logs

2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:41.904675Z 0 [System] [MY-014010] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been started.'
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.034692Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.034746Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.035294Z 0 [Warning] [MY-013595] [Server] Failed to initialize TLS for channel: mysql_admin. See below for the description of exact issue.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.035307Z 0 [Warning] [MY-010069] [Server] Failed to set up SSL because of the following SSL library error: SSL context is not usable without certificate and private key
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.035315Z 0 [System] [MY-013603] [Server] No TLS configuration was given for channel mysql_admin; re-using TLS configuration of channel mysql_main.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.077610Z 0 [Warning] [MY-010604] [Repl] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a replica and has his hostname changed!! Please use '--relay-log=mysql-2-relay-bin' to avoid this problem.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.107702Z 0 [Warning] [MY-010818] [Server] Error reading GTIDs from relaylog: -1
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.117759Z 9 [Warning] [MY-013360] [Server] '@@binlog_transaction_dependency_tracking' is deprecated and will be removed in a future release.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.122207Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.44-0ubuntu0.22.04.1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Ubuntu).
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.122244Z 0 [System] [MY-013292] [Server] Admin interface ready for connections, address: 'mysql-2.mysql-endpoints.openstack.svc.cluster.local.'  port: 33062
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.122254Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '0.0.0.0' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.124263Z 4 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.131971Z 12 [System] [MY-010597] [Repl] 'CHANGE REPLICATION SOURCE TO FOR CHANNEL 'group_replication_applier' executed'. Previous state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''. New state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''.
2026-02-13T10:15:42.194Z [mysql] 2026-02-13T10:15:42.137543Z 13 [System] [MY-014081] [Repl] Plugin group_replication reported: 'The Group Replication certifier broadcast thread (THD_certifier_broadcast) started.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676616Z 14 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel 'group_replication_applier': corrupted data in log event
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676706Z 14 [ERROR] [MY-013121] [Repl] Replica SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676729Z 14 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676868Z 12 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676957Z 12 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676984Z 12 [ERROR] [MY-011644] [Repl] Plugin group_replication reported: 'Unable to confirm whether the server has left the group or not. Check performance_schema.replication_group_members to check group membership information.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.676997Z 12 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.677102Z 12 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2026-02-13T10:15:43.195Z [mysql] 2026-02-13T10:15:42.677264Z 14 [ERROR] [MY-010586] [Repl] Error running query, replica SQL thread aborted. Fix the problem, and restart the replica SQL thread with "START REPLICA". We stopped at log 'FIRST' position 0
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.064562Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node mysql-0.mysql-endpoints.openstack.svc.cluster.local.:3306. Please check the connection status to this member'
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.185065Z 0 [ERROR] [MY-011502] [Repl] Plugin group_replication reported: 'There was a previous plugin error while the member joined the group. The member will now exit the group.'
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.185215Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to mysql-2.mysql-endpoints.openstack.svc.cluster.local.:3306 on view 17709708250262575:9.'
2026-02-13T10:15:47.195Z [mysql] 2026-02-13T10:15:47.185330Z 0 [ERROR] [MY-011486] [Repl] Plugin group_replication reported: 'Message received while the plugin is not ready, message discarded.'
2026-02-13T10:15:51.197Z [mysql] 2026-02-13T10:15:50.361468Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2026-02-13T10:15:51.197Z [mysql] 2026-02-13T10:15:50.362863Z 0 [System] [MY-014082] [Repl] Plugin group_replication reported: 'The Group Replication certifier broadcast thread (THD_certifier_broadcast) stopped.'

Additional context

This scenario is easy to reproduce in lab environments where the entire infrastructure is powered off and on again.

It would be helpful to know:

  • Whether this recovery scenario is expected to work automatically.
  • Whether there is a recommended recovery procedure for mysql-k8s clusters after a full outage.
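If maintainers can confirm which actions apply here, a first triage step might look like the sketch below; get-cluster-status is assumed to exist on this revision and should be checked against the actual action list first.

```shell
# List the actions actually exposed by this charm revision
juju actions mysql

# If available, compare the charm's own view of the InnoDB cluster state
# (action name assumed; not verified against rev 343)
juju run mysql/leader get-cluster-status
```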

Metadata

Labels: bug (Something isn't working as expected)