mysql charm — certificates-relation-joined hook fails when mysqld not yet running #189
Description
This is an AI-assisted triage report; I discussed it with Paulo and it looks reasonable.
We encounter this issue from time to time on top of Sunbeam.
======================================================================================
Date observed: 2026-03-20
Charm: mysql, revision 444, channel 8.0/stable
Juju version: 3.6.19
Cloud/substrate: OpenStack (tor3-sqa-sunbeam, cluster_1)
Summary
During a 3-unit MySQL InnoDB cluster deployment, unit mysql/1 never reaches active status.
The certificates-relation-joined hook is dispatched before mysqld has finished starting on
that unit. The hook attempts to connect to the MySQL Unix socket, fails with ERROR 2002
(HY000), and the charm enters a deferred-retry loop that never recovers. The unit remains
stuck in "maintenance / Setting up cluster node" for the entire observation window
(the 60-minute juju wait-for timeout plus 14 minutes of continued retrying afterwards).
The other two units (mysql/0 as Primary, mysql/2) and all co-deployed charms
(self-signed-certificates, data-integrator) reach active status without issue.
Deployment Configuration
Charm: mysql, rev 444, channel 8.0/stable
Units: 3 (flag: -n 3)
Co-deployed charms: self-signed-certificates rev 586 (1/stable)
data-integrator rev 380 (latest/edge)
Integrations: self-signed-certificates:certificates -> mysql:certificates
data-integrator:mysql -> mysql:database
MySQL config: CHARM_CHANNEL=8.0/stable, MODEL_NAME=mysql
Timeline of Failure
13:05:13 Machine 1 (mysql/1) provisioning requested by Juju
13:07:13 VM instance 99447039-44c3-445a-b63c-ff0bafef2cb5 started, status ACTIVE
13:07:59 Machine 1 address assigned: 192.168.1.53
13:14:47 mysql/1 charm: "Peer data not yet set. Deferring "
13:14:48 mysql/1 charm: Re-emits StartEvent during certificates-relation-joined
13:14:48 mysql/1 charm: WARNING — "mysqld is not running, skipping flush host cache"
13:14:54 mysql/1 charm: ERROR — socket connection fails, hook fails (see below)
13:15:00 mysql/1 charm: begins retry loop — WARNING every ~5s (943 occurrences total)
14:05:29 juju wait-for 60m timeout expires — unit still in maintenance
14:19:30 Last observed retry — charm still looping 14 minutes after timeout
Error Sequence (from juju controller logsink, model 423a8d80)
unit-mysql-1 2026-03-20 13:14:48 WARNING
certificates:3: mysqld is not running, skipping flush host cache
unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/snap/charmed-mysql/common/var/run/mysqld/mysqld.sock' (2)
unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: Failed to list roles
unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: Uncaught exception while in charm code:
unit-mysql-1 2026-03-20 13:14:54 ERROR (juju.worker.uniter.operation)
hook "certificates-relation-joined" (via hook dispatching script: dispatch)
failed: exit status 1
unit-mysql-1 2026-03-20 13:15:00 WARNING [repeats 943 times until 14:19:30]
Failed to check if local cluster metadata exists
Juju Status at Timeout (14:05:32Z)
App Version Status Scale Charm Channel Rev
data-integrator active 1 data-integrator latest/edge 380
mysql 8.0.44-0ubun... active 3 mysql 8.0/stable 444
self-signed-certificates active 1 self-signed-certificates 1/stable 586
Unit Workload Agent Message
data-integrator/0* active idle
mysql/0* active idle Primary
mysql/1 maintenance executing Setting up cluster node <-- stuck
mysql/2 active idle
self-signed-certificates/0* active idle
Machine State Address Inst id
0 started 192.168.1.219 d50721cd-02be-4eda-86f1-5f27ac83823d (mysql/0)
1 started 192.168.1.53 99447039-44c3-445a-b63c-ff0bafef2cb5 (mysql/1)
2 started 192.168.1.90 1d39784f-2c0b-432c-a448-5bf7808ea107 (mysql/2)
3 started 192.168.1.165 0cbb3131-310b-4407-8976-376f64e6a735 (self-signed-certs/0)
4 started 192.168.1.31 dd583a89-7e21-490c-b472-4ee045719d2f (data-integrator/0)
Root Cause Analysis
The certificates-relation-joined hook is dispatched on mysql/1 while mysqld is still
starting up. The charm code attempts to connect to mysqld via its Unix socket to list
roles, but mysqld is not yet accepting connections. This causes an unhandled exception
(ERROR 2002) and the hook exits with status 1.
After the hook failure the charm enters a deferred-retry loop:
- It continuously re-emits the deferred StartEvent and certificates-relation-joined event
- Each iteration warns "Failed to check if local cluster metadata exists"
- The loop ran at ~5-second intervals for over 64 minutes (943 iterations observed)
- The unit never recovered within the observation window
The issue is that the charm does not adequately guard the certificates-relation-joined
handler against mysqld being unavailable. Other units (mysql/0, mysql/2) did not exhibit
this problem, suggesting they received the hook after mysqld was already running.
This is reproducible in environments where VM provisioning is fast (machines become ACTIVE
within ~2 minutes in this case), giving mysqld less startup time before relation hooks fire.
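The deferral mechanics behind this can be illustrated with a toy model. This is a simplified simulation of how the ops framework re-emits deferred events at the start of the next hook, not the framework's actual implementation; the Dispatcher class and handler names are invented for illustration:

```python
from collections import deque

class Dispatcher:
    """Toy model of ops' deferral queue: deferred events are re-emitted
    at the start of the NEXT hook invocation, before the new event runs."""

    def __init__(self):
        self.deferred = deque()
        self.log = []

    def emit(self, event, handler):
        # Re-emit everything previously deferred, oldest first,
        # BEFORE handling the newly dispatched event.
        pending = list(self.deferred)
        self.deferred.clear()
        for old_event, old_handler in pending:
            self._run(old_event, old_handler)
        self._run(event, handler)

    def _run(self, event, handler):
        self.log.append(event)
        if not handler(event):          # handler returns False -> defer
            self.deferred.append((event, handler))

mysqld_ready = False                    # mysqld still starting up

def on_start(event):
    return mysqld_ready                 # defers until mysqld is up

def on_certs_joined(event):
    return True

d = Dispatcher()
d.emit("start", on_start)               # mysqld down -> StartEvent deferred
d.emit("certificates-relation-joined", on_certs_joined)
# The deferred StartEvent ran again FIRST, inside the certificates
# hook context -- exactly the sequence seen on mysql/1:
print(d.log)   # ['start', 'start', 'certificates-relation-joined']
```

This is why a failure in _on_start can surface as a failed certificates-relation-joined hook: the StartEvent re-runs in that hook's context before the relation handler itself.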
Note: There was a separate, unrelated Nova MessagingTimeout for machine 4 (data-integrator)
during provisioning. That resolved itself after two retries and is not related to this bug.
What is NOT the cause
- Not a timeout/performance issue (the charm actively errors, not just slow)
- Not a resource contention issue on the hypervisor
- Not a network issue (mysql/0 and mysql/2 clustered successfully on the same subnet)
- Not a data-integrator MessagingTimeout issue (that resolved before mysql/1's problem)
Code Analysis (canonical/mysql-operators, commit 6933f80)
The bug was traced through the source code. Three files are involved:
-
  machines/lib/charms/mysql/v0/tls.py — TLS relation handler, lines 92-99:

      def _on_tls_relation_joined(self, event) -> None:
          """Request certificate when TLS relation joined."""
          if not self.charm.unit_initialized():
              event.defer()
              return
          self._request_certificate(None)

  unit_initialized() calls cluster_metadata_exists(), which immediately
  tries to open a MySQL socket connection. If mysqld is not yet running,
  that raises MySQLClusterMetadataExistsError, which is caught, so the
  check returns False and the event is deferred. This handler itself
  therefore defers correctly.

  HOWEVER: the ops framework re-emits ALL previously deferred events at
  the start of any new hook. When certificates-relation-joined fires, the
  earlier deferred StartEvent is re-emitted first (same hook context).
  This triggers _on_start on the non-leader unit.

-
machines/src/charm.py — _on_start handler
  The handler calls workload_initialise() and catches a fixed set of
  exception types:

      try:
          self.workload_initialise()
      except MySQLConfigureMySQLRolesError: ...
      except MySQLConfigureMySQLUsersError: ...
      except MySQLConfigureInstanceError: ...
      except MySQLCreateCustomMySQLDConfigError: ...
      except MySQLDNotRestartedError: ...      (warning only)
      except MySQLComponentInstallError: ...   (warning only)
      except MySQLGetMySQLVersionError: ...    (debug only)

  MySQLSetOperatorUserAndStartMySQLDError is NOT in this list.
  workload_initialise() calls set_operator_user_and_start_mysqld(), which
  can raise MySQLSetOperatorUserAndStartMySQLDError. That becomes an
  uncaught exception ("Uncaught exception while in charm code") and the
  hook exits with status 1.

-
machines/src/mysql_vm_helpers.py — is_mysqld_running and friends
      def is_mysqld_running(self) -> bool:
          return os.path.exists(MYSQLD_SOCK_FILE)

      def wait_until_mysql_connection(self, check_port: bool = True) -> None:
          if not os.path.exists(MYSQLD_SOCK_FILE):
              raise MySQLServiceNotRunningError("MySQL socket file not found")
          if check_port and not self.check_mysqlsh_connection():
              raise MySQLServiceNotRunningError(...)

  set_operator_user_and_start_mysqld() calls
  wait_until_mysql_connection(check_port=False), which only checks that the
  socket FILE exists, not whether mysqld is actually accepting connections.
  If mysqld has created the socket file but is still initializing
  (processing init_file), subsequent SQL calls fail with ERROR 2002.

  Contrast with flush_host_cache(), which HAS the correct guard:

      def flush_host_cache(self) -> None:
          if not self.is_mysqld_running():  # <-- guard
              logger.warning("mysqld is not running, skipping flush host cache")
              return
          ...

  configure_mysql_system_roles() has NO equivalent guard. It calls
  self._instance_client_sock.search_instance_roles() directly, which opens a
  socket connection and fails with ERROR 2002 if mysqld is not ready.
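A readiness probe that actually attempts a connection, rather than testing for the socket file, would close this gap. A minimal sketch (the function name and its use are a suggestion, not part of the charm's current API):

```python
import socket

def mysqld_socket_ready(sock_path: str, timeout: float = 1.0) -> bool:
    """Return True only if a Unix socket at sock_path accepts connections.

    os.path.exists() is not sufficient: mysqld creates the socket file
    before it is ready to serve (e.g. while processing init_file), which
    is exactly the window in which ERROR 2002 occurs.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(sock_path)
        return True
    except OSError:   # FileNotFoundError, ConnectionRefusedError, timeout
        return False
    finally:
        s.close()
```

wait_until_mysql_connection() could perform a check like this even when check_port=False, so SQL operations are only attempted once mysqld is actually accepting connections on the socket.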
Bug summary at the code level
Proximate cause: either MySQLSetOperatorUserAndStartMySQLDError escapes
_on_start (it is absent from the handled exception set), or
configure_mysql_system_roles() connects to the socket without verifying
that mysqld is fully accepting connections and raises an exception type
_on_start does not catch. Either way, the hook exits 1.
Underlying cause: wait_until_mysql_connection(check_port=False) only
tests socket FILE existence, not actual socket readiness. Passing
check_port=True (or adding a separate socket-level connectivity check)
would block until mysqld is truly ready before allowing SQL operations.
Second issue: the retry loop that follows the hook failure repeatedly
calls cluster_metadata_exists() (every ~5s) which tries to connect to
the not-yet-ready socket, printing "Failed to check if local cluster
metadata exists" 943 times over 64 minutes without ever recovering.
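Instead of retrying every ~5 seconds indefinitely, the metadata check could use a bounded retry budget with exponential backoff, and let the charm surface a blocked status once the budget is exhausted. A hedged sketch (the helper name, parameters, and probe are illustrative, not existing charm code):

```python
import time

def retry_with_backoff(probe, *, attempts=8, base=1.0, cap=60.0,
                       sleep=time.sleep):
    """Call probe() until it returns True or the attempt budget runs out.

    The delay doubles each round (1s, 2s, 4s, ...) up to cap. Returns
    True on success, False when attempts are exhausted -- at which point
    the caller can set a blocked status instead of looping forever.
    """
    delay = base
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:          # no sleep after the final attempt
            sleep(min(delay, cap))
            delay *= 2
    return False
```

With something like this wrapping cluster_metadata_exists(), a unit that cannot reach mysqld would go blocked with an actionable message after a bounded number of attempts, rather than logging the same warning 943 times.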
Artifacts
Swift object store (solutions-qa container), prefix: d4685121-759f-4f5a-a1c2-5bfac333d2fd/
Key files:
generated/juju_openstack_controller/juju-crashdump-controller-2026-03-20-14.08.02.tar.gz
-> var/log/juju/logsink.log (all unit charm logs; filter on "423a8d")
-> var/log/juju/models/admin-mysql-423a8d.log (provisioner/firewaller log)
generated/deploy_charm_mysql/juju_status_foundation-openstack_mysql.txt
generated/deploy_charm_mysql/juju_status_foundation-openstack_mysql.json
https://solutions.qa.canonical.com/testruns/d4685121-759f-4f5a-a1c2-5bfac333d2fd