mysql charm — certificates-relation-joined hook fails when mysqld not yet running #189

@marosg42

Description

This is an AI-assisted triage report; I discussed it with Paulo and it looks reasonable.
We encounter this issue from time to time on top of Sunbeam.

======================================================================================

Date observed: 2026-03-20
Charm: mysql, revision 444, channel 8.0/stable
Juju version: 3.6.19
Cloud/substrate: OpenStack (tor3-sqa-sunbeam, cluster_1)

Summary

During a 3-unit MySQL InnoDB cluster deployment, unit mysql/1 never reaches active status.
The certificates-relation-joined hook is dispatched before mysqld has finished starting on
that unit. The hook attempts to connect to the MySQL Unix socket, fails with ERROR 2002
(HY000), and the charm enters a deferred-retry loop that never recovers. The unit remains
stuck in "maintenance / Setting up cluster node" for the entire ~64-minute observation
window: the retry loop ran from the first hook failure at 13:15 through the 60-minute
juju wait-for timeout and for another 14 minutes afterwards.

The other two units (mysql/0 as Primary, mysql/2) and all co-deployed charms
(self-signed-certificates, data-integrator) reach active status without issue.

Deployment Configuration

Charm: mysql, rev 444, channel 8.0/stable
Units: 3 (flag: -n 3)
Co-deployed charms: self-signed-certificates rev 586 (1/stable)
data-integrator rev 380 (latest/edge)
Integrations: self-signed-certificates:certificates -> mysql:certificates
data-integrator:mysql -> mysql:database
MySQL config: CHARM_CHANNEL=8.0/stable, MODEL_NAME=mysql

Timeline of Failure

13:05:13 Machine 1 (mysql/1) provisioning requested by Juju
13:07:13 VM instance 99447039-44c3-445a-b63c-ff0bafef2cb5 started, status ACTIVE
13:07:59 Machine 1 address assigned: 192.168.1.53
13:14:47 mysql/1 charm: "Peer data not yet set. Deferring "
13:14:48 mysql/1 charm: Re-emits StartEvent during certificates-relation-joined
13:14:48 mysql/1 charm: WARNING — "mysqld is not running, skipping flush host cache"
13:14:54 mysql/1 charm: ERROR — socket connection fails, hook fails (see below)
13:15:00 mysql/1 charm: begins retry loop — WARNING every ~5s (943 occurrences total)
14:05:29 juju wait-for 60m timeout expires — unit still in maintenance
14:19:30 Last observed retry — charm still looping 14 minutes after timeout

Error Sequence (from juju controller logsink, model 423a8d80)

unit-mysql-1 2026-03-20 13:14:48 WARNING
certificates:3: mysqld is not running, skipping flush host cache

unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/snap/charmed-mysql/common/var/run/mysqld/mysqld.sock' (2)

unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: Failed to list roles

unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: Uncaught exception while in charm code:

unit-mysql-1 2026-03-20 13:14:54 ERROR (juju.worker.uniter.operation)
hook "certificates-relation-joined" (via hook dispatching script: dispatch)
failed: exit status 1

unit-mysql-1 2026-03-20 13:15:00 WARNING [repeats 943 times until 14:19:30]
Failed to check if local cluster metadata exists

Juju Status at Timeout (14:05:32Z)

App Version Status Scale Charm Channel Rev
data-integrator active 1 data-integrator latest/edge 380
mysql 8.0.44-0ubun... active 3 mysql 8.0/stable 444
self-signed-certificates active 1 self-signed-certificates 1/stable 586

Unit Workload Agent Message
data-integrator/0* active idle
mysql/0* active idle Primary
mysql/1 maintenance executing Setting up cluster node <-- stuck
mysql/2 active idle
self-signed-certificates/0* active idle

Machine State Address Inst id
0 started 192.168.1.219 d50721cd-02be-4eda-86f1-5f27ac83823d (mysql/0)
1 started 192.168.1.53 99447039-44c3-445a-b63c-ff0bafef2cb5 (mysql/1)
2 started 192.168.1.90 1d39784f-2c0b-432c-a448-5bf7808ea107 (mysql/2)
3 started 192.168.1.165 0cbb3131-310b-4407-8976-376f64e6a735 (self-signed-certs/0)
4 started 192.168.1.31 dd583a89-7e21-490c-b472-4ee045719d2f (data-integrator/0)

Root Cause Analysis

The certificates-relation-joined hook is dispatched on mysql/1 while mysqld is still
starting up. The charm code attempts to connect to mysqld via its Unix socket to list
roles, but mysqld is not yet accepting connections. This causes an unhandled exception
(ERROR 2002) and the hook exits with status 1.

After the hook failure the charm enters a deferred-retry loop:

  • It continuously re-emits the deferred StartEvent and certificates-relation-joined event
  • Each iteration warns "Failed to check if local cluster metadata exists"
  • The loop ran at ~5-second intervals for over 64 minutes (943 iterations observed)
  • The unit never recovered within the observation window

The issue is that the charm does not adequately guard the certificates-relation-joined
handler against mysqld being unavailable. Other units (mysql/0, mysql/2) did not exhibit
this problem, suggesting they received the hook after mysqld was already running.

This is reproducible in environments where VM provisioning is fast (machines become ACTIVE
within ~2 minutes in this case), giving mysqld less startup time before relation hooks fire.
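A minimal sketch of the missing guard, assuming a defer-first handler (the names `FakeEvent`, `on_tls_relation_joined`, and the `mysqld_ready` callable are illustrative stand-ins, not the charm's real API):

```python
class FakeEvent:
    """Stand-in for an ops event, recording whether defer() was called."""
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_tls_relation_joined(event, mysqld_ready) -> bool:
    """Defer the hook until mysqld is ready; return True only if work proceeded."""
    if not mysqld_ready():
        event.defer()  # Juju will re-dispatch on a later hook; no exception escapes.
        return False
    # ... safe to request the certificate here ...
    return True
```

The point is that a readiness probe at the top of the handler converts "hook crashes with exit status 1" into "hook defers cleanly".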

Note: There was a separate, unrelated Nova MessagingTimeout for machine 4 (data-integrator)
during provisioning. That resolved itself after two retries and is not related to this bug.

What is NOT the cause

  • Not a timeout/performance issue (the charm actively errors, not just slow)
  • Not a resource contention issue on the hypervisor
  • Not a network issue (mysql/0 and mysql/2 clustered successfully on the same subnet)
  • Not a data-integrator MessagingTimeout issue (that resolved before mysql/1's problem)

Code Analysis (canonical/mysql-operators, commit 6933f80)

The bug was traced through the source code. Three files are involved:

  1. machines/lib/charms/mysql/v0/tls.py — TLS relation handler
    lines 92-99:

    def _on_tls_relation_joined(self, event) -> None:
        """Request certificate when TLS relation joined."""
        if not self.charm.unit_initialized():
            event.defer()
            return
        self._request_certificate(None)

    unit_initialized() calls cluster_metadata_exists() which immediately
    tries to open a MySQL socket connection. If mysqld is not running yet
    that raises MySQLClusterMetadataExistsError → caught → returns False →
    event deferred. So this handler itself defers correctly.

    HOWEVER: the ops framework re-emits ALL previously deferred events at
    the start of any new hook. When certificates-relation-joined fires, the
    earlier deferred StartEvent is re-emitted first (same hook context).
    This triggers _on_start on the non-leader unit.
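The replay behavior can be illustrated with a toy dispatcher (this is a simplified model, not the real ops implementation): a deferred event is re-emitted before the new event in the next hook context.

```python
from collections import deque


class MiniDispatcher:
    """Toy model of ops' deferral: deferred events replay before the new event."""

    def __init__(self):
        self.deferred = deque()

    def dispatch(self, new_event, handler):
        # Replay everything deferred earlier, in order, before the new event.
        replay = list(self.deferred)
        self.deferred.clear()
        order = []
        for ev in replay + [new_event]:
            order.append(ev)
            if not handler(ev):  # handler returns False to defer
                self.deferred.append(ev)
        return order
```

This is why a crash attributed to `certificates-relation-joined` can actually originate in the replayed `StartEvent` handler.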

  2. machines/src/charm.py — _on_start handler
    The handler calls workload_initialise() and catches a fixed set of
    exception types:

    try:
        self.workload_initialise()
    except MySQLConfigureMySQLRolesError: ...
    except MySQLConfigureMySQLUsersError: ...
    except MySQLConfigureInstanceError: ...
    except MySQLCreateCustomMySQLDConfigError: ...
    except MySQLDNotRestartedError: ...
    except MySQLComponentInstallError: ...   # warning only
    except MySQLGetMySQLVersionError: ...    # debug only

    MySQLSetOperatorUserAndStartMySQLDError is NOT in this list.
    workload_initialise() calls set_operator_user_and_start_mysqld() which
    can raise MySQLSetOperatorUserAndStartMySQLDError — this would become
    an uncaught exception → "Uncaught exception while in charm code" →
    exit status 1.
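A sketch of the fix implied here, with the exception name taken from the report and the surrounding handler heavily simplified (the real `_on_start` does far more than this):

```python
class MySQLSetOperatorUserAndStartMySQLDError(Exception):
    """Raised when mysqld cannot be started with the operator user (per the report)."""


def workload_initialise():
    # Stand-in that fails the way the report describes.
    raise MySQLSetOperatorUserAndStartMySQLDError("mysqld not yet accepting connections")


def on_start(defer) -> str:
    try:
        workload_initialise()
    except MySQLSetOperatorUserAndStartMySQLDError:
        # Previously missing: without this clause the exception escapes the
        # handler, the dispatch script sees it, and the hook exits with status 1.
        defer()
        return "deferred"
    return "started"
```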

  3. machines/src/mysql_vm_helpers.py — is_mysqld_running and friends

    def is_mysqld_running(self) -> bool:
        return os.path.exists(MYSQLD_SOCK_FILE)

    def wait_until_mysql_connection(self, check_port: bool = True) -> None:
        if not os.path.exists(MYSQLD_SOCK_FILE):
            raise MySQLServiceNotRunningError("MySQL socket file not found")
        if check_port and not self.check_mysqlsh_connection():
            raise MySQLServiceNotRunningError(...)

    set_operator_user_and_start_mysqld() calls
    wait_until_mysql_connection(check_port=False), which only checks for
    socket FILE existence — not whether mysqld is actually accepting
    connections. If mysqld creates the socket file but is still
    initializing (processing init_file), subsequent SQL calls fail with
    ERROR 2002.

    Contrast with flush_host_cache(), which HAS the correct guard:

    def flush_host_cache(self) -> None:
        if not self.is_mysqld_running():  # <-- guard
            logger.warning("mysqld is not running, skipping flush host cache")
            return
        ...

    configure_mysql_system_roles() has NO equivalent guard. It calls
    self._instance_client_sock.search_instance_roles() directly, which
    opens a socket connection and fails with ERROR 2002 if mysqld is not
    ready.

Bug summary at the code level

Proximate cause: MySQLSetOperatorUserAndStartMySQLDError is not handled
by _on_start, OR configure_mysql_system_roles() connects to the socket
without verifying mysqld is fully accepting connections, raising an
exception type not caught by _on_start → hook exits 1.

Underlying cause: wait_until_mysql_connection(check_port=False) only
tests socket FILE existence, not actual socket readiness. Passing
check_port=True (or adding a separate socket-level connectivity check)
would block until mysqld is truly ready before allowing SQL operations.

Second issue: the retry loop that follows the hook failure repeatedly
calls cluster_metadata_exists() (every ~5s) which tries to connect to
the not-yet-ready socket, printing "Failed to check if local cluster
metadata exists" 943 times over 64 minutes without ever recovering.
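A bounded retry with backoff would cap this loop. A minimal sketch, assuming a `check` callable that probes readiness (the helper name and attempt budget are illustrative, not the charm's actual retry mechanism):

```python
import time


def wait_for(check, attempts: int = 8, base_delay: float = 1.0) -> bool:
    """Retry `check` with exponential backoff instead of looping indefinitely.

    Returns True on success, or False once the attempt budget is exhausted,
    so the caller can surface a blocked status rather than retrying for an
    hour at a fixed ~5-second cadence.
    """
    delay = base_delay
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
            delay *= 2
    return False
```

With 8 attempts starting at 1 s, the total wait tops out at roughly two minutes, after which the unit can report a clear blocked state instead of logging the same warning 943 times.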

Artifacts

Swift object store (solutions-qa container), prefix: d4685121-759f-4f5a-a1c2-5bfac333d2fd/

Key files:
generated/juju_openstack_controller/juju-crashdump-controller-2026-03-20-14.08.02.tar.gz
-> var/log/juju/logsink.log (all unit charm logs; filter on "423a8d")
-> var/log/juju/models/admin-mysql-423a8d.log (provisioner/firewaller log)
generated/deploy_charm_mysql/juju_status_foundation-openstack_mysql.txt
generated/deploy_charm_mysql/juju_status_foundation-openstack_mysql.json

https://solutions.qa.canonical.com/testruns/d4685121-759f-4f5a-a1c2-5bfac333d2fd
