mysql charm — certificates-relation-joined hook fails when mysqld not yet running #189

@marosg42

Description

This is an AI-assisted triage report; I discussed it with Paulo and it looks reasonable.
We encounter this issue from time to time on top of Sunbeam.

======================================================================================

Date observed: 2026-03-20
Charm: mysql, revision 444, channel 8.0/stable
Juju version: 3.6.19
Cloud/substrate: OpenStack (tor3-sqa-sunbeam, cluster_1)

Summary

During a 3-unit MySQL InnoDB cluster deployment, unit mysql/1 never reaches active status.
The certificates-relation-joined hook is dispatched before mysqld has finished starting on
that unit. The hook attempts to connect to the MySQL Unix socket, fails with ERROR 2002
(HY000), and the charm enters a deferred-retry loop that never recovers. The unit remains
stuck in "maintenance / Setting up cluster node" for the entire ~64-minute observation
window: the retry loop ran from the first hook failure at 13:15 through the 60-minute
juju wait-for timeout and for another 14 minutes afterwards.

The other two units (mysql/0 as Primary, mysql/2) and all co-deployed charms
(self-signed-certificates, data-integrator) reach active status without issue.

Deployment Configuration

Charm: mysql, rev 444, channel 8.0/stable
Units: 3 (flag: -n 3)
Co-deployed charms: self-signed-certificates rev 586 (1/stable)
data-integrator rev 380 (latest/edge)
Integrations: self-signed-certificates:certificates -> mysql:certificates
data-integrator:mysql -> mysql:database
MySQL config: CHARM_CHANNEL=8.0/stable, MODEL_NAME=mysql

Timeline of Failure

13:05:13 Machine 1 (mysql/1) provisioning requested by Juju
13:07:13 VM instance 99447039-44c3-445a-b63c-ff0bafef2cb5 started, status ACTIVE
13:07:59 Machine 1 address assigned: 192.168.1.53
13:14:47 mysql/1 charm: "Peer data not yet set. Deferring "
13:14:48 mysql/1 charm: Re-emits StartEvent during certificates-relation-joined
13:14:48 mysql/1 charm: WARNING — "mysqld is not running, skipping flush host cache"
13:14:54 mysql/1 charm: ERROR — socket connection fails, hook fails (see below)
13:15:00 mysql/1 charm: begins retry loop — WARNING every ~5s (943 occurrences total)
14:05:29 juju wait-for 60m timeout expires — unit still in maintenance
14:19:30 Last observed retry — charm still looping 14 minutes after timeout

Error Sequence (from juju controller logsink, model 423a8d80)

unit-mysql-1 2026-03-20 13:14:48 WARNING
certificates:3: mysqld is not running, skipping flush host cache

unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/snap/charmed-mysql/common/var/run/mysqld/mysqld.sock' (2)

unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: Failed to list roles

unit-mysql-1 2026-03-20 13:14:54 ERROR
certificates:3: Uncaught exception while in charm code:

unit-mysql-1 2026-03-20 13:14:54 ERROR (juju.worker.uniter.operation)
hook "certificates-relation-joined" (via hook dispatching script: dispatch)
failed: exit status 1

unit-mysql-1 2026-03-20 13:15:00 WARNING [repeats 943 times until 14:19:30]
Failed to check if local cluster metadata exists

Juju Status at Timeout (14:05:32Z)

App Version Status Scale Charm Channel Rev
data-integrator active 1 data-integrator latest/edge 380
mysql 8.0.44-0ubun... active 3 mysql 8.0/stable 444
self-signed-certificates active 1 self-signed-certificates 1/stable 586

Unit Workload Agent Message
data-integrator/0* active idle
mysql/0* active idle Primary
mysql/1 maintenance executing Setting up cluster node <-- stuck
mysql/2 active idle
self-signed-certificates/0* active idle

Machine State Address Inst id
0 started 192.168.1.219 d50721cd-02be-4eda-86f1-5f27ac83823d (mysql/0)
1 started 192.168.1.53 99447039-44c3-445a-b63c-ff0bafef2cb5 (mysql/1)
2 started 192.168.1.90 1d39784f-2c0b-432c-a448-5bf7808ea107 (mysql/2)
3 started 192.168.1.165 0cbb3131-310b-4407-8976-376f64e6a735 (self-signed-certs/0)
4 started 192.168.1.31 dd583a89-7e21-490c-b472-4ee045719d2f (data-integrator/0)

Root Cause Analysis

The certificates-relation-joined hook is dispatched on mysql/1 while mysqld is still
starting up. The charm code attempts to connect to mysqld via its Unix socket to list
roles, but mysqld is not yet accepting connections. This causes an unhandled exception
(ERROR 2002) and the hook exits with status 1.

After the hook failure the charm enters a deferred-retry loop:

  • It continuously re-emits the deferred StartEvent and certificates-relation-joined event
  • Each iteration warns "Failed to check if local cluster metadata exists"
  • The loop ran at ~5-second intervals for over 64 minutes (943 iterations observed)
  • The unit never recovered within the observation window

The issue is that the charm does not adequately guard the certificates-relation-joined
handler against mysqld being unavailable. Other units (mysql/0, mysql/2) did not exhibit
this problem, suggesting they received the hook after mysqld was already running.

This is reproducible in environments where VM provisioning is fast (machines become ACTIVE
within ~2 minutes in this case), giving mysqld less startup time before relation hooks fire.
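A minimal sketch of the missing guard, assuming a defer-first handler (the names `FakeEvent`, `on_tls_relation_joined`, and the `mysqld_ready` callable are illustrative stand-ins, not the charm's real API):

```python
class FakeEvent:
    """Stand-in for an ops event, recording whether defer() was called."""
    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_tls_relation_joined(event, mysqld_ready) -> bool:
    """Defer the hook until mysqld is ready; return True only if work proceeded."""
    if not mysqld_ready():
        event.defer()  # Juju will re-dispatch on a later hook; no exception escapes.
        return False
    # ... safe to request the certificate here ...
    return True
```

The point is that a readiness probe at the top of the handler converts "hook crashes with exit status 1" into "hook defers cleanly".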

Note: There was a separate, unrelated Nova MessagingTimeout for machine 4 (data-integrator)
during provisioning. That resolved itself after two retries and is not related to this bug.

What is NOT the cause

  • Not a timeout/performance issue (the charm actively errors, not just slow)
  • Not a resource contention issue on the hypervisor
  • Not a network issue (mysql/0 and mysql/2 clustered successfully on the same subnet)
  • Not a data-integrator MessagingTimeout issue (that resolved before mysql/1's problem)

Code Analysis (canonical/mysql-operators, commit 6933f80)

The bug was traced through the source code. Three files are involved:

  1. machines/lib/charms/mysql/v0/tls.py — TLS relation handler
    lines 92-99:

    def _on_tls_relation_joined(self, event) -> None:
        """Request certificate when TLS relation joined."""
        if not self.charm.unit_initialized():
            event.defer()
            return
        self._request_certificate(None)

    unit_initialized() calls cluster_metadata_exists() which immediately
    tries to open a MySQL socket connection. If mysqld is not running yet
    that raises MySQLClusterMetadataExistsError → caught → returns False →
    event deferred. So this handler itself defers correctly.

    HOWEVER: the ops framework re-emits ALL previously deferred events at
    the start of any new hook. When certificates-relation-joined fires, the
    earlier deferred StartEvent is re-emitted first (same hook context).
    This triggers _on_start on the non-leader unit.
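The replay behavior can be illustrated with a toy dispatcher (this is a simplified model, not the real ops implementation): a deferred event is re-emitted before the new event in the next hook context.

```python
from collections import deque


class MiniDispatcher:
    """Toy model of ops' deferral: deferred events replay before the new event."""

    def __init__(self):
        self.deferred = deque()

    def dispatch(self, new_event, handler):
        # Replay everything deferred earlier, in order, before the new event.
        replay = list(self.deferred)
        self.deferred.clear()
        order = []
        for ev in replay + [new_event]:
            order.append(ev)
            if not handler(ev):  # handler returns False to defer
                self.deferred.append(ev)
        return order
```

This is why a crash attributed to `certificates-relation-joined` can actually originate in the replayed `StartEvent` handler.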

  2. machines/src/charm.py — _on_start handler
    The handler calls workload_initialise() and catches a fixed set of
    exception types:

    try:
        self.workload_initialise()
    except MySQLConfigureMySQLRolesError: ...
    except MySQLConfigureMySQLUsersError: ...
    except MySQLConfigureInstanceError: ...
    except MySQLCreateCustomMySQLDConfigError: ...
    except MySQLDNotRestartedError: ...
    except MySQLComponentInstallError: ...   # warning only
    except MySQLGetMySQLVersionError: ...    # debug only

    MySQLSetOperatorUserAndStartMySQLDError is NOT in this list.
    workload_initialise() calls set_operator_user_and_start_mysqld() which
    can raise MySQLSetOperatorUserAndStartMySQLDError — this would become
    an uncaught exception → "Uncaught exception while in charm code" →
    exit status 1.
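A sketch of the fix implied here, with the exception name taken from the report and the surrounding handler heavily simplified (the real `_on_start` does far more than this):

```python
class MySQLSetOperatorUserAndStartMySQLDError(Exception):
    """Raised when mysqld cannot be started with the operator user (per the report)."""


def workload_initialise():
    # Stand-in that fails the way the report describes.
    raise MySQLSetOperatorUserAndStartMySQLDError("mysqld not yet accepting connections")


def on_start(defer) -> str:
    try:
        workload_initialise()
    except MySQLSetOperatorUserAndStartMySQLDError:
        # Previously missing: without this clause the exception escapes the
        # handler, the dispatch script sees it, and the hook exits with status 1.
        defer()
        return "deferred"
    return "started"
```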

  3. machines/src/mysql_vm_helpers.py — is_mysqld_running and friends

    def is_mysqld_running(self) -> bool:
        return os.path.exists(MYSQLD_SOCK_FILE)

    def wait_until_mysql_connection(self, check_port: bool = True) -> None:
        if not os.path.exists(MYSQLD_SOCK_FILE):
            raise MySQLServiceNotRunningError("MySQL socket file not found")
        if check_port and not self.check_mysqlsh_connection():
            raise MySQLServiceNotRunningError(...)

    set_operator_user_and_start_mysqld() calls
    wait_until_mysql_connection(check_port=False), which only checks for
    socket FILE existence — not whether mysqld is actually accepting
    connections. If mysqld creates the socket file but is still
    initializing (processing init_file), subsequent SQL calls fail with
    ERROR 2002.

    Contrast with flush_host_cache(), which HAS the correct guard:

    def flush_host_cache(self) -> None:
        if not self.is_mysqld_running():  # <-- guard
            logger.warning("mysqld is not running, skipping flush host cache")
            return
        ...

    configure_mysql_system_roles() has NO equivalent guard. It calls
    self._instance_client_sock.search_instance_roles() directly, which
    opens a socket connection and fails with ERROR 2002 if mysqld is not
    ready.

Bug summary at the code level

Proximate cause: MySQLSetOperatorUserAndStartMySQLDError is not handled
by _on_start, OR configure_mysql_system_roles() connects to the socket
without verifying mysqld is fully accepting connections, raising an
exception type not caught by _on_start → hook exits 1.

Underlying cause: wait_until_mysql_connection(check_port=False) only
tests socket FILE existence, not actual socket readiness. Passing
check_port=True (or adding a separate socket-level connectivity check)
would block until mysqld is truly ready before allowing SQL operations.

Second issue: the retry loop that follows the hook failure repeatedly
calls cluster_metadata_exists() (every ~5s) which tries to connect to
the not-yet-ready socket, printing "Failed to check if local cluster
metadata exists" 943 times over 64 minutes without ever recovering.
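A bounded retry with backoff would cap this loop. A minimal sketch, assuming a `check` callable that probes readiness (the helper name and attempt budget are illustrative, not the charm's actual retry mechanism):

```python
import time


def wait_for(check, attempts: int = 8, base_delay: float = 1.0) -> bool:
    """Retry `check` with exponential backoff instead of looping indefinitely.

    Returns True on success, or False once the attempt budget is exhausted,
    so the caller can surface a blocked status rather than retrying for an
    hour at a fixed ~5-second cadence.
    """
    delay = base_delay
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
            delay *= 2
    return False
```

With 8 attempts starting at 1 s, the total wait tops out at roughly two minutes, after which the unit can report a clear blocked state instead of logging the same warning 943 times.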

Artifacts

Swift object store (solutions-qa container), prefix: d4685121-759f-4f5a-a1c2-5bfac333d2fd/

Key files:
generated/juju_openstack_controller/juju-crashdump-controller-2026-03-20-14.08.02.tar.gz
-> var/log/juju/logsink.log (all unit charm logs; filter on "423a8d")
-> var/log/juju/models/admin-mysql-423a8d.log (provisioner/firewaller log)
generated/deploy_charm_mysql/juju_status_foundation-openstack_mysql.txt
generated/deploy_charm_mysql/juju_status_foundation-openstack_mysql.json

https://solutions.qa.canonical.com/testruns/d4685121-759f-4f5a-a1c2-5bfac333d2fd
