Commit 28a3bd7
feat: [DPE-7404] promote to primary on unit scope (#646)
* port promotion to primary to k8s charm
* lib sync from VM
* test wait for failure scenario
* please new linting rules and libs bump
* missing one
* add placeholder function for followup PR
* remove placeholder
* locking capabilities for unit rejoin
* update parameters
* fix dependency build with pinned version
* merge leftover
* git checkout origin/main -- poetry.lock && poetry lock
* Include docs

Co-authored-by: Carl Csaposs <[email protected]>
1 parent 749385a commit 28a3bd7

File tree

12 files changed: +367 -18 lines changed

actions.yaml

Lines changed: 13 additions & 3 deletions

```diff
@@ -27,6 +27,7 @@ set-password:
       type: string
       description: The username, the default value 'root'.
         Possible values - root, serverconfig, clusteradmin.
+      enum: [root, serverconfig, clusteradmin]
     password:
       type: string
       description: The password will be auto-generated if this option is not specified.
@@ -77,15 +78,24 @@ create-replication:
 
 promote-to-primary:
   description: |
-    Promotes this cluster to become the primary in the cluster-set. Used for safe switchover or failover.
-    Can only be run against the charm leader unit of a standby cluster.
+    Promotes the unit or cluster to become the primary in the cluster or cluster-set, depending on
+    the scope (unit or cluster). Used for safe switchover or failover.
+    When in cluster scope, can only be run against the charm leader unit of a standby cluster.
   params:
+    scope:
+      type: string
+      description: Whether to promote a unit or a cluster. Must be set to either `unit` or `cluster`.
+      enum: [unit, cluster]
     force:
       type: boolean
       default: False
       description: |
-        Use force when previous primary is unreachable (failover). Will invalidate previous
+        For cluster scope, use force when previous primary is unreachable (failover). Will invalidate previous
         primary.
+        For unit scope, use force to force quorum from the current unit. Note that this operation is DANGEROUS
+        as it can create a split-brain if incorrectly used and should be considered a last resort. Make
+        absolutely sure that there are no partitions of this group that are still operating somewhere in
+        the network, but not accessible from your location
 
 recreate-cluster:
   description: |
```
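The `scope` parameter makes the action select between four distinct operations. A minimal sketch of how the parameter combinations map to operations — the function and operation names here are illustrative assumptions, not the charm's actual implementation:

```python
# Sketch of scope/force dispatch for promote-to-primary.
# Names are hypothetical; the real charm code differs.
VALID_SCOPES = {"unit", "cluster"}


def route_promote_to_primary(scope: str, force: bool = False) -> str:
    """Return which operation the action parameters select."""
    if scope not in VALID_SCOPES:
        # Mirrors the `enum: [unit, cluster]` constraint in actions.yaml.
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}")
    if scope == "cluster":
        # Cluster scope: safe switchover, or failover when force=True.
        return "cluster-failover" if force else "cluster-switchover"
    # Unit scope: in-cluster primary change, or forced quorum (dangerous).
    return "force-quorum" if force else "unit-switchover"
```

This mirrors the documented semantics: `force` means failover at cluster scope, but forced quorum recovery at unit scope.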

docs/how-to/cross-regional-async-replication/switchover-failover.md

Lines changed: 4 additions & 3 deletions

````diff
@@ -9,7 +9,7 @@ Make sure both `Rome` and `Lisbon` Clusters are deployed using the [Async Deploy
 Assuming `Rome` is currently `Primary` and you want to promote `Lisbon` to be new primary:
 
 ```shell
-juju run -m lisbon db2/leader promote-to-primary
+juju run -m lisbon db2/leader promote-to-primary scope=cluster
 ```
 
 `Rome` will be converted to `StandBy` member.
@@ -25,9 +25,10 @@ It should ONLY be executed if Primary cluster is no longer exist (i.e. it is los
 Assuming `Rome` was a `Primary` (before we lost the cluster `Rome`) and you want to promote `Lisbon` to be the new primary:
 
 ```shell
-juju run -m lisbon db2/leader promote-to-primary force=True
+juju run -m lisbon db2/leader promote-to-primary scope=cluster force=True
 ```
 
 ```{caution}
 `force=True` will cause the old primary to be invalidated.
-```
+```
+
````
docs/how-to/index.md

Lines changed: 3 additions & 1 deletion

````diff
@@ -24,6 +24,7 @@ Scale replicas <scale-replicas>
 Manage passwords <manage-passwords>
 Enable TLS <enable-tls>
 External network access <external-network-access>
+Primary switchover <primary-switchover>
 ```
 
 ## Back up and restore
@@ -79,4 +80,5 @@ Development <development/index>
 :hidden:
 
 Contribute <contribute>
-```
+```
+
````

docs/how-to/primary-switchover.md

Lines changed: 20 additions & 0 deletions

````diff
@@ -0,0 +1,20 @@
+# How to do a primary switchover
+
+A user may want to change the primary in a MySQL cluster to improve
+performance, enable maintenance, recover from failure, or balance load across
+nodes.
+
+On a healthy cluster, the primary can be changed by running the `promote-to-primary` action with
+parameter `scope` set to `unit` on the unit that should become the new primary.
+
+```shell
+juju run-action mysql/1 promote-to-primary scope=unit
+```
+
+In this example, the unit `mysql/1` will become the new primary. The previous primary will become a
+secondary.
+
+```{caution}
+The `promote-to-primary` action can be used in cluster scope, when using async replication.
+Check [Switchover / Failover](cross-regional-async-replication/switchover-failover) for more information.
+```
````

docs/reference/troubleshooting/index.md

Lines changed: 6 additions & 4 deletions

````diff
@@ -10,7 +10,7 @@ See [](/reference/troubleshooting/known-scenarios.md) for specific operational i
 
 ## Check status
 
-The first troubleshooting step is to run `juju status` and check the statuses and messages of all applications and units. 
+The first troubleshooting step is to run `juju status` and check the statuses and messages of all applications and units.
 
 See [](/reference/charm-statuses) for additional recommendations based on status.
 
@@ -47,7 +47,7 @@ See [Juju logs documentation](https://juju.is/docs/juju/log) to learn more about
 
 Check the operator [architecture](/explanation/architecture) first to be familiar with the `charm` and `workload` containers.
 
-Make sure both containers are `Running` and `Ready` to continue troubleshooting inside the charm. 
+Make sure both containers are `Running` and `Ready` to continue troubleshooting inside the charm.
 
 To describe the running pod, use the following command (where `0` is a Juju unit id):
 
@@ -99,6 +99,7 @@ To enter the `workload` container, run:
 ```shell
 juju ssh --container mysql mysql-k8s/0 bash
 ```
+
 You can check the list of running processes and Pebble plan:
 
 ```shell
@@ -114,7 +115,7 @@ mysql 70 0.0 0.0 2888 1884 ? S 21:14 0:00 /bin/sh /usr/
 mysql 366 2.4 7.2 26711784 2394252 ? Sl 21:14 0:10 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --log-error=/var/log/mysql/error.log --pid-file=mysql-k8s-0.pid
 ```
 
-The list of running Pebble services will dependson whether the charm is integrated with [COS](/how-to/monitoring-cos/enable-monitoring) and/or has [backup](/how-to/back-up-and-restore/create-a-backup) functionality. 
+The list of running Pebble services will dependson whether the charm is integrated with [COS](/how-to/monitoring-cos/enable-monitoring) and/or has [backup](/how-to/back-up-and-restore/create-a-backup) functionality.
 
 The Pebble and its service `mysqld_safe` must always be enabled and currently running (the Linux processes `pebble`, `mysqld_safe` and `mysqld`).
 
@@ -159,7 +160,7 @@ Continue troubleshooting your database/SQL related issues from here.
 
 [Contact us](/reference/contacts) if you cannot determinate the source of your issue, or if you'd like to help us improve this document.
 
-## Installing extra software:
+## Installing extra software
 
 **We do not recommend installing any additionally software** as it may affect the stability and produce anomalies which is hard to troubleshoot and fix.
 
@@ -178,4 +179,5 @@ root@mysql-k8s-0:/#
 :titlesonly:
 
 Known scenarios <known-scenarios>
+Recovering from quorum loss <recover-from-quorum-loss>
 ```
````
Lines changed: 101 additions & 0 deletions

````diff
@@ -0,0 +1,101 @@
+# Recovering from quorum loss
+
+Quorum loss in MySQL happens when the majority of nodes (the quorum) required to make decisions and
+maintain consistency is no longer available. This can happen due to network issues, node failures,
+or other disruptions. When this occurs, the cluster may become unavailable or enter a read-only
+state.
+
+Although the charm cannot automatically recover from quorum loss, you can take the following steps
+to manually recover the cluster.
+
+```{warning}
+Recovery from quorum loss should be performed with caution, as it can impact the availability and
+cause loss of data.
+```
+
+## Ensure the cluster is in no-quorum state
+
+A quorum loss will typically look like this in the juju status output:
+
+```
+Model    Controller  Cloud/Region  Version  SLA          Timestamp
+mymodel  localhost   default       3.6.8    unsupported  17:52:19Z
+
+App    Version                  Status   Scale  Charm      Channel   Rev  Address        Exposed  Message
+mysql  8.0.42-0ubuntu0.22.04.2  waiting      3  mysql-k8s  8.0/edge  279  10.152.183.61  no       waiting for units to settle down
+
+Unit      Workload     Agent  Address     Ports  Message
+mysql/0*  maintenance  idle   10.1.2.48          offline
+mysql/1   maintenance  idle   10.1.0.195         offline
+mysql/2   active       idle   10.1.1.81
+```
+
+From an active unit, check the cluster status with:
+
+```shell
+juju run mysql/2 get-cluster-status
+```
+
+Which will output the current status of the cluster.
+
+```
+Running operation 17 with 1 task
+  - task 18 on unit-mysql-2
+
+Waiting for task 18...
+status:
+  clustername: cluster-3eab807dee6797402ecfc52b5a84d15b
+  clusterrole: primary
+  defaultreplicaset:
+    name: default
+    primary: mysql-0.mysql-endpoints.m3.svc.cluster.local.:3306
+    ssl: required
+    status: no_quorum
+    statustext: cluster has no quorum as visible from 'mysql-2.mysql-endpoints.m3.svc.cluster.local.:3306'
+      and cannot process write transactions. 2 members are not active.
+    topology:
+      mysql-0:
+        address: mysql-0.mysql-endpoints.m3.svc.cluster.local.:3306
+        instanceerrors: '[''note: group_replication is stopped.'']'
+        memberrole: primary
+        memberstate: offline
+        mode: n/a
+        role: ha
+        status: unreachable
+        version: 8.0.42
+      mysql-1:
+        address: mysql-1.mysql-endpoints.m3.svc.cluster.local.:3306
+        instanceerrors: '[''note: group_replication is stopped.'']'
+        memberrole: secondary
+        memberstate: offline
+        mode: n/a
+        role: ha
+        status: unreachable
+        version: 8.0.42
+      mysql-2:
+        address: mysql-2.mysql-endpoints.m3.svc.cluster.local.:3306
+        memberrole: secondary
+        mode: r/o
+        replicationlagfromimmediatesource: ""
+        replicationlagfromoriginalsource: ""
+        role: ha
+        status: online
+        version: 8.0.42
+    topologymode: single-primary
+  domainname: cluster-set-3eab807dee6797402ecfc52b5a84d15b
+  groupinformationsourcemember: mysql-2.mysql-endpoints.m3.svc.cluster.local.:3306
+  success: "True"
+```
+
+Note from the output, we can see that the cluster is in a no-quorum state, with `status: no_quorum`.
+
+## Recover the cluster from the active unit
+
+Using the available active unit, run the action:
+
+```shell
+juju run mysql/2 promote-to-primary scope=unit force=true
+```
+
+The unit will become the new primary. Other offline units, if reachable, will rejoin automatically on the follow up `update-status` events.
````
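The manual check in the new how-to — "is the replica set in `no_quorum`, and is there an online member to promote?" — can be expressed mechanically against the `get-cluster-status` output. A sketch, with field names taken from the YAML above (the helper itself is an assumption, not part of the charm):

```python
def needs_forced_promotion(status: dict) -> bool:
    """Return True when the cluster reports no quorum and at least one
    online member exists to run promote-to-primary against.

    `status` mirrors the structure of the get-cluster-status output.
    """
    replicaset = status["defaultreplicaset"]
    if replicaset["status"] != "no_quorum":
        # Healthy (or differently broken) cluster: forced quorum not needed.
        return False
    # An online member is required to host the forced promotion.
    return any(
        member.get("status") == "online"
        for member in replicaset["topology"].values()
    )
```

Feeding it the status shown above would return True, since `mysql-2` is still online while the replica set reports `no_quorum`.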

poetry.lock

Lines changed: 17 additions & 2 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 0 deletions

```diff
@@ -67,6 +67,7 @@ kubernetes = "^27.2.0"
 allure-pytest = "^2.13.2"
 allure-pytest-default-results = "^0.1.2"
 pytest-asyncio = "^0.21.1"
+jubilant = "^1.0.1"
 
 [tool.coverage.run]
 branch = true
```

src/charm.py

Lines changed: 27 additions & 5 deletions

```diff
@@ -42,6 +42,7 @@
     MySQLLockAcquisitionError,
     MySQLNoMemberStateError,
     MySQLRebootFromCompleteOutageError,
+    MySQLRejoinInstanceToClusterError,
     MySQLServiceNotRunningError,
     MySQLSetClusterPrimaryError,
     MySQLUnableToGetMemberStateError,
@@ -318,10 +319,6 @@ def text_logs(self) -> list:
 
         return text_logs
 
-    def update_endpoints(self) -> None:
-        """Temp placeholder."""
-        pass
-
     def unit_initialized(self, raise_exceptions: bool = False) -> bool:
         """Return whether a unit is started.
 
@@ -948,7 +945,7 @@ def _execute_manual_rejoin(self) -> None:
         It is supposed to be called when the MySQL 8.0.21+ auto-rejoin attempts have been exhausted,
         on an OFFLINE replica that still belongs to the cluster
         """
-        if not self._mysql.is_instance_in_cluster(self.unit_label):
+        if not self._mysql.instance_belongs_to_cluster(self.unit_label):
             logger.warning("Instance does not belong to the cluster. Cannot perform manual rejoin")
             return
 
@@ -957,15 +954,38 @@ def _execute_manual_rejoin(self) -> None:
             logger.warning("Instance does not have ONLINE peers. Cannot perform manual rejoin")
             return
 
+        # add random delay to mitigate collisions when multiple units are rejoining
+        # due the difference between the time we test for locks and acquire them
+        # Not used for cryptographic purpose
+        sleep(random.uniform(0, 1.5))  # noqa: S311
+
+        if self._mysql.are_locks_acquired(from_instance=cluster_primary):
+            logger.info("waiting: cluster lock is held")
+            return
+        try:
+            self._mysql.rejoin_instance_to_cluster(
+                unit_address=self.unit_address,
+                unit_label=self.unit_label,
+                from_instance=cluster_primary,
+            )
+        except MySQLRejoinInstanceToClusterError:
+            logger.warning("Can't rejoin instance to cluster. Falling back to remove and add.")
+
         self._mysql.remove_instance(
             unit_label=self.unit_label,
+            auto_dissolve=False,
         )
         self._mysql.add_instance_to_cluster(
             instance_address=self.unit_address,
             instance_unit_label=self.unit_label,
             from_instance=cluster_primary,
         )
 
+    def update_endpoints(self) -> None:
+        """Update the endpoints for the database relation."""
+        self.database_relation._configure_endpoints(None)
+        self._on_update_status(None)
+
     def _is_cluster_blocked(self) -> bool:
         """Performs cluster state checks for the update-status handler.
 
@@ -1031,6 +1051,8 @@ def _set_app_status(self) -> None:
             return
 
         if not primary_address:
+            logger.error("Cluster has no primary. Check cluster status on online units.")
+            self.app.status = MaintenanceStatus("Cluster has no primary.")
             return
 
         if "s3-block-message" in self.app_peer_data:
```
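The random pre-check delay added to `_execute_manual_rejoin` is a standard jitter technique: units that wake on the same event would otherwise test the cluster lock at the same instant and collide in the window between testing and acquiring it. In isolation the pattern looks like this — a sketch only, with stand-in callables rather than the charm's real lock API:

```python
import random
import time


def attempt_with_jitter(locks_held, do_rejoin, max_jitter: float = 1.5) -> bool:
    """Sleep a random interval, then back off if the cluster lock is held.

    Jitter de-synchronises units that woke simultaneously, shrinking the
    race window between the lock test and lock acquisition.  `locks_held`
    and `do_rejoin` are illustrative stand-ins for the charm's helpers.
    Randomness is not used for cryptographic purposes.
    """
    time.sleep(random.uniform(0, max_jitter))
    if locks_held():
        return False  # another unit holds the lock; retry on a later event
    do_rejoin()
    return True
```

A unit that finds the lock held simply gives up and relies on the next dispatched event to retry, which is cheaper and safer than blocking inside the hook.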
