Fetch up-to-date gcomm members list during a failover #348

dciabrin · 2025-07-18T14:51:46Z

The operator script that implements service endpoint failover
contains internal logic to probe the up-to-date state of the
gcomm cluster. This probe runs when the script starts, and again
whenever a command fails and is retried.

The list of members was incorrectly extracted from a mysql
table, which is not guaranteed to be up-to-date when, for example,
a node disappears from the cluster due to a network partition.

Instead, we must rely on the mysql status, which always exposes
the up-to-date gcomm state, in particular the members that
are still connected to the primary partition.
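
For illustration only, here is a minimal sketch of deriving the member list from the status output rather than from a table. It assumes the status was dumped as tab-separated `name<TAB>value` lines (for example by running `SHOW STATUS LIKE 'wsrep_%'` in batch mode) and that the relevant variables are `wsrep_cluster_status` and `wsrep_incoming_addresses`. Both are real Galera status variables, but this PR text does not say which ones the script reads, and the helper names below are hypothetical.

```python
def parse_status(output):
    """Turn tab-separated 'name<TAB>value' lines into a dict."""
    status = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition("\t")
        status[name.strip()] = value.strip()
    return status


def primary_members(status):
    """Return member addresses, or an empty list outside the primary partition.

    Assumption: wsrep_incoming_addresses holds a comma-separated address list.
    """
    if status.get("wsrep_cluster_status") != "Primary":
        return []
    addresses = status.get("wsrep_incoming_addresses", "")
    return [addr.strip() for addr in addresses.split(",") if addr.strip()]


# Hypothetical sample output for a cluster that has lost one of three nodes.
sample = (
    "wsrep_cluster_status\tPrimary\n"
    "wsrep_incoming_addresses\t10.0.0.1:3306,10.0.0.2:3306\n"
)
members = primary_members(parse_status(sample))
```

Returning an empty list when the node is not in the primary partition keeps the caller from picking an endpoint based on a view that the rest of the cluster no longer shares.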

Jira: OSPRH-18408

openshift-ci · 2025-07-18T14:51:58Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dciabrin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dciabrin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

One chainsaw test abruptly cuts one galera node away from the galera cluster and verifies that the active endpoint moves to one of the two remaining galera instances. To do so, we currently kill -9 the target mysqld server. By design, it can take up to 15s by default for the remaining galera nodes to acknowledge that the node went away and react to it. This is a problem for the test: if the pod comes back online before the 15s have elapsed, the galera cluster won't move the endpoint and the test will fail. To prevent flaky results in this test, use the STOP signal instead of the KILL signal. This doesn't kill the pod, and by default galera will mark the node as not responding after 3s and switch the endpoint. This achieves the same result, which is to make sure that an unexpected disconnection still triggers an endpoint switch.
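
As a rough illustration of the difference between the two signals (this is not the chainsaw test itself, and the `freeze_then_kill` helper is hypothetical), the sketch below freezes a child process with SIGSTOP, checks that it still exists, and only then terminates it with SIGKILL. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def freeze_then_kill():
    """Stop a child process, confirm it is still alive, then kill it."""
    # Stand-in for the mysqld server: a child that just sleeps.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        # SIGSTOP suspends the process without terminating it, so it
        # stops responding to peers while its PID stays in place.
        proc.send_signal(signal.SIGSTOP)
        alive_after_stop = proc.poll() is None
        # SIGCONT would let a frozen process resume where it left off.
        proc.send_signal(signal.SIGCONT)
    finally:
        # SIGKILL terminates the process for good.
        proc.kill()
        exit_code = proc.wait()
    return alive_after_stop, exit_code


if __name__ == "__main__":
    print(freeze_then_kill())
```

Because a stopped process is never restarted by its supervisor, the node stays unresponsive for as long as the test needs, which is what removes the race with the pod coming back online.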

dciabrin · 2025-07-24T16:50:40Z

/retest
unrelated failure prior to tests in ci/prow/mariadb-operator-build-deploy

lmiccini · 2025-07-25T06:51:05Z

/lgtm

lmiccini · 2025-07-25T06:54:34Z

/cherry-pick 18.0-fr3

openshift-cherrypick-robot · 2025-07-25T06:55:12Z

@lmiccini: new pull request created: #349

In response to this:

/cherry-pick 18.0-fr3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot requested review from dprince and lewisdenny July 18, 2025 14:51

openshift-ci bot added the approved label Jul 18, 2025

dciabrin force-pushed the failover-gcomm-members branch 3 times, most recently from 624dc9e to 2d0fb48 Compare July 24, 2025 12:10

dciabrin added 3 commits July 24, 2025 16:46

Regenerate expired certificates for unit tests

1bb2318

dciabrin force-pushed the failover-gcomm-members branch from 2d0fb48 to bfdcd01 Compare July 24, 2025 15:40

openshift-ci bot assigned lmiccini Jul 25, 2025

openshift-ci bot added the lgtm label Jul 25, 2025

openshift-merge-bot bot merged commit 65de595 into openstack-k8s-operators:main Jul 25, 2025
8 checks passed

openshift-cherrypick-robot mentioned this pull request Jul 25, 2025

[18.0-fr3] Fetch up-to-date gcomm members list during a failover #349

Merged
