@dciabrin
Contributor

@dciabrin dciabrin commented Jul 18, 2025

The operator script that implements service endpoint failover
contains internal logic to probe the up-to-date state of the
gcomm cluster. This is done when the script starts, or when
a command fails and is retried.

The list of members was incorrectly extracted from a mysql
table, which is not guaranteed to be up-to-date when e.g.
a node disappears from the cluster due to a network partition.

Instead, we must rely on the mysql status, which always exposes
the up-to-date gcomm state, in particular the members that
are still connected to the primary partition.
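
As a rough illustration of reading membership from the status rather than from a table, here is a minimal sketch. It assumes the status rows arrive as tab-separated `name<TAB>value` text (as a batch-mode client would print them), and it assumes the Galera status variables `wsrep_cluster_status` and `wsrep_incoming_addresses` are the ones consulted; which variables the operator script actually reads is not stated in this description. The function names and the host names in the usage note are hypothetical.

```python
from typing import Dict, List


def parse_status_rows(raw: str) -> Dict[str, str]:
    """Turn tab-separated 'name<TAB>value' lines into a dict.

    Lines without a tab separator are ignored.
    """
    status: Dict[str, str] = {}
    for line in raw.splitlines():
        name, sep, value = line.partition("\t")
        if sep:
            status[name.strip()] = value.strip()
    return status


def connected_members(status: Dict[str, str]) -> List[str]:
    """Return the members listed in the status, or [] outside a primary partition.

    Assumption: 'wsrep_incoming_addresses' holds a comma-separated list of
    member addresses, and 'wsrep_cluster_status' is 'Primary' when this node
    belongs to the primary partition.
    """
    if status.get("wsrep_cluster_status") != "Primary":
        return []
    addresses = status.get("wsrep_incoming_addresses", "")
    return [addr.strip() for addr in addresses.split(",") if addr.strip()]
```

For example, feeding it `"wsrep_cluster_status\tPrimary\nwsrep_incoming_addresses\tgalera-0.example:3306,galera-2.example:3306\n"` yields the two listed addresses, while a status reporting `non-Primary` yields an empty list, so a caller would not pick an endpoint from a partitioned node.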

Jira: OSPRH-18408

@openshift-ci openshift-ci bot requested review from dprince and lewisdenny July 18, 2025 14:51
@openshift-ci
Contributor

openshift-ci bot commented Jul 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dciabrin


@dciabrin dciabrin force-pushed the failover-gcomm-members branch 3 times, most recently from 624dc9e to 2d0fb48 Compare July 24, 2025 12:10
dciabrin added 3 commits July 24, 2025 16:46
One chainsaw test abruptly cuts one galera node away from the galera
cluster and verifies that the active endpoint moves to one of the
remaining two galera instances.

To do so, we currently kill -9 the target mysqld server. By design,
it can take up to 15s by default for the remaining galera nodes to
acknowledge that the node went away and react to it. This is a problem
for the test: if the pod comes back online before the 15s elapse, the
galera cluster won't move the endpoint and the test will fail.

To prevent flaky results in this test, use the STOP signal instead
of the KILL signal. This doesn't kill the pod, and by default galera
will mark the node as not responding after 3s and switch the endpoint.

This achieves the same result, which is to make sure that an unexpected
disconnection still triggers an endpoint switch.
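
The difference between the two signals can be sketched on any POSIX system with a stand-in process. This is only an illustration under stated assumptions: the child below is a plain sleeping Python process, not mysqld, and the helper name `stop_then_kill` is hypothetical. It shows that a process that received STOP stays alive (so nothing restarts it) while being unable to respond, whereas KILL terminates it.

```python
import signal
import subprocess
import sys
import time
from typing import Dict


def stop_then_kill(frozen_for: float = 0.2) -> Dict[str, object]:
    """Freeze a stand-in child with SIGSTOP, then terminate it with SIGKILL."""
    # Stand-in for the database server: a child that only sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        # STOP suspends the process; it cannot be caught or ignored.
        child.send_signal(signal.SIGSTOP)
        time.sleep(frozen_for)
        # poll() returns None while the process has not exited,
        # so a stopped process is still reported as alive.
        alive_while_stopped = child.poll() is None

        # KILL terminates the process, even when it is stopped.
        child.send_signal(signal.SIGKILL)
        child.wait(timeout=5)
        return {
            "alive_while_stopped": alive_while_stopped,
            "returncode_after_kill": child.returncode,
        }
    finally:
        # Safety net so the stand-in never outlives the demonstration.
        if child.poll() is None:
            child.kill()
            child.wait()
```

In the test scenario described above, the frozen server keeps its pod running, so the only way the cluster can recover is by noticing the unresponsive node and moving the endpoint.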
@dciabrin dciabrin force-pushed the failover-gcomm-members branch from 2d0fb48 to bfdcd01 Compare July 24, 2025 15:40
@dciabrin
Contributor Author

/retest
unrelated failure prior to tests in ci/prow/mariadb-operator-build-deploy

@lmiccini

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 25, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 65de595 into openstack-k8s-operators:main Jul 25, 2025
8 checks passed
@lmiccini

/cherry-pick 18.0-fr3

@openshift-cherrypick-robot

@lmiccini: new pull request created: #349

In response to this:

/cherry-pick 18.0-fr3

