Fetch up-to-date gcomm members list during a failover #348
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dciabrin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 624dc9e to 2d0fb48
One chainsaw test consists of abruptly cutting one galera node off from the galera cluster and verifying that the active endpoint moves to one of the two remaining galera instances. To do so, we currently kill -9 the target mysqld server. By design, it can take up to 15s by default for the remaining galera nodes to acknowledge that the node went away and react to it. This is a problem for the test: if the pod comes back online before those 15s have elapsed, the galera cluster won't move the endpoint and the test will fail. To prevent flaky results in the unit test, use the STOP signal instead of the KILL signal. This doesn't kill the pod, and by default galera will mark the node as not responding after 3s and switch the endpoint. This achieves the same result, which is to make sure that an unexpected disconnection still triggers an endpoint switch.
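As a rough illustration of the idea, the sketch below freezes mysqld with SIGSTOP instead of killing it. It is not the actual chainsaw test: the use of kubectl exec with pkill, the pod name openstack-galera-0 and the container name galera are all assumptions made for the example.

```go
package main

import (
	"log"
	"os/exec"
)

// freezeMysqld simulates an abrupt disconnection of a galera node by
// suspending mysqld with SIGSTOP rather than killing it with SIGKILL.
// The pod keeps running (so it is not restarted and cannot rejoin too
// early), while the frozen node stops answering its peers, which lets
// galera mark it as unresponsive and move the active endpoint.
// The pod and container names below are illustrative.
func freezeMysqld(pod string) error {
	cmd := exec.Command("kubectl", "exec", pod, "-c", "galera", "--",
		"pkill", "-STOP", "mysqld")
	return cmd.Run()
}

func main() {
	if err := freezeMysqld("openstack-galera-0"); err != nil {
		log.Fatal(err)
	}
}
```

The point of the design choice is that the container is never terminated, so the restart race described above cannot occur, yet the remaining nodes still observe an unresponsive peer and perform the endpoint switch under test.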
The operator script that implements service endpoint failover contains internal logic to probe the up-to-date state of the gcomm cluster. This is done when the script starts, or when a command fails and is retried. The list of members was incorrectly extracted from a mysql table, which is not guaranteed to be up-to-date when, e.g., a node disappears from the cluster due to a network partition. Instead, we must rely on the mysql status, which always exposes the up-to-date gcomm state, in particular the members that are still connected to the primary partition. Jira: OSPRH-18408
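The actual failover logic lives in the operator's script; the Go sketch below only illustrates the distinction being made, by reading the live membership from the wsrep_incoming_addresses status variable (a standard Galera status listing members of the current primary component) rather than from a table. The DSN and the assumption that this particular status variable is the one the script needs are illustrative, not taken from the PR.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/go-sql-driver/mysql" // MySQL/MariaDB driver
)

// currentGcommMembers returns the members currently connected to the
// primary partition, as reported by the wsrep_incoming_addresses status
// variable. Unlike a table, the status output reflects the gcomm view
// as soon as it changes, e.g. after a network partition.
func currentGcommMembers(db *sql.DB) ([]string, error) {
	var name, value string
	row := db.QueryRow("SHOW STATUS LIKE 'wsrep_incoming_addresses'")
	if err := row.Scan(&name, &value); err != nil {
		return nil, err
	}
	if value == "" {
		return nil, nil
	}
	return strings.Split(value, ","), nil
}

func main() {
	// The DSN is illustrative only.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	members, err := currentGcommMembers(db)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("gcomm members in the primary partition:", members)
}
```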
Force-pushed from 2d0fb48 to bfdcd01
/retest
/lgtm
Merged 65de595 into openstack-k8s-operators:main
/cherry-pick 18.0-fr3
@lmiccini: new pull request created: #349
In response to this:
> /cherry-pick 18.0-fr3
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.