@dciabrin
Contributor

@dciabrin dciabrin commented Jun 19, 2025

When the galera pod that receives database traffic becomes unresponsive, the galera library reacts by running a script in one of the surviving pods to elect a new endpoint. This script uses curl to call the API server to update the selector object responsible for balancing database traffic.

If the API server becomes unresponsive or unreachable during the API call (e.g. the API VIP fails over to another master node), the curl call might get stuck for an unbounded period of time, which delays the traffic failover and can cause a long database service disruption.

Add a default connect timeout and update default retry parameters so that curl is never blocked for too long, and the endpoint configuration can be retried until the API server becomes available.
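
For illustration only, the kind of bounded, retried curl invocation described above could be sketched as follows. The numeric values and the `curl_opts` and `api_call` helper names are assumptions made for this example; they are not the defaults or the code from the commit. The curl flags themselves (`--connect-timeout`, `--max-time`, `--retry`, `--retry-delay`, `--retry-connrefused`) are standard curl options.

```shell
# Hypothetical defaults: the values chosen in the actual commit may differ.
CONNECT_TIMEOUT=5   # seconds allowed to establish the connection
MAX_TIME=15         # seconds allowed for each attempt as a whole
RETRIES=5           # retries after a transient failure
RETRY_DELAY=2       # seconds to wait between retries

# Print the curl options, one token per line.
curl_opts() {
    printf '%s\n' \
        --silent --show-error --fail \
        --connect-timeout "$CONNECT_TIMEOUT" \
        --max-time "$MAX_TIME" \
        --retry "$RETRIES" \
        --retry-delay "$RETRY_DELAY" \
        --retry-connrefused
}

# Call the API server with a bounded, retried request.
# Extra arguments (method, headers, body, URL) are passed through unchanged.
api_call() {
    # Word splitting of $(curl_opts) is intentional: the output only
    # contains flags and integers.
    # shellcheck disable=SC2046
    curl $(curl_opts) "$@"
}
```

With options like these, a single attempt cannot hang longer than `MAX_TIME`, and transient failures such as timeouts or refused connections are retried instead of aborting the endpoint update on the first error.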

This commit only improves the default parameters; the ability to override those parameters will be addressed in a subsequent commit.

Jira: OSPRH-17604

@openshift-ci openshift-ci bot requested review from abays and olliewalsh June 19, 2025 14:53
@dciabrin
Contributor Author

/retest-required

. Improve the teardown of every test, so that KUTTL can run the
  tests in a random order without causing errors due to unexpected
  resource state.

. Improve account and database creation tests so that they can be
  run from the top-most directory without causing KUTTL errors.

. Also remove a test that expects the mariadb-operator to run in a pod
  in a dedicated namespace. This test doesn't add much coverage,
  and removing it greatly simplifies local testing during
  development or CI failure analysis.
@dciabrin
Contributor Author

Added another commit in the PR to fix the KUTTL errors from unit tests

When the galera pod that receives database traffic becomes
unresponsive, the galera library reacts by running a script
in one of the surviving pods to elect a new endpoint. This
script uses curl to call the API server to update the selector
object responsible for balancing database traffic.

If during the API call the API server becomes unresponsive/unreachable
(e.g. the API VIP fails over to another master node), the curl call
might get stuck for an unbounded period of time, which delays the
traffic failover and can cause a long database service disruption.

Add a default connect timeout and update default retry parameters
so that curl is never blocked for too long, and the endpoint
configuration can be retried until the API server becomes available.

This commit only improves the default parameters; the ability to override
those parameters will be addressed in a subsequent commit.

Jira: OSPRH-17604
@openshift-ci
Contributor

openshift-ci bot commented Jun 23, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dciabrin, lmiccini


@openshift-merge-bot openshift-merge-bot bot merged commit 0b41a4a into openstack-k8s-operators:main Jun 23, 2025
7 checks passed
@lmiccini

/cherry-pick 18.0-fr3

@openshift-cherrypick-robot

@lmiccini: new pull request created: #338

