Rework retry/timeout defaults to ensure fast service failover #337

dciabrin · 2025-06-19T14:53:23Z

When the galera pod that receives database traffic becomes unresponsive, the galera library reacts by running a script in one of the surviving pods to elect a new endpoint. This script uses curl to call the API server to update the selector object responsible for balancing database traffic.

If the API server becomes unresponsive or unreachable during the API call (e.g. the API VIP fails over to another master node), the curl call might get stuck for an unbounded period of time, which delays the traffic failover and can cause a long database service disruption.

Add a default connect timeout and update default retry parameters so that curl is never blocked for too long, and the endpoint configuration can be retried until the API server becomes available.
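As an illustration only, the kind of bounded, retrying curl invocation described above could be sketched as follows. The function names (`api_curl_opts`, `api_call`), the variable names, and every numeric value are assumptions chosen for the example; they are not the defaults introduced by this PR.

```shell
# Hypothetical defaults (NOT the values from this PR); override via environment.
API_CONNECT_TIMEOUT=${API_CONNECT_TIMEOUT:-5}   # seconds allowed to establish the connection
API_MAX_TIME=${API_MAX_TIME:-10}                # seconds allowed for each transfer
API_RETRIES=${API_RETRIES:-6}                   # retries after a transient failure
API_RETRY_DELAY=${API_RETRY_DELAY:-2}           # fixed wait in seconds between retries

# Print the curl options that bound how long a call can block.
api_curl_opts() {
    echo "--silent --show-error --fail" \
         "--connect-timeout ${API_CONNECT_TIMEOUT}" \
         "--max-time ${API_MAX_TIME}" \
         "--retry ${API_RETRIES}" \
         "--retry-delay ${API_RETRY_DELAY}" \
         "--retry-connrefused"
}

# api_call [curl args...] URL
# Runs curl with the bounded options above plus any caller-supplied arguments.
api_call() {
    # Word splitting of the option string is intentional here.
    # shellcheck disable=SC2046
    curl $(api_curl_opts) "$@"
}
```

With these options, a connection attempt to an unreachable API server is abandoned after the connect timeout instead of hanging, and curl retries the request (including on refused connections) until the retry budget is exhausted.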

This commit only improves the default parameters; the ability to override those parameters will be addressed in a subsequent commit.

Jira: OSPRH-17604

dciabrin · 2025-06-20T08:11:04Z

/retest-required

- Improve the teardown of every test, so that KUTTL can run the tests in a random order without causing errors due to unexpected resource state.
- Improve account and database creation tests so that they can be run from the top-most directory without causing KUTTL errors.
- Also remove a test that expects the mariadb-operator to run in a pod in a dedicated namespace. This test doesn't add much coverage, and removing it greatly simplifies testing locally during development or CI failure analysis.

dciabrin · 2025-06-20T18:02:16Z

Added another commit in the PR to fix the KUTTL errors from unit tests

openshift-ci · 2025-06-23T15:19:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dciabrin, lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dciabrin,lmiccini]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lmiccini · 2025-06-23T16:22:31Z

/cherry-pick 18.0-fr3

openshift-cherrypick-robot · 2025-06-23T16:23:08Z

@lmiccini: new pull request created: #338

In response to this:

/cherry-pick 18.0-fr3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci bot requested review from abays and olliewalsh June 19, 2025 14:53

openshift-ci bot added the approved label Jun 19, 2025

dciabrin force-pushed the OSPRH-17604 branch from 68fe0a9 to 288f775 Compare June 20, 2025 18:01

dciabrin force-pushed the OSPRH-17604 branch from 288f775 to 09df39e Compare June 23, 2025 12:56

lmiccini approved these changes Jun 23, 2025

openshift-ci bot assigned lmiccini Jun 23, 2025

openshift-ci bot added the lgtm label Jun 23, 2025

openshift-merge-bot bot merged commit 0b41a4a into openstack-k8s-operators:main Jun 23, 2025
7 checks passed

openshift-cherrypick-robot mentioned this pull request Jun 23, 2025

[18.0-fr3] Rework retry/timeout defaults to ensure fast service failover #338

Merged
