Rework retry/timeout defaults to ensure fast service failover #337
/retest-required
- Improve the teardown of every test, so that KUTTL can run the tests in a random order without causing errors due to unexpected resource state.
- Improve the account and database creation tests so that they can be run from the top-most directory without causing KUTTL errors.
- Remove a test that expects the mariadb-operator to run in a pod in a dedicated namespace. This test doesn't add much coverage, and removing it greatly simplifies local testing during development and CI failure analysis.
Added another commit in the PR to fix the KUTTL errors from unit tests.
When the galera pod that receives database traffic becomes unresponsive, the galera library reacts by running a script in one of the surviving pods to elect a new endpoint. This script uses curl to call the API server and update the selector object responsible for balancing database traffic.

If the API server becomes unresponsive or unreachable during that call (e.g. the API VIP fails over to another master node), the curl call might get stuck for an unbounded period of time, which delays the traffic failover and can cause a long database service disruption.

Add a default connect timeout and update the default retry parameters so that curl is never blocked for too long, and the endpoint configuration can be retried until the API server becomes available.

This commit only improves the default parameters. The ability to override those parameters will be addressed in a subsequent commit.

Jira: OSPRH-17604
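The bounded-time behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the script changed by this PR: the numeric values, the `curl_opts` and `retry` helper names, and the commented-out `API_URL` call are all assumptions. Only `--connect-timeout`, `--max-time`, `--silent`, `--show-error` and `--fail` are taken as given, since they are standard curl options.

```shell
#!/usr/bin/env bash
# Hypothetical defaults, chosen for illustration only (not the PR's values).
CONNECT_TIMEOUT=5   # seconds allowed for establishing the connection
MAX_TIME=10         # seconds allowed for one whole curl invocation
RETRIES=6           # attempts before giving up
RETRY_DELAY=2       # seconds to sleep between attempts

# Print the curl options, one per line, that bound a single API call.
curl_opts() {
    printf '%s\n' \
        --silent --show-error --fail \
        --connect-timeout "$CONNECT_TIMEOUT" \
        --max-time "$MAX_TIME"
}

# Run a command until it succeeds, or until the given number of
# attempts is exhausted. Usage: retry ATTEMPTS DELAY COMMAND [ARGS...]
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Hypothetical usage (API_URL is a placeholder, not a real endpoint):
# mapfile -t opts < <(curl_opts)
# retry "$RETRIES" "$RETRY_DELAY" curl "${opts[@]}" -X PATCH "$API_URL"
```

With these assumed values, each attempt is capped at `MAX_TIME` seconds, so the worst-case delay before the script gives up is roughly `RETRIES * MAX_TIME + (RETRIES - 1) * RETRY_DELAY` seconds rather than unbounded.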
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dciabrin, lmiccini
Merged 0b41a4a into openstack-k8s-operators:main
/cherry-pick 18.0-fr3
@lmiccini: new pull request created: #338