Commit 09df39e

Rework retry/timeout defaults to ensure fast service failover
When the galera pod that receives database traffic becomes unresponsive, the galera library reacts by running a script in one of the surviving pods to elect a new endpoint. This script uses curl to call the API server to update the selector object responsible for balancing database traffic.

If the API server becomes unresponsive or unreachable during the API call (e.g. the API VIP fails over to another master node), the curl call might get stuck for an unbounded period of time, which delays the traffic failover and can cause a long database service disruption.

Add a default connect timeout and update the default retry parameters so that curl is never blocked for too long, and the endpoint configuration can be retried until the API server becomes available.

This commit only improves the default parameters; the ability to override those parameters will be addressed in a subsequent commit.

Jira: OSPRH-17604
1 parent 3febd29 commit 09df39e
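
The patch sets its defaults with the shell `: ${VAR:=default}` idiom, which assigns a value only when the variable is unset or empty. The sketch below illustrates that behaviour in isolation. `build_curl_cmd` is a hypothetical helper introduced here for illustration and is not part of the script; how such values would actually reach the pod is left to the follow-up commit mentioned above.

```shell
# Sketch only: the same ":=" idiom as the patch, wrapped in a function.
# build_curl_cmd is a hypothetical helper, not part of mysql_wsrep_notify.sh.
build_curl_cmd() {
    # Assign the default only if the variable is unset or empty.
    : "${WSREP_NOTIFY_CURL_CONNECT_TIMEOUT:=5}"
    : "${WSREP_NOTIFY_CURL_MAX_TIME:=30}"
    echo "curl --connect-timeout ${WSREP_NOTIFY_CURL_CONNECT_TIMEOUT} --max-time ${WSREP_NOTIFY_CURL_MAX_TIME}"
}

# Nothing set beforehand: the defaults apply.
# prints: curl --connect-timeout 5 --max-time 30
( unset WSREP_NOTIFY_CURL_CONNECT_TIMEOUT WSREP_NOTIFY_CURL_MAX_TIME; build_curl_cmd )

# A value that is already set is kept; only the missing one gets its default.
# prints: curl --connect-timeout 2 --max-time 30
( WSREP_NOTIFY_CURL_CONNECT_TIMEOUT=2; unset WSREP_NOTIFY_CURL_MAX_TIME; build_curl_cmd )
```

`--connect-timeout` bounds only the connection phase, while `--max-time` bounds the whole transfer, so together they cap how long a single call can block.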

2 files changed: 19 additions, 7 deletions

templates/galera/bin/mysql_wsrep_notify.sh

Lines changed: 12 additions & 7 deletions
@@ -11,9 +11,14 @@ NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
  TOKEN=$(cat ${SERVICEACCOUNT}/token)
  CACERT=${SERVICEACCOUNT}/ca.crt

- # Retry config
- RETRIES=6
- WAIT=1
+ # OSPRH-17604: use default timeout and retry parameters for fast failover
+ # default parameters for curl calls to the API server
+ : ${WSREP_NOTIFY_CURL_CONNECT_TIMEOUT:=5}
+ : ${WSREP_NOTIFY_CURL_MAX_TIME:=30}
+ CURL="curl --connect-timeout ${WSREP_NOTIFY_CURL_CONNECT_TIMEOUT} --max-time ${WSREP_NOTIFY_CURL_MAX_TIME}"
+ # defaults parameters for retry on error
+ : ${WSREP_NOTIFY_RETRIES:=30}
+ : ${WSREP_NOTIFY_RETRY_WAIT:=1}


  ##
@@ -66,7 +71,7 @@ function api_server {
  request="$request -d @-"
  fi
  local output
- output=$(curl -s --cacert ${CACERT} --header "Content-Type:application/json" --header "Authorization: Bearer ${TOKEN}" --request $request ${APISERVER}/api/v1/namespaces/${NAMESPACE}/services/${service})
+ output=$(${CURL} -s --cacert ${CACERT} --header "Content-Type:application/json" --header "Authorization: Bearer ${TOKEN}" --request $request ${APISERVER}/api/v1/namespaces/${NAMESPACE}/services/${service})

  local rc=$?
  if [ $rc != 0 ]; then
@@ -109,8 +114,8 @@ function parse_output {
  # Generic retry logic for an action function
  function retry {
  local action=$1
- local retries=$RETRIES
- local wait=$WAIT
+ local retries=$WSREP_NOTIFY_RETRIES
+ local wait=$WSREP_NOTIFY_RETRY_WAIT
  local rc=1

  $action
@@ -132,7 +137,7 @@ function retry {
  mysql_probe_state reprobe
  done
  if [ $rc -ne 0 ]; then
- log_error "Could not run action after ${RETRIES} tries. Stop retrying."
+ log_error "Could not run action after ${WSREP_NOTIFY_RETRIES} tries. Stop retrying."
  fi
  return $rc
  }
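
The diff shows only fragments of the `retry` function, so the following is a minimal sketch of a bounded retry loop in the same spirit, not the script's actual implementation. `retry_sketch` and `flaky_action` are hypothetical names. The real function also re-probes the MySQL state between attempts (`mysql_probe_state reprobe`), which is omitted here, and its exact attempt accounting may differ.

```shell
# Sketch only: bounded retry driven by the same two variables as the patch.
# retry_sketch and flaky_action are hypothetical names used for illustration.
: "${WSREP_NOTIFY_RETRIES:=30}"
: "${WSREP_NOTIFY_RETRY_WAIT:=1}"

retry_sketch() {
    local action=$1
    local attempts=0
    while [ "$attempts" -lt "$WSREP_NOTIFY_RETRIES" ]; do
        attempts=$((attempts + 1))
        # Run the action in the current shell; stop at the first success.
        if "$action"; then
            return 0
        fi
        sleep "$WSREP_NOTIFY_RETRY_WAIT"
    done
    echo "Could not run action after ${WSREP_NOTIFY_RETRIES} tries. Stop retrying." >&2
    return 1
}

# Demo action: fails twice, then succeeds on the third call.
calls=0
flaky_action() {
    calls=$((calls + 1))
    [ "$calls" -ge 3 ]
}

retry_sketch flaky_action && echo "succeeded after ${calls} calls"
```

Because each attempt is itself capped by curl's `--max-time`, the total time spent retrying stays bounded even when the API server never answers.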

tests/chainsaw/tests/service/chainsaw-test.yaml

Lines changed: 7 additions & 0 deletions
@@ -74,6 +74,11 @@ spec:
  check:
  # we dont want "ERROR 1047 (08S01) at line 495: WSREP has not yet prepared node for application use"
  (find_first($stdout,'(08S01)') == NULL): true
+ catch: &catch_logs
+ - script:
+ content: |
+ # get full logs for all pods except copy logs from kolla start
+ oc logs -n $NAMESPACE --prefix=true --tail=-1 -l galera/name=openstack | grep -v -e ' INFO:'

  - name: Service failover on pod crash
  description: Check that service is failing over when the current endpoint pod crashes
@@ -97,6 +102,7 @@ spec:
  check:
  ($stdout != $endpoint): true
  - script: *no_wsrep_in_failover_check
+ catch: *catch_logs

  - name: No failover on random pod restart
  description: Check that service is not impacted when a pod that is not the current endpoint is stopped
@@ -114,3 +120,4 @@ spec:
  check:
  ($stdout == $endpoint): true
  - script: *no_wsrep_in_failover_check
+ catch: *catch_logs
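
The `catch` step pipes the pod logs through `grep -v -e ' INFO:'` to drop the kolla start-up copy messages. The sketch below applies the same filter to a few log lines so its effect can be checked without a cluster. `filter_logs` is a hypothetical wrapper, and the sample lines are made up for illustration; they are not real output from the test.

```shell
# Sketch only: the grep filter from the catch step, wrapped for reuse.
# filter_logs is a hypothetical name; it reads log lines on stdin.
filter_logs() {
    # -v inverts the match: lines containing ' INFO:' are dropped.
    grep -v -e ' INFO:'
}

# Made-up sample lines in the "[pod/...] message" shape produced by --prefix=true.
printf '%s\n' \
    '[pod/openstack-galera-0/galera] INFO:__main__:Copying file' \
    '[pod/openstack-galera-0/galera] WSREP: node is synced' \
    | filter_logs
# prints: [pod/openstack-galera-0/galera] WSREP: node is synced
```

Note that `grep -v` exits with status 1 when every input line is filtered out, which matters if the surrounding script treats a non-zero status as a failure.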
