Commit 288f775

Rework retry/timeout defaults to ensure fast service failover
When the galera pod that receives database traffic becomes unresponsive, the galera library reacts by running a script in one of the surviving pods to elect a new endpoint. This script uses curl to call the API server and update the selector object responsible for balancing database traffic.

If the API server becomes unresponsive or unreachable during the API call (e.g. the API VIP fails over to another master node), the curl call might get stuck for an unbounded period of time, which delays the traffic failover and can cause a long database service disruption.

Add a default connect timeout and update the default retry parameters so that curl is never blocked for too long, and the endpoint configuration can be retried until the API server becomes available.

This commit only improves the default parameters; the ability to override those parameters will be addressed in a subsequent commit.

Jira: OSPRH-17604
1 parent 3febd29 commit 288f775
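
To illustrate the retry behaviour described in the commit message, here is a minimal sketch of a bounded retry loop. It is not the script's actual `retry` function (only part of that function is visible in the diff). The names `DEMO_RETRIES`, `DEMO_RETRY_WAIT`, `flaky_action` and `demo_retry` are hypothetical, and the fake action stands in for a curl call to the API server that fails twice before succeeding.

```shell
#!/bin/bash
# Hypothetical stand-ins; none of these names come from the commit.
DEMO_RETRIES=5
DEMO_RETRY_WAIT=0   # 0 keeps the demo instant; the commit's default wait is 1

attempts=0

# Fake action: fails on the first two calls, succeeds on the third.
flaky_action() {
    attempts=$((attempts + 1))
    [ "${attempts}" -ge 3 ]
}

# Run the action once, then retry on failure until it succeeds
# or the retry budget is exhausted.
demo_retry() {
    local action=$1
    local retries=${DEMO_RETRIES}
    local rc=0
    ${action} || rc=$?
    while [ "${rc}" -ne 0 ] && [ "${retries}" -gt 0 ]; do
        sleep "${DEMO_RETRY_WAIT}"
        rc=0
        ${action} || rc=$?
        retries=$((retries - 1))
    done
    return "${rc}"
}

demo_rc=0
demo_retry flaky_action || demo_rc=$?
echo "rc=${demo_rc} attempts=${attempts}"
```

Because each attempt in the real script is also capped by curl's own timeouts, every iteration of such a loop takes a bounded amount of time, which is what lets the retries continue until the API server is reachable again.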

1 file changed, 12 additions and 7 deletions

templates/galera/bin/mysql_wsrep_notify.sh

Lines changed: 12 additions & 7 deletions
@@ -11,9 +11,14 @@ NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
 TOKEN=$(cat ${SERVICEACCOUNT}/token)
 CACERT=${SERVICEACCOUNT}/ca.crt
 
-# Retry config
-RETRIES=6
-WAIT=1
+# OSPRH-17604: use default timeout and retry parameters for fast failover
+# default parameters for curl calls to the API server
+: ${WSREP_NOTIFY_CURL_CONNECT_TIMEOUT:=5}
+: ${WSREP_NOTIFY_CURL_MAX_TIME:=30}
+CURL="curl --connect-timeout ${WSREP_NOTIFY_CURL_CONNECT_TIMEOUT} --max-time ${WSREP_NOTIFY_CURL_MAX_TIME}"
+# defaults parameters for retry on error
+: ${WSREP_NOTIFY_RETRIES:=30}
+: ${WSREP_NOTIFY_RETRY_WAIT:=1}
 
 
 ##
@@ -66,7 +71,7 @@ function api_server {
 request="$request -d @-"
 fi
 local output
-output=$(curl -s --cacert ${CACERT} --header "Content-Type:application/json" --header "Authorization: Bearer ${TOKEN}" --request $request ${APISERVER}/api/v1/namespaces/${NAMESPACE}/services/${service})
+output=$(${CURL} -s --cacert ${CACERT} --header "Content-Type:application/json" --header "Authorization: Bearer ${TOKEN}" --request $request ${APISERVER}/api/v1/namespaces/${NAMESPACE}/services/${service})
 
 local rc=$?
 if [ $rc != 0 ]; then
@@ -109,8 +114,8 @@ function parse_output {
 # Generic retry logic for an action function
 function retry {
 local action=$1
-local retries=$RETRIES
-local wait=$WAIT
+local retries=$WSREP_NOTIFY_RETRIES
+local wait=$WSREP_NOTIFY_RETRY_WAIT
 local rc=1
 
 $action
@@ -132,7 +137,7 @@ function retry {
 mysql_probe_state reprobe
 done
 if [ $rc -ne 0 ]; then
-log_error "Could not run action after ${RETRIES} tries. Stop retrying."
+log_error "Could not run action after ${WSREP_NOTIFY_RETRIES} tries. Stop retrying."
 fi
 return $rc
 }
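
The new defaults in the diff rely on the `: ${VAR:=default}` shell idiom. The sketch below shows how that idiom behaves, using hypothetical `DEMO_*` variables rather than the script's own `WSREP_NOTIFY_*` names: a default is assigned only when the variable is unset or empty, so a value that is already set is left alone.

```shell
#!/bin/bash
# DEMO_* names are hypothetical and are not part of the commit.

unset DEMO_CONNECT_TIMEOUT
DEMO_MAX_TIME=10   # simulate a value that is already set before the defaults run

# ':' is the shell no-op builtin; the ':=' expansion assigns the default
# only when the variable is unset or empty.
: ${DEMO_CONNECT_TIMEOUT:=5}
: ${DEMO_MAX_TIME:=30}

# Build the command prefix the same way the diff builds CURL.
DEMO_CURL="curl --connect-timeout ${DEMO_CONNECT_TIMEOUT} --max-time ${DEMO_MAX_TIME}"
echo "${DEMO_CURL}"
# prints: curl --connect-timeout 5 --max-time 10
```

Here `--connect-timeout` limits how long curl waits to establish the connection, and `--max-time` limits the total duration of the whole transfer, so a single call cannot block indefinitely.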
