Commit e670a39

fultonj and claude authored and committed
nova_wait_for_compute_service: Add retry logic for transient auth failures
Add configurable retry logic to handle transient OpenShift API authentication failures in the nova_wait_for_compute_service hook playbook. When OpenShift is under load, API authentication can temporarily fail with HTTP 401 "Unauthorized" errors, causing the hook to abort with "NoneType: None" exceptions.

This change adds retry logic around 'oc project' commands to handle these transient authentication failures. The retry mechanism uses OC_RETRIES (5 attempts, 30s delay) specifically for OpenShift authentication failures before executing the main business logic. This ensures we can reliably connect to the cluster while allowing the existing RETRIES logic to handle legitimate OpenStack service startup delays.

Changes:
- Add _oc_retries and _oc_delay variables for authentication retry configuration
- Add retry loops around 'oc project' commands in both script blocks
- Provide clear logging for authentication retry attempts

This prevents costly deployment failures when experiencing temporary OpenShift API authentication issues while preserving appropriate timeouts for service readiness checks.

Co-Authored-By: Claude <[email protected]>
1 parent a8b9387 commit e670a39
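
To make the two retry budgets described in the message concrete, the sketch below computes the worst-case time each loop spends sleeping before it gives up. It assumes the defaults shown in this commit (`_oc_retries: 5`, `_oc_delay: 30`, `_retries: 25`) and the hard-coded `sleep 10` in the readiness loops. `POLL_DELAY` is a hypothetical name for that interval, and the time taken by the `oc` commands themselves is ignored.

```shell
# Defaults taken from the playbook vars in this commit.
OC_RETRIES=5
OC_DELAY=30
RETRIES=25
# Hypothetical name: the playbook hard-codes "sleep 10" instead of a variable.
POLL_DELAY=10

# Each loop sleeps once per retry before exiting with status 1.
auth_wait=$(( OC_RETRIES * OC_DELAY ))
service_wait=$(( RETRIES * POLL_DELAY ))

echo "Worst-case auth retry wait: ${auth_wait}s"        # prints 150s
echo "Worst-case service readiness wait: ${service_wait}s"  # prints 250s
```

Because the authentication loop finishes before the readiness loop starts, the two budgets add up rather than overlap in each script block.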

1 file changed: +34 -2 lines changed

hooks/playbooks/nova_wait_for_compute_service.yml

Lines changed: 34 additions & 2 deletions
@@ -16,6 +16,9 @@
     _number_of_computes: 0
     _retries: 25
     _cell_conductor: null
+    # Retry settings for oc commands to handle transient auth failures
+    _oc_retries: 5
+    _oc_delay: 30
   environment:
     KUBECONFIG: "{{ cifmw_openshift_kubeconfig }}"
     PATH: "{{ cifmw_path }}"
@@ -29,14 +32,29 @@
           COMPUTES={{ _number_of_computes }}
           RETRIES={{ _retries }}
           COUNTER=0
-          oc project {{ namespace }}
+          OC_RETRIES={{ _oc_retries }}
+          OC_DELAY={{ _oc_delay }}
+
+          # Retry oc project command to handle transient auth failures
+          oc_retry_counter=0
+          until oc project {{ namespace }}; do
+            if [[ "$oc_retry_counter" -ge "$OC_RETRIES" ]]; then
+              echo "Failed to authenticate with OpenShift after $OC_RETRIES attempts"
+              exit 1
+            fi
+            oc_retry_counter=$[$oc_retry_counter +1]
+            echo "OpenShift auth failed, retrying in ${OC_DELAY}s (attempt $oc_retry_counter/$OC_RETRIES)"
+            sleep $OC_DELAY
+          done
+
           until [ $(oc rsh openstackclient openstack compute service list --service nova-compute -f value | wc -l) -eq "$COMPUTES" ]; do
             if [[ "$COUNTER" -ge "$RETRIES" ]]; then
               exit 1
             fi
             COUNTER=$[$COUNTER +1]
             sleep 10
           done
+
     - name: Run nova-manage discover_hosts and wait for host records
       cifmw.general.ci_script:
         output_dir: "{{ cifmw_basedir }}/artifacts"
@@ -46,7 +64,21 @@
           COMPUTES={{ _number_of_computes | int + 4 }}
           RETRIES={{ _retries }}
           COUNTER=0
-          oc project {{ namespace }}
+          OC_RETRIES={{ _oc_retries }}
+          OC_DELAY={{ _oc_delay }}
+
+          # Retry oc project command to handle transient auth failures
+          oc_retry_counter=0
+          until oc project {{ namespace }}; do
+            if [[ "$oc_retry_counter" -ge "$OC_RETRIES" ]]; then
+              echo "Failed to authenticate with OpenShift after $OC_RETRIES attempts"
+              exit 1
+            fi
+            oc_retry_counter=$[$oc_retry_counter +1]
+            echo "OpenShift auth failed, retrying in ${OC_DELAY}s (attempt $oc_retry_counter/$OC_RETRIES)"
+            sleep $OC_DELAY
+          done
+
           until [ $(oc rsh {{ _cell_conductor }} nova-manage cell_v2 list_hosts | wc -l) -eq "$COMPUTES" ]; do
             if [[ "$COUNTER" -ge "$RETRIES" ]]; then
               exit 1
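
The same retry loop is added verbatim to both script blocks. One possible way to avoid the duplication is a small shell helper, sketched below. This is not part of the commit: `retry_cmd` is a hypothetical function name, and the sketch uses the POSIX `$(( ))` arithmetic form in place of the older `$[ ]` form seen in the diff.

```shell
# Hypothetical helper (not in the commit):
#   retry_cmd <max_retries> <delay_seconds> <command> [args...]
# Runs the command until it succeeds. After <max_retries> failed retries
# it gives up and returns 1 so the caller can decide how to abort.
retry_cmd() {
    max_retries="$1"
    delay="$2"
    shift 2
    attempt=0
    until "$@"; do
        if [ "$attempt" -ge "$max_retries" ]; then
            echo "Command failed after ${max_retries} retries: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        echo "Command failed, retrying in ${delay}s (retry ${attempt}/${max_retries})" >&2
        sleep "$delay"
    done
}
```

Under these assumptions, each script block could replace its loop with a single line such as `retry_cmd "$OC_RETRIES" "$OC_DELAY" oc project {{ namespace }} || exit 1`. Like the loops in the diff, the helper runs the command once plus up to `max_retries` retries, so with the default of 5 the command is tried at most six times.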
