Improve flaky chainsaw test for service failover

dciabrin · dciabrin · commit eaf1d1761c71 · 2025-07-24T14:01:22.000+02:00
One chainsaw test abruptly cuts one galera node away from the
galera cluster and verifies that the active endpoint moves to one of the
remaining two galera instances.

To do so, we currently kill -9 the target mysqld server. By design, it
can take up to 15s by default for the remaining galera nodes to
acknowledge that the node went away and react to it. This is a problem
for the test: if the pod comes back online before the 15s elapse, the
galera cluster won't move the endpoint and the test will fail.

To prevent flaky results in this test, use the STOP signal instead
of the KILL signal. This doesn't kill the pod, and by default galera
will mark the node as not responding after 3s and switch the endpoint.

This achieves the same result, which is to make sure that an unexpected
disconnection still triggers an endpoint switch.
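
As a rough illustration of why STOP avoids the restart race, the sketch
below suspends a process and shows that it still exists afterwards. It is
not part of the test: `sleep 60` is only a hypothetical stand-in for
mysqld, and the final CONT/TERM steps are cleanup for this sketch.

```shell
#!/bin/sh
# Hypothetical stand-in for mysqld: any long-running process will do.
sleep 60 &
pid=$!

# Suspend the process. Unlike KILL, STOP leaves it in the process table,
# so nothing supervising it sees an exit and restarts it.
kill -s STOP "$pid"
sleep 1

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$pid" 2>/dev/null; then alive=yes; else alive=no; fi
echo "process still present after STOP: $alive"

# On systems whose ps supports -o/-p, a stopped process reports state "T".
state=$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ')
echo "process state: $state"

# Cleanup for this sketch only: resume the process, then terminate it.
kill -s CONT "$pid"
kill "$pid"
wait "$pid" 2>/dev/null
```

A stopped mysqld behaves the same way from the cluster's point of view: it
stops answering its peers while the container keeps running.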
diff --git a/tests/chainsaw/tests/service/chainsaw-test.yaml b/tests/chainsaw/tests/service/chainsaw-test.yaml
@@ -91,7 +91,7 @@ spec:
         content: |
           oc wait -n $NAMESPACE --for=jsonpath='{.status.readyReplicas}'=3 statefulset openstack-galera
           current=$ENDPOINT
-          oc rsh -n $NAMESPACE $ENDPOINT killall -9 /usr/libexec/mysqld
+          oc rsh -n $NAMESPACE $ENDPOINT killall -s STOP /usr/libexec/mysqld
           while [ "$current" = "$ENDPOINT" ]; do
             echo $(date) "$current" "$ENDPOINT"
             sleep 1