fix: Properly handle reconnections in connection manager (#3289)
Fixes #3288
We see many timer errors about clauses not existing in cloud deploys;
see the issue above.
I've noticed that for connection pools, we only allow starting them if
the pool pids are set to none, which is only ever true on the first try,
so the reconnection logic never really worked for them.
I've made it so that we initialise `pool_pids` with admin and snapshot
and start the appropriate ones.
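
A rough sketch of the idea, in Python rather than the project's own language. It assumes the state tracks one entry per pool role, and every name other than `pool_pids`, `admin` and `snapshot` is hypothetical. Because the entries exist from the start, "which pools still need starting" can be answered per role on every attempt, not only on the first one:

```python
from typing import Dict, List, Optional

# The two pool roles named in this PR.
POOL_ROLES = ("admin", "snapshot")


class PoolState:
    """Hypothetical stand-in for the pool part of the connection manager state."""

    def __init__(self) -> None:
        # Initialise an entry for every role up front; None means "not running".
        self.pool_pids: Dict[str, Optional[int]] = {role: None for role in POOL_ROLES}

    def roles_to_start(self) -> List[str]:
        # Only the pools that are not running need to be started,
        # whether this is the first attempt or a reconnection.
        return [role for role, pid in self.pool_pids.items() if pid is None]

    def mark_started(self, role: str, pid: int) -> None:
        self.pool_pids[role] = pid

    def mark_down(self, role: str) -> None:
        self.pool_pids[role] = None
```

With a single "all pids are none" guard, the start would be skipped as soon as any pool had been started once; the per-role check avoids that.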
I've also made it so that we only attempt a reconnection when the
failure happens during one of the setup steps of the process that
actually failed. Previously the logic seemed to retry the current step
on _any_ connection error, so the pools might fail and it might retry
the replication client.
The assumption there was that things would only ever exit and reconnect
during their connection step, but the reality is that they can exit with
a connection error _at any point_ during the setup.
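
The decision described above can be sketched roughly as follows, again in Python for illustration only. The step names, process names and the `"shutdown"` outcome are assumptions, not the connection manager's actual identifiers:

```python
# Hypothetical mapping from each process to the setup steps it owns.
SETUP_STEPS = {
    "replication_client": {"connecting_replication_client", "configuring_replication"},
    "pools": {"connecting_pools"},
}


def on_connection_error(current_step: str, failed_process: str) -> str:
    """Decide what to do when `failed_process` exits with a connection error."""
    if current_step in SETUP_STEPS.get(failed_process, set()):
        # The process failed during one of its own setup steps: retry it.
        return "retry"
    # The failure came from a different process, or after setup finished.
    # Retrying the current step would retry the wrong thing, so stop instead.
    return "shutdown"
```

For example, `on_connection_error("connecting_replication_client", "pools")` returns `"shutdown"` and does not retry the replication client.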
I continue to believe that our connection manager is a liability right
now because it's far from fully mapped out.
I haven't added tests for this issue, which is really annoying, but I'm
opening this PR regardless to move work on this issue forward. Feel free
to contribute to it.
edit: I've manually reproduced the error by sporadically killing the
pools after they are ready and observing startup. The fix does work: if
the pools crash after having become ready, the whole connection manager
is brought down, but in a graceful way.