Description
Hello. We are seeing strange behavior with our Redis cache cluster: occasionally we get a burst of timeout errors on one node, and at the same time we see a huge spike in new connections on that node. We're not sure yet which of the two happens first. The cluster is hosted in AWS, and we usually see a "network bandwidth exceeded" metric firing on the node in question at the same time.
I'm reaching out because the errors have been pretty confounding to debug, so if this behavior makes sense to any contributors it might help steer me in the right direction.
We use a socket_connect_timeout of 1s and a socket_timeout of 100ms. We have retry set to an exponential backoff with a base of 10ms and a cap of 200ms, but currently only a single retry attempt.
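To make those numbers concrete, here is a rough sketch of the worst-case time a single command could take under these settings. The function names and the doubling formula are assumptions for illustration only, not the client's actual implementation. Whether the connect timeout counts toward each attempt is also an assumption, since that depends on the answer to the first question below.

```python
from typing import List

# Illustrative only: these helpers are hypothetical and do not come from
# any client library. The backoff is assumed to be plain doubling with no jitter.


def backoff_delays(base: float, cap: float, retries: int) -> List[float]:
    """Sleep before each retry: base doubles per failure, clipped at cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def worst_case_seconds(
    socket_timeout: float,
    connect_timeout: float,
    base: float,
    cap: float,
    retries: int,
    reconnect_each_attempt: bool,
) -> float:
    """Upper bound on time spent on one command, including retries.

    If reconnect_each_attempt is True, every attempt is assumed to pay the
    full connect timeout before the command itself times out.
    """
    attempts = retries + 1
    per_attempt = socket_timeout
    if reconnect_each_attempt:
        per_attempt += connect_timeout
    return attempts * per_attempt + sum(backoff_delays(base, cap, retries))


# Settings from this issue, expressed in seconds.
settings = dict(socket_timeout=0.100, connect_timeout=1.0,
                base=0.010, cap=0.200, retries=1)

print(worst_case_seconds(reconnect_each_attempt=False, **settings))
print(worst_case_seconds(reconnect_each_attempt=True, **settings))
```

Under these assumptions a command fails after roughly 0.21s if the connection is reused, and after roughly 2.21s if each attempt also has to wait out a full connect timeout.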
Some questions/thoughts I have:
- Does the retry policy apply to establishing a connection, or just to executing commands?
- Does the client try to re-establish connections once a certain number of errors have been encountered? I'm trying to track down whether the new connections we're seeing are a result of the timeouts or vice versa. If the errors are not causing reconnection, are there any other situations in which the client tries to reconnect?