Hi Linkerd community. I'm struggling with an ongoing issue and have run out of ideas for how to debug it. My cluster has the following versions:
About once per day, when the cluster is scaling back "in", one of the services (let's call it "service-X") seems to lose all its Linkerd endpoints, even though there are plenty of pods available. The linkerd-proxy sidecars of the upstream services that call service-X log errors like this:
This lasts about 3-7 minutes, then it recovers and everything works again. The problem is that during this time, while the fail-fast circuit breaker is open, no traffic can be sent to that service. This usually happens when a Kubernetes node is scaling "in", so one of service-X's pods needs to move from the draining node to another node. Pod draining works fine for all the other services, but occasionally all the proxies talking to service-X go into "fail-fast" mode and I can't see why. I tried enabling access logs and increasing the log level in the proxy, but this didn't provide much insight. I also increased the One thing I've noticed is "Failed to send address update" errors in the
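For anyone hitting something similar, these are the kinds of commands that can show whether Kubernetes and the Linkerd Destination controller agree on the endpoint set during an incident. The namespace, port, and deployment names below are placeholders, not from the original report:

```shell
# What Kubernetes thinks the endpoints for service-X are (placeholder namespace)
kubectl -n my-ns get endpoints service-X -o wide

# What the Destination controller is actually advertising to the proxies
linkerd diagnostics endpoints service-X.my-ns.svc.cluster.local:8080

# Destination controller logs around the time of the incident
kubectl -n linkerd logs deploy/linkerd-destination destination --since=15m

# Raise the proxy log level on a calling workload (takes effect when pods restart)
kubectl -n my-ns patch deploy my-caller --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,warn"}}}}}'
```

Comparing the first two outputs while the proxies are in fail-fast is the quickest way to tell whether the gap is between Kubernetes and the Destination controller, or between the Destination controller and the proxies.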
My questions are:
Any suggestions on how to debug this would be very much appreciated. Cheers, Tim.
-
I turned on
Under what conditions would it consider service-X to have NoEndpoints(true) when there are plenty of pods running?
-
Thanks for the detailed report @tgolly. There have been lots of improvements in the Destination controller since 2.12, so do you mind giving it a try with 2.13 and reporting back? 2.14 is also coming out next week; if you want to try that as well, you'd have to upgrade first to 2.13 and then to 2.14.
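For reference, a sequential CLI-based upgrade can be sketched roughly like this (assuming the control plane was installed with the `linkerd` CLI rather than Helm; the namespace and deployment names are placeholders):

```shell
# Install the target CLI version first, then confirm client/server versions
linkerd version

# Render and apply the upgraded control-plane manifests
linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -

# Verify the control plane and data plane are healthy before the next hop
linkerd check

# Restart meshed workloads so they pick up the new proxy version
kubectl -n my-ns rollout restart deploy
```

Repeating these steps once per minor version (2.12 to 2.13, then 2.13 to 2.14) matches the one-hop-at-a-time path suggested above.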
Thanks @alpeb. Yes, upgrading to 2.13.x fixed the issue; we were hitting #10370.