Hi Linkerd community. I'm struggling with an ongoing issue and have run out of ideas for how to debug it. My cluster has the following versions:
About once per day, when the cluster is scaling back "in", one of the services (let's call it "service-X") seems to lose all its Linkerd endpoints, even though there are plenty of pods available. The linkerd-proxy sidecars of the upstream services that call service-X log errors like this:
This lasts about 3-7 minutes, then it recovers and everything works again. The problem is that during this time, while the fail-fast circuit breaker is open, no traffic can be sent to that service. This usually happens when a Kubernetes node is scaling "in", so one of service-X's pods needs to move from the draining node to another node. Pod draining works fine for all the other services, but occasionally all the proxies talking to service-X go into "fail-fast" mode and I can't see why. I tried enabling access logs and increasing the log level in the proxy, but this didn't provide much insight. I also increased the One thing I've noticed is "Failed to send address update" errors in the
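For anyone hitting something similar, these are the kinds of commands that can show whether Kubernetes and the Linkerd Destination controller agree on the endpoint set during an incident. The namespace, port, and deployment names below are placeholders, not from the original report:

```shell
# What Kubernetes thinks the endpoints for service-X are (placeholder namespace)
kubectl -n my-ns get endpoints service-X -o wide

# What the Destination controller is actually advertising to the proxies
linkerd diagnostics endpoints service-X.my-ns.svc.cluster.local:8080

# Destination controller logs around the time of the incident
kubectl -n linkerd logs deploy/linkerd-destination destination --since=15m

# Raise the proxy log level on a calling workload (takes effect when pods restart)
kubectl -n my-ns patch deploy my-caller --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"config.linkerd.io/proxy-log-level":"linkerd=debug,warn"}}}}}'
```

Comparing the first two outputs while the proxies are in fail-fast is the quickest way to tell whether the gap is between Kubernetes and the Destination controller, or between the Destination controller and the proxies.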
My questions are:
Any suggestions on how to debug this would be very much appreciated. Cheers, Tim.
-
I turned on
Under what conditions would it consider service-X to have NoEndpoints(true) when there are plenty of pods running?
-
Thanks for the detailed report @tgolly. There have been lots of improvements in the Destination controller since 2.12, so do you mind giving it a try with 2.13 and reporting back? 2.14 is also coming out next week; if you want to try that as well, you'd have to upgrade first to 2.13 and then to 2.14.
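For reference, a sequential CLI-based upgrade can be sketched roughly like this (assuming the control plane was installed with the `linkerd` CLI rather than Helm; the namespace and deployment names are placeholders):

```shell
# Install the target CLI version first, then confirm client/server versions
linkerd version

# Render and apply the upgraded control-plane manifests
linkerd upgrade | kubectl apply --prune -l linkerd.io/control-plane-ns=linkerd -f -

# Verify the control plane and data plane are healthy before the next hop
linkerd check

# Restart meshed workloads so they pick up the new proxy version
kubectl -n my-ns rollout restart deploy
```

Repeating these steps once per minor version (2.12 to 2.13, then 2.13 to 2.14) matches the one-hop-at-a-time path suggested above.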
Thanks @alpeb. Yes, upgrading to 2.13.x fixed the issue; we were hitting #10370.