Hello, this is a continuation of this Slack thread (the thread is now close to 90 days old and will soon be deleted).
We are seeing memory usage spikes on edge-25.1.1, and previously on 2.14.10. “linkerd_reconnect: Failed to connect” error logs also show up, but infrequently.
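One way to gauge how frequent those errors are (assuming they come from the proxy sidecar in the destination pods; `deploy/` picks a single replica, so repeat per pod or adjust the target if they appear elsewhere):

```sh
# Count reconnect errors over the last 12 hours in one destination replica's
# proxy sidecar. Assumption: that is where these log lines originate.
kubectl logs -n linkerd deploy/linkerd-destination -c linkerd-proxy --since=12h \
  | grep -c "linkerd_reconnect: Failed to connect"
```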
We are seeing destination memory usage scale in close correlation with general workload scaling (see the screenshot showing destination memory usage on top and cluster node count at the bottom).
Usage from the linkerd-proxy and policy containers remains generally stable.
Usage from the linkerd-proxy-injector deployment also scales in a similar way, though I believe that is expected.
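For a quick spot-check of per-container usage (independent of our dashboards), something like the following should work, assuming metrics-server is installed and the standard `linkerd.io/control-plane-component` label is in place:

```sh
# Per-container memory for the destination pods; the label selector is the
# one Linkerd applies to its own control-plane components.
kubectl top pod -n linkerd -l linkerd.io/control-plane-component=destination --containers
```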
See this 12-hour sample window:
Memory usage for the 3 linkerd-destination replicas (the biggest change is observed in the destination container; the linkerd-proxy containers also scale somewhat proportionally, but with much smaller spikes, e.g. 9MiB to 9.6MiB and back down to 9MiB):
There is some churn to match; see node count and running pod count over the same time window (the majority of this churn is Jenkins agents):
Out of those pods, only about 9 are meshed (not counting the Linkerd control plane/multicluster/viz), and that number stays constant. The meshed pods are 7 single-replica Grafana deployments and 1 thanos-query deployment with 2 replicas, which is used as a Grafana data source.
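For completeness, one way to cross-check that meshed-pod count is to enumerate pods that carry a linkerd-proxy sidecar, roughly like this (a sketch; filter out the linkerd/viz/multicluster namespaces to match the number above):

```sh
# List every pod with an injected linkerd-proxy container.
kubectl get pods -A -o json \
  | jq -r '.items[]
           | select(any(.spec.containers[]; .name == "linkerd-proxy"))
           | "\(.metadata.namespace)/\(.metadata.name)"'
```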
Is this linkerd-destination memory usage pattern expected, or am I misunderstanding the scope of linkerd-destination in the stack? The architecture docs state:
The destination service is used by the data plane proxies to determine various aspects of their behavior. It is used to fetch service discovery information (i.e. where to send a particular request and the TLS identity expected on the other end); to fetch policy information about which types of requests are allowed; to fetch service profile information used to inform per-route metrics, retries, and timeouts; and more.
We would not expect traffic through those meshed pods to scale with the pod churn (at least not that consistently), and those pods should not be producing traffic to/from meshed pods, as the Jenkins agents and Prometheus are not meshed.
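For what it's worth, the Linkerd CLI's diagnostics subcommands can show what the destination controller is serving and expose its raw metrics; the Grafana authority below is just a placeholder for one of our meshed services:

```sh
# Show the discovery data the destination service hands to proxies for a
# given authority (placeholder service/namespace/port).
linkerd diagnostics endpoints grafana.monitoring.svc.cluster.local:3000

# Dump raw control-plane metrics (includes the destination controller's Go
# memory stats) to correlate heap growth with the pod churn.
linkerd diagnostics controller-metrics > controller-metrics.txt
```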
I have read through the other linkerd-destination memory discussions and issues but did not see anything applicable to us on this Linkerd version: #12924, #11129, #12104, #11315, #9947, #8270, #5939