@natemollica-nm natemollica-nm commented Sep 25, 2024

NET-11126: Excessive Upstream Cluster CPU/Memory Overhead

Configure consul-dataplane sidecar proxies and gateways to set the cluster manager's enable_deferred_cluster_creation option to true by default. This limits upstream cluster initialization overhead when a large number of upstream clusters are available or broadcast between peered clusters.


Context:

  • Kubernetes/OpenShift clusters hosting consul-k8s service-mesh
  • Service redundancy and failover are configured via the peering connection using exported services.
  • As consul-k8s clusters grow in mesh service count and the number of exported services increases, the number of known clusters that a consul-dataplane sidecar proxy must initialize at startup also increases.
    • This causes CPU and memory spikes at startup. Where Kubernetes/OpenShift resource quota limits apply, these spikes introduce unwanted latency and delays, and can cause application container restarts before the sidecar is fully online.

Improvements:

  • Lower CPU/memory utilization for dataplane sidecar proxies at startup
  • Reduced startup times for sidecar proxies
  • Reduced resource consumption at startup, which encourages the use of resource quotas and minimizes costs

Example of upstream cluster counts being initialized during startup for a production-scale cluster with peering service exports:

thread_local_cluster_manager.worker_0.clusters_inflated: 218
thread_local_cluster_manager.worker_1.clusters_inflated: 218
thread_local_cluster_manager.worker_2.clusters_inflated: 218
thread_local_cluster_manager.worker_3.clusters_inflated: 218

The sidecar container has to process and register all 218 clusters within its Envoy configuration before coming fully online, whether or not the clusters are required for normal operation.
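To make the per-worker cost concrete, here is a minimal Python sketch that parses stats lines in the format shown above and totals the inflated clusters across workers. The function name `inflated_per_worker` is hypothetical, and the sketch assumes the stats arrive as plain `name: value` text, one stat per line, as in the excerpt.

```python
import re

# Matches lines such as:
#   thread_local_cluster_manager.worker_0.clusters_inflated: 218
STAT_RE = re.compile(
    r"^thread_local_cluster_manager\.(worker_\d+)\.clusters_inflated:\s*(\d+)$"
)


def inflated_per_worker(stats_text):
    """Return {worker_name: clusters_inflated} for every matching stats line."""
    counts = {}
    for line in stats_text.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


# The four stat lines from the excerpt above.
SAMPLE_STATS = """\
thread_local_cluster_manager.worker_0.clusters_inflated: 218
thread_local_cluster_manager.worker_1.clusters_inflated: 218
thread_local_cluster_manager.worker_2.clusters_inflated: 218
thread_local_cluster_manager.worker_3.clusters_inflated: 218
"""

if __name__ == "__main__":
    counts = inflated_per_worker(SAMPLE_STATS)
    print(counts)
    print("total across workers:", sum(counts.values()))
```

With the sample above, each of the four workers inflates the same 218 clusters, so the per-worker work is repeated once per worker thread.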


Consul Dataplane Logs

{"@timestamp":"2024-09-16T13:29:26.535638Z+00:00","@module":"envoy.main","@level":"info","@message":"starting main dispatch loop","thread":31}
{"@timestamp":"2024-09-16T13:29:26.837042Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: add 140 cluster(s), remove 0 cluster(s)","thread":31}
{"@timestamp":"2024-09-16T13:30:45.735353Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: added/updated 140 cluster(s), skipped 0 unmodified cluster(s)","thread":31}
{"@timestamp":"2024-09-16T13:30:45.735391Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cm init: initializing secondary clusters","thread":31}
{"@timestamp":"2024-09-16T13:30:45.739702Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
---- (message repeats for a variable duration, from seconds to minutes; cut for brevity) ----
{"@timestamp":"2024-09-16T13:30:45.740925Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740932Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740941Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740949Z+00:00","@module":"envoy.config","@level":"warning","@message":"gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740958Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cm init: all clusters initialized","thread":31}
{"@timestamp":"2024-09-16T13:30:45.740968Z+00:00","@module":"envoy.main","@level":"info","@message":"all clusters initialized. initializing init manager","thread":31}
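The gap between the "cds: add" and "cds: added/updated" messages above shows how long the proxy spent processing the initial cluster set. The following Python sketch computes that gap from the two log lines. The function names are hypothetical, and the timestamp format string is an assumption based only on the log lines shown here (a literal `Z+00:00` suffix).

```python
import json
from datetime import datetime

# Assumed from the log excerpt: microsecond precision plus a literal "Z+00:00".
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ+00:00"


def parse_timestamp(record):
    """Parse the '@timestamp' field of one decoded log record."""
    return datetime.strptime(record["@timestamp"], TS_FORMAT)


def cds_duration_seconds(log_lines):
    """Seconds between the 'cds: add' and 'cds: added/updated' messages."""
    start = end = None
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as the "cut for brevity" marker
        message = record.get("@message", "")
        if message.startswith("cds: add "):
            start = parse_timestamp(record)
        elif message.startswith("cds: added/updated"):
            end = parse_timestamp(record)
    if start is None or end is None:
        raise ValueError("both CDS messages are required")
    return (end - start).total_seconds()


# The two CDS lines copied from the log excerpt above.
SAMPLE_LOGS = [
    '{"@timestamp":"2024-09-16T13:29:26.837042Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: add 140 cluster(s), remove 0 cluster(s)","thread":31}',
    '{"@timestamp":"2024-09-16T13:30:45.735353Z+00:00","@module":"envoy.upstream","@level":"info","@message":"cds: added/updated 140 cluster(s), skipped 0 unmodified cluster(s)","thread":31}',
]

if __name__ == "__main__":
    print(f"{cds_duration_seconds(SAMPLE_LOGS):.3f} s")
```

For the excerpt above this works out to roughly 79 seconds spent adding 140 clusters before secondary cluster initialization begins.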

This PR adds the following entry to the bootstrap template, so that all dataplane proxy instances defer cluster initialization and reduce CPU/memory overhead.

enable_deferred_cluster_creation
(bool) Whether the ClusterManager will create clusters on the worker threads inline during requests. This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.

  "cluster_manager": {
    "enable_deferred_cluster_creation": true
  }
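As an illustration of where this entry sits, the sketch below applies the same setting to an Envoy bootstrap document held as a Python dict. It is not the change made in this PR, which edits the bootstrap template directly. The helper name and the sample `node` content are hypothetical; only the `cluster_manager` key and the `enable_deferred_cluster_creation` field come from the entry above.

```python
import json


def with_deferred_cluster_creation(bootstrap):
    """Return a copy of the bootstrap dict with deferred cluster creation on.

    Any existing cluster_manager settings are preserved; the input dict is
    not modified.
    """
    updated = dict(bootstrap)
    cluster_manager = dict(updated.get("cluster_manager", {}))
    cluster_manager["enable_deferred_cluster_creation"] = True
    updated["cluster_manager"] = cluster_manager
    return updated


if __name__ == "__main__":
    # Hypothetical minimal bootstrap; a real one contains much more.
    sample = {"node": {"id": "example-sidecar"}}
    print(json.dumps(with_deferred_cluster_creation(sample), indent=2))
```

Copying the nested dict rather than mutating it keeps the helper safe to call on a shared template object.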
