Description
Manual scaling of the three gateways is supported. However, it requires the user to inspect gateway health, which demands advanced knowledge, and scaling does not happen automatically on demand. Ideally, the gateways would scale up and down automatically depending on the load, which could also save resources.
In its simplest form, the gateway could be scaled by memory usage using an HPA. Here, only the proper tuning needs to be found, and the HPA would be managed by the operator. However, a collector should typically be scaled based on incoming requests. It should also not be scaled in response to backend problems such as backpressure; see https://opentelemetry.io/docs/collector/scaling/. So a better approach is to scale based on metrics. This could be done with Kubernetes mechanisms, using Prometheus with the prometheus-adapter, or KEDA, to feed the HPA controller with custom metrics. However, that would complicate the setup considerably.
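To make the metric-based option more concrete, here is a minimal sketch of the replica calculation the Kubernetes HPA controller documents: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default). The function name, the per-replica "accepted items per second" metric, and the min/max bounds are hypothetical and chosen only for illustration; they are not taken from the telemetry-manager code.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     tolerance=0.1, min_replicas=1, max_replicas=10):
    """Approximate the documented HPA replica calculation.

    current_value and target_value are the average per-replica metric
    (here assumed to be accepted items per second, a hypothetical choice).
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")

    ratio = current_value / target_value

    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds (minReplicas / maxReplicas).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 3 replicas, each receiving 1500 items/s against a target of 1000 items/s.
    print(desired_replicas(3, 1500, 1000))
```

Note that this only describes the arithmetic. The harder part named above, choosing a metric that reflects incoming load rather than backend backpressure and exposing it to the HPA controller, is not covered by the sketch.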
Another approach to solving the scalability problem is to give up the concept of a central gateway and switch to a DaemonSet, where one instance runs per node and therefore scales naturally with the load. This approach affects several other aspects, such as removing Istio from the picture for load balancing across gateway replicas. Combined with a VPA, this setup can bring many advantages over a centralized gateway with a complex autoscaling mechanism.
Goal
Have autoscaling of the gateway in place so that users don't need to learn when to scale manually. The scaling should be kept simple while still serving its purpose.
Reasons
It should not be the user's concern when to scale up or down.
Tasks:
- Decouple the metric agent from the gateway (Decouple the metric agent from the metric gateway #1475)
- Decide how to scale the gateway (see https://github.com/kyma-project/telemetry-manager/blob/main/docs/contributor/arch/019-switch-from-gateways-to-a-central-agent.md) by looking into these aspects:
  - performance (scaling of gateways decision - performance #2656)
  - daemonset rollout (Scaling of gateways decision - rollout #2710)
  - routing (Scaling of gateways decision - routing #2711)
- Convert the gateway into a DaemonSet
- Introduce a feature flag and add a new DaemonSet for logs only, with the Istio sidecar used only for outgoing traffic ([Daemonset]: First daemonset version behind feature flag for logs #2748)
- Design a fitting reconciler architecture (DaemonSet: Reconciler Architecture #2944)
- Implement TracePipeline components for the central OTLP gateway architecture (DaemonSet: Implement TracePipeline components for the central OTLP gateway architecture #3006)
- Finalize the OTLP gateway by adding support for the metric gateway ([DaemonSet] add support for the metric gateway #2921)
- Switch services to the new DaemonSet and have node-local traffic only
- Implement temporary cleanup logic for legacy resources during reconciliation (for example, gateway Deployments)
- Adjust the self-monitor and health checks to support both scenarios
- Adjust the tests so that both scenarios are testable
- Adjust the load tests
- Deprecate the manual scaling options, which no longer have any effect; remove them from the Busola views; also print a warning through the validating webhook?
- Adjust documentation (including a security advisory on why mTLS is not needed for pushing data to the gateway)
- Merge the feature branch feat: autoscaling for telemetry gateways (new centralized OTLPGateway architecture) #3148
- Clean-up after feature branch has been merged
- Revert GHA changes in chore: add centralized architecture feature branch to GHA workflows #3040
- Remove legacy resources cleanup logic implemented in chore: leftovers for centralized OTLPGateway refactoring #3311
- Remove legacy gateway metrics from the RMO Victoria Metrics configuration after rollout to the regular channel (https://github.tools.sap/kyma/runtime-monitoring-operator/pull/964)
A follow-up will be #3146