Autoscaling for telemetry gateways

**Description**
Manual scaling of the three gateways is supported. However, it requires the user to inspect the gateway health, which demands advanced knowledge, and it does not happen automatically on demand. Ideally, the gateways would scale up and down automatically depending on the load, which could also save resources.

In its simplest form, the gateway could be scaled by memory using a HorizontalPodAutoscaler (HPA). Only the proper tuning needs to be found, and the HPA would be managed by the operator. However, the collector should typically be scaled based on incoming requests. It should also not be scaled in response to backend problems such as backpressure, see https://opentelemetry.io/docs/collector/scaling/. So a better approach is to scale based on metrics that reflect the incoming load. This could be possible with Kubernetes mechanisms, using Prometheus with the prometheus-adapter or KEDA to feed the HPA controller with custom metrics. However, that would complicate the setup considerably.
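To make the memory-based option more concrete, here is a minimal Go sketch of the replica calculation that the Kubernetes documentation describes for the HPA: `desired = ceil(current * currentMetric / targetMetric)`, with no change while the ratio stays within a tolerance. The function names, the memory figures, and the replica bounds are hypothetical and only for illustration; the tolerance of 0.1 is assumed to match the upstream default. This is not the operator's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas follows the rule documented for the Kubernetes HPA:
// desired = ceil(current * currentMetric / targetMetric).
// If the ratio is within the tolerance around 1.0, the count is unchanged.
// All names and inputs here are hypothetical, for illustration only.
func desiredReplicas(current int, currentMetric, targetMetric, tolerance float64) int {
	if current <= 0 || targetMetric <= 0 {
		return current
	}
	ratio := currentMetric / targetMetric
	if math.Abs(ratio-1.0) <= tolerance {
		return current
	}
	return int(math.Ceil(float64(current) * ratio))
}

// clampReplicas keeps the result inside the configured min/max bounds.
func clampReplicas(n, minReplicas, maxReplicas int) int {
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Assumed example: 3 gateway replicas averaging 900 MiB each,
	// with a target of 600 MiB average memory per replica.
	n := desiredReplicas(3, 900, 600, 0.1)
	fmt.Println(clampReplicas(n, 2, 10))
}
```

The sketch also shows why memory alone is a weak signal: if memory grows because the backend applies backpressure and queues fill up, the same formula adds replicas even though more replicas do not relieve the backend.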

Another way to solve the scalability problem is to give up the concept of a central gateway and switch to a DaemonSet approach, where one instance runs per node and therefore scales naturally with the load. Such an approach has several other implications, such as removing Istio from the picture for load balancing across gateway replicas. Combined with a VerticalPodAutoscaler (VPA), this setup can bring many advantages over a centralized gateway with a complex autoscaling mechanism.

**Goal**
Have autoscaling of the gateway in place so that the user doesn't need to know when to scale manually. The scaling should be kept simple while still serving its purpose.

**Reasons**
It should not be the user's concern when to scale up or down.


Tasks:
- [x] Decouple the metric agent from the gateway (https://github.com/kyma-project/telemetry-manager/issues/1475)
- [x] Decide on the approach for scaling the gateway (see https://github.com/kyma-project/telemetry-manager/blob/main/docs/contributor/arch/019-switch-from-gateways-to-a-central-agent.md) by looking into these aspects:
   - [x] performance (https://github.com/kyma-project/telemetry-manager/issues/2656)
   - [x] daemonset rollout (https://github.com/kyma-project/telemetry-manager/issues/2710)
   - [x] routing (https://github.com/kyma-project/telemetry-manager/issues/2711)
- [ ] Convert the gateway into a DaemonSet
  - [x] Introduce a feature flag and add a new DaemonSet for logs only, with the Istio sidecar used only for outgoing traffic (https://github.com/kyma-project/telemetry-manager/issues/2748)
  - [x] Design a fitting reconciler architecture (https://github.com/kyma-project/telemetry-manager/issues/2944)
  - [x] Implement TracePipeline components for the central OTLP gateway architecture https://github.com/kyma-project/telemetry-manager/issues/3006
  - [x] Finalize the OTLP gateway by adding support for the metric gateway https://github.com/kyma-project/telemetry-manager/issues/2921
  - [x] Switch services to the new DaemonSet and allow node-local traffic only
  - [ ] Implement temporary cleanup logic for legacy resources during reconciliation (for example, gateway Deployments)
  - [x] Adjust the self-monitor and health checks to support both scenarios
  - [x] Adjust the tests so that both scenarios are testable
  - [ ] Adjust the load tests
  - [ ] Deprecate the manual scaling options, which no longer have any effect; remove them from the Busola views; possibly also print a warning via a validating webhook?
  - [ ] Adjust the documentation (including a security advisory on why mTLS is not needed for pushing data to the gateway)
  - [ ] Merge the feature branch https://github.com/kyma-project/telemetry-manager/pull/3148
- [ ] Clean up after the feature branch has been merged
  - [ ] Revert GHA changes in https://github.com/kyma-project/telemetry-manager/pull/3040
  - [ ] Remove legacy resources cleanup logic implemented in https://github.com/kyma-project/telemetry-manager/pull/3311
  - [ ] Remove legacy gateway metrics from RMO Victoria Metrics configuration after rollout to regular (https://github.tools.sap/kyma/runtime-monitoring-operator/pull/964)

A follow-up will be https://github.com/kyma-project/telemetry-manager/issues/3146

Autoscaling for telemetry gateways #424
