
Memory spike on startup when there are a lot of replica sets #44407

@ilidemi

Description


Component(s)

processor/k8sattributes

Describe the issue you're reporting

We're running otel-collector in a pretty big k8s cluster with 16.7K replica sets. Recently we had to bump memory limits from 512MB to 1GB to avoid the collector pods getting stuck in a crash loop from requesting 600MB+, but after they get through the first minute they stabilize at ~180MB.

Capturing heap profiles every 500ms on startup revealed the following pattern (totals reported by pprof):

heap-1.pb.gz         191.72MB total
heap-2.pb.gz         191.72MB total
...
heap-11.pb.gz        191.72MB total
heap-12.pb.gz        191.72MB total
heap-13.pb.gz        341.29MB total
heap-14.pb.gz        341.29MB total
heap-15.pb.gz        341.29MB total
heap-16.pb.gz        341.29MB total
heap-17.pb.gz        341.29MB total
heap-18.pb.gz        341.29MB total
heap-19.pb.gz        53.83MB total
heap-20.pb.gz        53.83MB total
...
heap-51.pb.gz        51.10MB total
heap-52.pb.gz        51.10MB total

The first wave is mostly 157MB of (*runtime.Unknown).Unmarshal() in k8s client (text, svg).
The second wave is 157MB from (*runtime.Unknown).Unmarshal() and 176MB from (*v1.ReplicaSetList).Unmarshal() in k8s client (text, svg).
The final stable state is 23MB in kube.removeUnnecessaryReplicaSetData() in processor/k8sattributes (text, svg).

From what I understand of the k8sattributes processor and k8s client code, the collector needs metadata about the replica sets, and to get it, it lists all replica sets via the k8s client and then discards the fields it doesn't need. The k8s client does the list all at once without paging, so it needs to hold ~20KB per resource in memory (10KB for temporary Unknowns and 10KB for the decoded result), which is eventually stripped down to ~1.5KB per resource.
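As a sanity check on those numbers (the per-resource sizes above are my estimates, not exact measurements), here is the back-of-envelope arithmetic for 16.7K replica sets:

```go
package main

import "fmt"

func main() {
	// Assumed sizes from the profiles: ~20KB per replica set while the
	// list is being decoded, ~1.5KB after removeUnnecessaryReplicaSetData.
	const replicaSets = 16_700
	const listBytesPer = 20 * 1024
	const keptBytesPer = 1_536

	peakMB := float64(replicaSets*listBytesPer) / (1 << 20)
	stableMB := float64(replicaSets*keptBytesPer) / (1 << 20)
	fmt.Printf("peak ~%.0fMB, stable ~%.0fMB\n", peakMB, stableMB)
	// peak ~326MB, stable ~24MB
}
```

That lands in the same ballpark as the 341MB spike and the ~23MB stable state in the profiles above.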

Claude suggests it should be possible to request just the metadata fields from k8s, deserialize them into a lighter contract, and avoid having to provision a larger container just for the startup spike - example here. Is this something the team would be willing to consider?

