ztunnel metrics cardinality explosion: stale pods in workload labels #1719

@romanpanov993

Description

By default, ztunnel exposes metrics at /stats/prometheus, including source_workload and destination_workload labels on the following metrics:

  • istio_tcp_received_bytes_total
  • istio_tcp_connections_closed_total
  • istio_tcp_connections_opened_total
  • istio_tcp_sent_bytes_total
  • istio_on_demand_dns_total

In my environment, these labels are populated with specific pod names.

Istio Version: 1.28.2

I am facing two critical issues:

  1. Stale metrics retention: ztunnel does not evict metrics for pods that have been deleted. As pods rotate, metric cardinality grows without bound. Eventually the metrics response exceeds the scraper's configured maximum size (max_scrape_size in my setup), leading to scrape failures and loss of observability.
  2. Telemetry API ignored: I attempted to mitigate this by applying a Telemetry resource to drop or override these high-cardinality labels (using tag_overrides). However, ztunnel seems to ignore these configurations, continuing to export raw pod names in the workload labels. Additionally, there appears to be no internal ztunnel configuration to toggle these labels off.
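For reference, the Telemetry resource I applied looks roughly like the following (a sketch; the resource name is illustrative, and the match/override fields follow the documented Telemetry API):

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: drop-workload-tags   # illustrative name
  namespace: istio-system
spec:
  metrics:
  - providers:
    - name: prometheus
    overrides:
    - match:
        metric: ALL_METRICS
        mode: CLIENT_AND_SERVER
      tagOverrides:
        source_workload:
          operation: REMOVE
        destination_workload:
          operation: REMOVE
```

Sidecar proxies honor this configuration, but ztunnel continues to export the raw workload labels.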

Impact:

Metrics collection completely breaks on nodes with long uptime or high pod turnover. The only current workaround is a manual restart of the ztunnel DaemonSet.
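The restart workaround is a plain DaemonSet rollout (assuming ztunnel runs in the default istio-system namespace):

```shell
# Reset ztunnel's in-memory metric state by restarting the DaemonSet.
kubectl -n istio-system rollout restart daemonset/ztunnel
```

This clears the accumulated series but obviously also resets all counters, so it is not a real fix.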

Expected behavior:

  • Eviction: ztunnel should automatically purge metrics associated with workloads that are no longer present in its xDS/workload state, or that have exceeded a retention period.
  • Configuration: ztunnel should respect Telemetry API configurations for dropping or overriding labels.
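To illustrate the eviction behavior I would expect, here is a minimal sketch in Python (hypothetical; ztunnel's actual implementation is in Rust and structured differently). Series are dropped when their workload disappears from the live workload set, or when they have not been updated within a retention window:

```python
import time


class MetricStore:
    """Hypothetical sketch of label-set eviction, not ztunnel's real code."""

    def __init__(self, retention_secs=3600):
        self.retention_secs = retention_secs
        # (metric name, workload) -> (counter value, last-seen timestamp)
        self.series = {}

    def record(self, metric, workload, value, now=None):
        now = time.time() if now is None else now
        old, _ = self.series.get((metric, workload), (0, now))
        self.series[(metric, workload)] = (old + value, now)

    def evict(self, live_workloads, now=None):
        """Drop series whose workload left the xDS snapshot or went stale."""
        now = time.time() if now is None else now
        self.series = {
            key: (val, seen)
            for key, (val, seen) in self.series.items()
            if key[1] in live_workloads and now - seen < self.retention_secs
        }


store = MetricStore(retention_secs=600)
store.record("istio_tcp_sent_bytes_total", "pod-a", 100, now=0)
store.record("istio_tcp_sent_bytes_total", "pod-b", 50, now=0)
# pod-a rotates out of the xDS workload state; pod-b remains live.
store.evict(live_workloads={"pod-b"}, now=10)
```

After eviction, only pod-b's series remains, so the exposed metric set tracks the live workload state instead of growing forever.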
