Description
The OTel collector exports traces to Tempo, but some spans fail to be exported. On troubleshooting we observed that ingesters go down during scale-down activity, and at the same time we see the errors below on the distributor and on the OTel collector, with spans being refused.
Error on the Tempo distributor:

```
level=error ts=2026-02-21T12:41:.807200295Z caller=rate_limited_logger.go:38 msg="pusher failed to consume trace data" err="rpc error: code = Unknown desc = Ingester is shutting down"
```
Error on the OTel collector:

```
2026-02-21T12:41:40.214Z error internal/queue_sender.go:57 Exporting failed. Dropping data. {"resource": {"service.instance.id": "*********d24bf9d1e", "service.name": "otelcol-contrib", "service.version": "0.128.0"}, "otelcol.component.id": "otlphttp/****-processor-traces-tail-sampling", "otelcol.component.kind": "exporter", "otelcol.signal": "traces", "error": "not retryable error: Permanent error: rpc error: code = Unknown desc = error exporting items, request to http://**********-tempo-distributor.hyperion-traces:4318/v1/traces responded with HTTP Status Code 500, Message=Ingester is shutting down, Details=[]", "dropped_items": 32}
```
The rejected/refused spans are visible on the Tempo writes dashboard as below:
Importantly, we use KEDA autoscaling together with HPA for scaling, with different triggers for each. The autoscaling configuration used is as follows:
```yaml
autoscaling:
  enabled: true
  minReplicas: 7
  maxReplicas: 70
  targetCPUUtilizationPercentage: 60
  targetMemoryUtilizationPercentage: 60
  # -- Autoscaling via keda/ScaledObject
  keda:
    enabled: true
    pollingInterval: 30
    advanced:
      horizontalPodAutoscalerConfig:
        behavior:
          scaleDown:
            stabilizationWindowSeconds: 300
            policies:
              - type: Percent
                value: 30
                periodSeconds: 60
    fallback:
      failureThreshold: 5
      replicas: 15
    triggers:
      - type: prometheus
        metadata:
          serverAddress: http://*********
          metricName: total rate of push requests received by Tempo ingester from the distributor over the gRPC Push endpoint.
          query: sum(rate(tempo_request_duration_seconds_count{ pod="*****-ingester.", route="/tempopb.Pusher/Push."}[5m]))
          threshold: "1200"
          timeout: 60
      - type: prometheus
        metadata:
          serverAddress: http://************
          metricName: tempo_ingester_query_requests
          # This query monitors the rate of query requests handled by the Ingester component
          # via the /tempopb.Querier/.* route. It measures read path load on the ingester,
          # enabling KEDA to autoscale the Ingester to meet query demand without compromising performance.
          query: sum(rate(tempo_request_duration_seconds_count{job="tempo_ingester.*", route="/tempopb.Querier/.*"}[5m]))
          threshold: 600
          timeout: 60
      - type: cpu
        metadata:
          type: Utilization
          value: 60
      - type: memory
        metadata:
          type: Utilization
          value: 60
```
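One thing worth double-checking in the trigger queries above: in PromQL, `=` is an exact-match label selector, so selectors such as `job="tempo_ingester.*"` or `route="/tempopb.Pusher/Push."` only match a label whose value is literally that string; regex matching requires `=~`. A possible regex form of the first trigger query is sketched below (the label patterns here are illustrative placeholders, not our real values):

```yaml
query: sum(rate(tempo_request_duration_seconds_count{pod=~".*-ingester.*", route=~"/tempopb.Pusher/Push.*"}[5m]))
```

If the exact-match selectors never match, the Prometheus triggers return no data and scaling decisions fall back to the CPU/memory triggers and the `fallback` replica count.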
We have tried the mitigations below, but we continue to see the same problem.
- Added a lifecycle hook that executes shutdown before SIGTERM is sent to the pod:

  ```yaml
  lifecycle:
    preStop:
      httpGet:
        path: /shutdown
        port: http-metrics
        scheme: HTTP
  ```

- Config to shut down cleanly:

  ```yaml
  ingester:
    lifecycler:
      unregister_on_shutdown: true
  distributor:
    extend_writes: true
  ```

- Defined HPA scale-down behavior to scale down only one ingester every 30 minutes:

  ```yaml
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 10
      selectPolicy: Min
      policies:
        - type: Pods
          value: 1
          periodSeconds: 1800
  ```

- Added a PDB on top with `maxUnavailable: 1`.
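For completeness, the PodDisruptionBudget described above would look roughly like the sketch below (the name and selector labels are illustrative, not our actual values):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tempo-ingester
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
```

Note that a PDB only constrains voluntary evictions (drains, etc.); it does not prevent the HPA/KEDA from scaling the StatefulSet down.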
To Reproduce
Steps to reproduce the behavior:
- Start Tempo (SHA or version)
- Perform Operations (Read/Write/Others)
Expected behavior
Expected behaviour: when the "Ingester is shutting down" error occurs, the distributor should retry span ingestion against other ingester pods rather than the same pod. Because it does not, we are seeing data drops.
Environment:
- Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
- Deployment tool: [e.g., helm, jsonnet]
Additional Context
OTel collector version: app version 0.128.0, helm chart version 0.127.2
Replication factor: 3
Tempo-distributed chart version: 1.60.0, appVersion: 2.9.0
Reference:
#5518
#4493
#6344