Description
The OTel collector exports traces to Tempo, but some spans fail to be exported. On troubleshooting we observed that ingesters go down during scale-down activity, and at the same time we see the errors below on the distributor and on the OTel collector, with spans being refused.
Error on the Tempo distributor:

```
level=error ts=2026-02-21T12:41:.807200295Z caller=rate_limited_logger.go:38 msg="pusher failed to consume trace data" err="rpc error: code = Unknown desc = Ingester is shutting down"
```
Error on the OTel collector:

```
2026-02-21T12:41:40.214Z error internal/queue_sender.go:57 Exporting failed. Dropping data. {"resource": {"service.instance.id": "*********d24bf9d1e", "service.name": "otelcol-contrib", "service.version": "0.128.0"}, "otelcol.component.id": "otlphttp/****-processor-traces-tail-sampling", "otelcol.component.kind": "exporter", "otelcol.signal": "traces", "error": "not retryable error: Permanent error: rpc error: code = Unknown desc = error exporting items, request to http://**********-tempo-distributor.hyperion-traces:4318/v1/traces responded with HTTP Status Code 500, Message=Ingester is shutting down, Details=[]", "dropped_items": 32}
```
The rejected/refused spans are visible on the Tempo writes dashboard as below:
Importantly, we use KEDA autoscaling together with HPA for scaling, with different triggers for each. The autoscaling configuration used is as follows:
```yaml
autoscaling:
  enabled: true
  minReplicas: 7
  maxReplicas: 70
  targetCPUUtilizationPercentage: 60
  targetMemoryUtilizationPercentage: 60
  # -- Autoscaling via keda/ScaledObject
  keda:
    enabled: true
    pollingInterval: 30
    advanced:
      horizontalPodAutoscalerConfig:
        behavior:
          scaleDown:
            stabilizationWindowSeconds: 300
            policies:
              - type: Percent
                value: 30
                periodSeconds: 60
    fallback:
      failureThreshold: 5
      replicas: 15
    triggers:
      - type: prometheus
        metadata:
          serverAddress: http://*********
          metricName: total rate of push requests received by Tempo ingester from the distributor over the gRPC Push endpoint.
          query: sum(rate(tempo_request_duration_seconds_count{ pod="*****-ingester.", route="/tempopb.Pusher/Push."}[5m]))
          threshold: "1200"
          timeout: 60
      - type: prometheus
        metadata:
          serverAddress: http://************
          metricName: tempo_ingester_query_requests
          # This query monitors the rate of query requests handled by the Ingester component
          # via the /tempopb.Querier/.* route. It measures read path load on the ingester,
          # enabling KEDA to autoscale the Ingester to meet query demand without compromising performance.
          query: sum(rate(tempo_request_duration_seconds_count{job="tempo_ingester.*", route="/tempopb.Querier/.*"}[5m]))
          threshold: 600
          timeout: 60
      - type: cpu
        metadata:
          type: Utilization
          value: 60
      - type: memory
        metadata:
          type: Utilization
          value: 60
```
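One thing worth double-checking in the trigger queries above: in PromQL, `=` is an exact-match label selector, so selectors such as `job="tempo_ingester.*"` or `route="/tempopb.Pusher/Push."` only match a label whose value is literally that string; regex matching requires `=~`. A possible regex form of the first trigger query is sketched below (the label patterns here are illustrative placeholders, not our real values):

```yaml
query: sum(rate(tempo_request_duration_seconds_count{pod=~".*-ingester.*", route=~"/tempopb.Pusher/Push.*"}[5m]))
```

If the exact-match selectors never match, the Prometheus triggers return no data and scaling decisions fall back to the CPU/memory triggers and the `fallback` replica count.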
We have tried the mitigations below, but we continue to see the same problem.
- Added a lifecycle hook that executes shutdown before SIGTERM is sent to the pod:

  ```yaml
  lifecycle:
    preStop:
      httpGet:
        path: /shutdown
        port: http-metrics
        scheme: HTTP
  ```

- Config to shut down cleanly:

  ```yaml
  ingester:
    lifecycler:
      unregister_on_shutdown: true
  distributor:
    extend_writes: true
  ```

- Defined HPA scale-down behavior to scale down only one ingester every 30 minutes:

  ```yaml
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 10
      selectPolicy: Min
      policies:
        - type: Pods
          value: 1
          periodSeconds: 1800
  ```

- Added a PDB on top with `maxUnavailable: 1`.
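For completeness, the PodDisruptionBudget described above would look roughly like the sketch below (the name and selector labels are illustrative, not our actual values):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tempo-ingester
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
```

Note that a PDB only constrains voluntary evictions (drains, etc.); it does not prevent the HPA/KEDA from scaling the StatefulSet down.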
To Reproduce
Steps to reproduce the behavior:
- Start Tempo (SHA or version)
- Perform Operations (Read/Write/Others)
Expected behavior
Expected behaviour: when the "Ingester is shutting down" error occurs, the distributor should retry span ingestion against other ingester pods rather than the same pod. Because it does not, we are seeing data drops.
Environment:
- Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
- Deployment tool: [e.g., helm, jsonnet]
Additional Context
OTel collector version: app version 0.128.0, helm chart version 0.127.2
Replication factor: 3
Tempo-distributed chart version: 1.60.0, appVersion: 2.9.0
Reference:
#5518
#4493
#6344