diff --git a/troubleshoot/ingest/opentelemetry/edot-collector/trace-export-errors.md b/troubleshoot/ingest/opentelemetry/edot-collector/trace-export-errors.md
new file mode 100644
index 0000000000..b0fcb1ab3a
--- /dev/null
+++ b/troubleshoot/ingest/opentelemetry/edot-collector/trace-export-errors.md
@@ -0,0 +1,95 @@
+---
+navigation_title: Trace export errors
+description: Learn how to resolve trace export failures caused by `sending_queue` overflow and Elasticsearch exporter timeouts in the EDOT Collector.
+applies_to:
+  serverless: all
+  product:
+    edot_collector: ga
+products:
+  - id: observability
+  - id: edot-collector
+---
+
+# Trace export errors from the EDOT Collector
+
+During high-traffic or load-testing scenarios, the EDOT Collector might fail to export trace data to {{es}}. This typically happens when the internal queue for outgoing data fills up faster than it can be drained, resulting in timeouts and dropped data.
+
+## Symptoms
+
+You might see one or more of the following messages in the EDOT Collector logs:
+
+* `bulk indexer flush error: failed to execute the request: context deadline exceeded`
+* `Exporting failed. Rejecting data. sending queue is full`
+* Repeated `otelcol.signal: "traces"` errors from the exporter
+
+These errors indicate that the Collector is overwhelmed and unable to export traces fast enough, leading to queue overflows and data loss.
+
+## Causes
+
+This issue typically occurs when the `sending_queue` configuration is misaligned with the incoming trace volume. Common contributing factors include:
+
+* `sending_queue.block_on_overflow` is not enabled (it defaults to `false`), so data is dropped when the queue is full.
+* `num_consumers` is too low to drain the queue as fast as traces arrive.
+* The queue size (`queue_size`) is too small for the traffic load.
+* Export batching is disabled, increasing processing overhead.
+* EDOT Collector resources (CPU, memory) are not sufficient for the traffic volume.
+
+:::{note}
+Increasing the `timeout` value (for example, from 30s to 90s) doesn't help if the queue itself is the bottleneck.
+:::
+
+## Resolution
+
+Update the EDOT Collector configuration as follows:
+
+:::::{stepper}
+
+::::{step} Enable `block_on_overflow`
+
+Prevent silent trace drops by enabling blocking behavior when the queue is full:
+
+```yaml
+sending_queue:
+  enabled: true
+  queue_size: 1000
+  num_consumers: 10
+  block_on_overflow: true
+```
+::::
+
+::::{step} Increase `num_consumers`
+
+Raise the number of queue consumers to increase parallel processing of queued items. Start with 20–30 and adjust based on throughput and resource usage.
+
+::::
+
+::::{step} Tune `queue_size`
+
+Increase the queue size to absorb spikes in trace volume. Make sure the Collector has enough memory to support the larger buffer.
+
+::::
+
+::::{step} Enable batching
+
+If not already enabled, configure batching to reduce the per-span export cost and improve throughput.
+
+::::
+
+::::{step} Check resource allocation
+
+Verify that the EDOT Collector pod has enough CPU and memory. Increase limits or scale out the deployment if necessary.
+
+::::
+
+::::{step} Evaluate {{es}} performance
+
+Check for indexing delays or errors on the {{es}} side. Bottlenecks there can also contribute to timeouts and queue buildup.
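+
+One way to confirm this is to look for `write` thread pool queueing or rejections. The request below is a minimal example, not a full {{es}} tuning guide, and it assumes you can call cluster APIs directly, which might not be possible on every deployment type:
+
+```console
+GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed
+```
+
+Sustained `queue` growth or a nonzero `rejected` count suggests that {{es}}, rather than the Collector, is the bottleneck.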
+ +:::: + +::::: + + +## Resources + +* [Upstream documentation - OpenTelemetry Collector configuration](https://opentelemetry.io/docs/collector/configuration) \ No newline at end of file diff --git a/troubleshoot/ingest/opentelemetry/toc.yml b/troubleshoot/ingest/opentelemetry/toc.yml index 6c363e1417..778d0d06ec 100644 --- a/troubleshoot/ingest/opentelemetry/toc.yml +++ b/troubleshoot/ingest/opentelemetry/toc.yml @@ -10,6 +10,7 @@ toc: - file: edot-collector/enable-debug-logging.md - file: edot-collector/collector-not-starting.md - file: edot-collector/misconfigured-sampling-collector.md + - file: edot-collector/trace-export-errors.md - file: edot-sdks/index.md children: - file: edot-sdks/android/index.md