troubleshoot/ingest/opentelemetry/edot-collector/trace-export-errors.md
38 additions & 8 deletions
@@ -25,22 +25,23 @@ These errors indicate the Collector is overwhelmed and unable to export data fas
## Causes
- This issue typically occurs when the `sending_queue` configuration is misaligned with the incoming telemetry volume.
+ This issue typically occurs when the `sending_queue` configuration or the Elasticsearch cluster scaling is misaligned with the incoming telemetry volume.
:::{important}
The sending queue is disabled by default in versions earlier than **v0.138.0** and enabled by default from **v0.138.0** onward. If you're using an earlier version, verify that `enabled: true` is explicitly set — otherwise any queue configuration will be ignored.
:::
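For pre-v0.138.0 configurations, a minimal sketch of that explicit setting on the Elasticsearch exporter might look like the following (the exporter key name depends on how your pipelines are configured):

```yaml
exporters:
  elasticsearch:
    # Versions earlier than v0.138.0 ignore all other queue settings
    # unless the queue is explicitly enabled.
    sending_queue:
      enabled: true
```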
Common contributing factors include:
- * `sending_queue.block_on_overflow` is not enabled (it defaults to `false`), so data is dropped when the queue is full.
+ * An underscaled Elasticsearch cluster is the most frequent cause of persistent export failures. If Elasticsearch cannot index data fast enough, the Collector’s queue fills up.
+ * `sending_queue.block_on_overflow` is disabled in **pre-v0.138.0** releases (defaults to `false`), which can lead to silent data drops. Starting from **v0.138.0**, the Elasticsearch exporter enables this setting by default.
* `num_consumers` is too low to keep up with the incoming data volume.
* The queue size (`queue_size`) is too small for the traffic load.
* Export batching is disabled, increasing processing overhead.
- * EDOT Collector resources (CPU, memory) are not sufficient for the traffic volume.
+ * EDOT Collector resources (CPU, memory) are insufficient for the traffic volume.
:::{note}
- Increasing the `timeout` value (for example from 30s to 90s) doesn't help if the queue itself is the bottleneck.
+ Increasing the `timeout` value (for example from 30s to 90s) doesn't help if the queue itself or Elasticsearch throughput is the bottleneck.
:::
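To make the queue-related factors above concrete, the following is an illustrative sketch of the relevant `sending_queue` settings; the numeric values are placeholders, not recommendations, and should be sized against your actual traffic:

```yaml
exporters:
  elasticsearch:
    sending_queue:
      enabled: true            # must be set explicitly on pre-v0.138.0 releases
      block_on_overflow: true  # apply backpressure instead of dropping data when the queue is full
      num_consumers: 20        # placeholder: workers draining the queue in parallel
      queue_size: 5000         # placeholder: size against expected burst volume
    # Batching options vary by Collector version; refer to the exporter documentation.
```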
## Resolution
@@ -61,16 +62,45 @@ sending_queue:
### For EDOT Collector v0.138.0 and later
- The `sending_queue` behavior is managed internally by the exporter. Adjusting its parameters has a limited effect on throughput. In these versions, the most effective optimizations are:
+ The Elasticsearch exporter provides default `sending_queue` parameters (including `block_on_overflow: true`), but these can and often should be tuned for specific workloads.
- * Increase Collector resources by ensuring the EDOT Collector pod has enough CPU and memory. Scale vertically (more resources) or horizontally (more replicas) if you experience backpressure.
+ The following steps can help identify and resolve export bottlenecks:
- * Optimize Elasticsearch performance by checking for indexing delays, rejected bulk requests, or cluster resource limits. Bottlenecks in {{es}} often manifest as Collector export timeouts.
+ :::::{stepper}
+ ::::{step} Check the Collector's internal metrics
+ If internal telemetry is enabled, review these metrics:
+ * `otelcol.elasticsearch.bulk_requests.latency` — high tail latency suggests Elasticsearch is the bottleneck. Check Elasticsearch cluster metrics and scale if necessary.
+ * `otelcol.elasticsearch.bulk_requests.count` and `otelcol.elasticsearch.flushed.bytes` — these help assess whether the Collector is sending too many or too large requests. Tune `sending_queue.num_consumers` or the batching configuration to balance throughput.
+ * `otelcol_exporter_queue_size` and `otelcol_exporter_queue_capacity` — if the queue runs near capacity but Elasticsearch is healthy, increase the queue size or the number of consumers.
+ For a complete list of available metrics, refer to the upstream OpenTelemetry metadata files for the [Elasticsearch exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/elasticsearchexporter/metadata.yaml) and [exporter helper](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/metadata.yaml).
+ ::::
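If internal telemetry is not already enabled, one possible way to expose these metrics is through the Collector's `service::telemetry` section. This is only a sketch that assumes a recent Collector release; the telemetry configuration schema has changed across versions:

```yaml
service:
  telemetry:
    metrics:
      level: detailed          # higher levels emit more granular exporter metrics
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888     # metrics served at http://<collector-host>:8888/metrics
```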
+ ::::{step} Scale the Collector's resources
+ * Ensure sufficient CPU and memory for the EDOT Collector.
+ * Scale vertically (more resources) or horizontally (more replicas) as needed.
+ ::::
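For Kubernetes-based deployments, these scaling levers typically translate into the Collector workload's pod spec. The fragment below is a hedged sketch; the container name and resource values are placeholders to adapt to your environment:

```yaml
# Fragment of a Kubernetes Deployment manifest for the EDOT Collector
spec:
  replicas: 3                    # horizontal scaling: more Collector replicas
  template:
    spec:
      containers:
        - name: otel-collector   # placeholder container name
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi        # vertical scaling: more CPU and memory per replica
```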
+ ::::{step} Optimize Elasticsearch performance
+ Address indexing delays, rejected bulk requests, or shard imbalances that limit ingestion throughput.
+ ::::
+ :::::
:::{tip}
- Focus tuning efforts on the Collector’s resource allocation and the downstream Elasticsearch cluster rather than queue parameters for v0.138.0+.
+ For **v0.138.0+**, focus tuning efforts on Elasticsearch performance, Collector resource allocation, and queue sizing informed by the internal telemetry metrics above.