Commit f42aae4

Apply second round of comments
1 parent cf1b9e8 commit f42aae4

File tree

1 file changed: +38 -8 lines changed

troubleshoot/ingest/opentelemetry/edot-collector/trace-export-errors.md

Lines changed: 38 additions & 8 deletions
```diff
@@ -25,22 +25,23 @@ These errors indicate the Collector is overwhelmed and unable to export data fas
 
 ## Causes
 
-This issue typically occurs when the `sending_queue` configuration is misaligned with the incoming telemetry volume.
+This issue typically occurs when the `sending_queue` configuration or the Elasticsearch cluster scaling is misaligned with the incoming telemetry volume.
 
 :::{important}
 The sending queue is disabled by default in versions earlier than **v0.138.0** and enabled by default from **v0.138.0** onward. If you're using an earlier version, verify that `enabled: true` is explicitly set — otherwise any queue configuration will be ignored.
 :::
 
 Common contributing factors include:
 
-* `sending_queue.block_on_overflow` is not enabled (it defaults to `false`), so data is dropped when the queue is full.
+* An underscaled Elasticsearch cluster is the most frequent cause of persistent export failures. If Elasticsearch cannot index data fast enough, the Collector's queue fills up.
+* `sending_queue.block_on_overflow` is disabled in **pre-v0.138.0** releases (it defaults to `false`), which can lead to silent data drops. Starting from **v0.138.0**, the Elasticsearch exporter enables this setting by default.
 * `num_consumers` is too low to keep up with the incoming data volume.
 * The queue size (`queue_size`) is too small for the traffic load.
 * Export batching is disabled, increasing processing overhead.
-* EDOT Collector resources (CPU, memory) are not sufficient for the traffic volume.
+* EDOT Collector resources (CPU, memory) are insufficient for the traffic volume.
 
 :::{note}
-Increasing the `timeout` value (for example from 30s to 90s) doesn't help if the queue itself is the bottleneck.
+Increasing the `timeout` value (for example, from 30s to 90s) doesn't help if the queue itself or Elasticsearch throughput is the bottleneck.
 :::
 
 ## Resolution
```
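The queue parameters named in the bullet list above all live under the `elasticsearch` exporter's `sending_queue` block. As a rough sketch, assuming a pre-v0.138.0 Collector (where `enabled` and `block_on_overflow` must be set explicitly), such a configuration might look like this; the endpoint and numbers are illustrative placeholders, not tuned recommendations:

```yaml
exporters:
  elasticsearch:
    endpoint: https://elasticsearch.example.com:9200   # placeholder endpoint
    sending_queue:
      enabled: true             # must be set explicitly before v0.138.0
      block_on_overflow: true   # apply backpressure instead of silently dropping data
      num_consumers: 20         # illustrative; raise if consumers can't keep up with intake
      queue_size: 5000          # illustrative; size to the expected traffic bursts
```

On v0.138.0 and later the same keys apply, but according to the text above, `enabled` and `block_on_overflow` already default to `true` there.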
```diff
@@ -61,16 +62,45 @@ sending_queue:
 
 ### For EDOT Collector v0.138.0 and later
 
-The `sending_queue` behavior is managed internally by the exporter. Adjusting its parameters has a limited effect on throughput. In these versions, the most effective optimizations are:
+The Elasticsearch exporter provides default `sending_queue` parameters (including `block_on_overflow: true`), but these can, and often should, be tuned for specific workloads.
 
-* Increase Collector resources by ensuring the EDOT Collector pod has enough CPU and memory. Scale vertically (more resources) or horizontally (more replicas) if you experience backpressure.
+The following steps can help identify and resolve export bottlenecks:
 
-* Optimize Elasticsearch performance by checking for indexing delays, rejected bulk requests, or cluster resource limits. Bottlenecks in {{es}} often manifest as Collector export timeouts.
+:::::{stepper}
+
+::::{step} Check the Collector's internal metrics
+
+If internal telemetry is enabled, review these metrics:
+
+* `otelcol.elasticsearch.bulk_requests.latency` — high tail latency suggests Elasticsearch is the bottleneck. Check Elasticsearch cluster metrics and scale if necessary.
+
+* `otelcol.elasticsearch.bulk_requests.count` and `otelcol.elasticsearch.flushed.bytes` — these help assess whether the Collector is sending too many or overly large requests. Tune `sending_queue.num_consumers` or the batching configuration to balance throughput.
+
+* `otelcol_exporter_queue_size` and `otelcol_exporter_queue_capacity` — if the queue runs near capacity but Elasticsearch is healthy, increase the queue size or the number of consumers.
+
+* `otelcol_enqueue_failed_spans`, `otelcol_enqueue_failed_metric_points`, and `otelcol_enqueue_failed_log_records` — persistent enqueue failures indicate undersized queues or slow consumers.
+
+For a complete list of available metrics, refer to the upstream OpenTelemetry metadata files for the [Elasticsearch exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/elasticsearchexporter/metadata.yaml) and the [exporter helper](https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/metadata.yaml).
+::::
+
+::::{step} Scale the Collector's resources
+
+* Ensure sufficient CPU and memory for the EDOT Collector.
+* Scale vertically (more resources) or horizontally (more replicas) as needed.
+::::
+
+::::{step} Optimize Elasticsearch performance
+
+Address indexing delays, rejected bulk requests, or shard imbalances that limit ingestion throughput.
+::::
+
+:::::
 
 :::{tip}
-Focus tuning efforts on the Collector’s resource allocation and the downstream Elasticsearch cluster rather than queue parameters for v0.138.0+.
+For **v0.138.0+**, focus tuning efforts on Elasticsearch performance, Collector resource allocation, and queue sizing informed by the internal telemetry metrics above.
 :::
 
+
 ## Resources
 
 * [Upstream documentation - OpenTelemetry Collector configuration](https://opentelemetry.io/docs/collector/configuration)
```
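The `otelcol.*` and `otelcol_*` metrics listed in the first step are only available when the Collector's internal telemetry is enabled. A minimal sketch, assuming a recent Collector that accepts the `readers`-based telemetry configuration (older releases expose an `address` setting instead); the level, host, and port are illustrative:

```yaml
service:
  telemetry:
    metrics:
      level: detailed           # most verbose level; some exporter and queue metrics are optional
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0   # scrape endpoint exposing the Collector's own metrics
                port: 8888
```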
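For the scaling step, CPU, memory, and replica count are usually adjusted on the workload that runs the Collector. A hedged sketch of the relevant parts of a Kubernetes Deployment, assuming a hypothetical `edot-collector` Deployment and a placeholder image; the values are starting points to adapt, not recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edot-collector                   # hypothetical name; match your actual workload
spec:
  replicas: 3                            # horizontal scaling: more replicas share the load
  selector:
    matchLabels:
      app: edot-collector
  template:
    metadata:
      labels:
        app: edot-collector
    spec:
      containers:
        - name: otel-collector
          image: <edot-collector-image>  # placeholder; use your EDOT Collector image
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"                   # vertical scaling: raise limits if the pod is
              memory: 4Gi                # CPU-throttled or OOM-killed under load
```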
