name: Logs exporting fails frequently
about: Trying to get log records into CloudWatch that follow the OTel structured logging format, to allow for monitoring.
title: Logs exporting fails every now and then.
labels: bug
Describe the bug
I'm using the OTEL endpoints provided by an AWS account with Transaction Search enabled.
This requires a non-default configuration file with the following config:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4317"

exporters:
  otlphttp/traces:
    compression: gzip
    traces_endpoint: https://xray.eu-west-1.amazonaws.com/v1/traces
    auth:
      authenticator: sigv4auth/traces
  otlphttp/logs:
    compression: gzip
    logs_endpoint: https://logs.eu-west-1.amazonaws.com/v1/logs
    auth:
      authenticator: sigv4auth/logs
    headers:
      x-aws-log-group: ${env:OTEL_LOG_GROUP_NAME}
      x-aws-log-stream: ${env:OTEL_LOG_STREAM_NAME}

extensions:
  sigv4auth/logs:
    region: "eu-west-1"
    service: "logs"
  sigv4auth/traces:
    region: "eu-west-1"
    service: "xray"

service:
  extensions: [sigv4auth/logs, sigv4auth/traces]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/traces]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/logs]
```
(I have left out my `awsemf` metrics configuration for brevity.)
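For completeness, the two header values above come from plain Lambda environment variables on the function; the names match the config, but the values here are only illustrative:

```
OTEL_LOG_GROUP_NAME=/example/otel-logs
OTEL_LOG_STREAM_NAME=example-stream
```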
I use the `arn:aws:lambda:eu-west-1:901920570463:layer:aws-otel-collector-amd64-ver-0-117-0:1` collector layer.
Most logs end up in the CloudWatch stream I created and are delivered without issue. This allows me to run a CloudWatch metric filter on top of the stream, using the `severityNumber` field to catch anything above warning level and trigger an alarm when we have an error (a sketch of the pattern is below).
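A minimal sketch of such a filter pattern, assuming the log events carry the OTLP JSON `severityNumber` field verbatim (ERROR starts at 17 in the OTel severity number range):

```
{ $.severityNumber >= 17 }
```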
However, some logs are dropped:
2025-09-02T18:10:53.223Z

```json
{
  "level": "error",
  "ts": 1756836653.2207713,
  "caller": "internal/base_exporter.go:128",
  "msg": "Exporting failed. Rejecting data. Try enabling sending_queue to survive temporary failures.",
  "kind": "exporter",
  "data_type": "logs",
  "name": "otlphttp/logs",
  "error": "request is cancelled or timed out failed to make an HTTP request: Post \"https://logs.eu-west-1.amazonaws.com/v1/logs\": EOF",
  "rejected_items": 4,
  "stacktrace": "go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*BaseExporter).Send\n\tgo.opentelemetry.io/collector/[email protected]/exporterhelper/internal/base_exporter.go:128\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewLogsRequest.func1\n\tgo.opentelemetry.io/collector/[email protected]/exporterhelper/logs.go:136\ngo.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/logs.go:26\ngo.opentelemetry.io/collector/consumer.ConsumeLogsFunc.ConsumeLogs\n\tgo.opentelemetry.io/collector/[email protected]/logs.go:26\ngo.opentelemetry.io/collector/receiver/otlpreceiver/internal/logs.(*Receiver).Export\n\tgo.opentelemetry.io/collector/receiver/[email protected]/internal/logs/otlp.go:41\ngo.opentelemetry.io/collector/pdata/plog/plogotlp.rawLogsServer.Export\n\tgo.opentelemetry.io/collector/[email protected]/plog/plogotlp/grpc.go:88\ngo.opentelemetry.io/collector/pdata/internal/data/protogen/collector/logs/v1._LogsService_Export_Handler.func1\n\tgo.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/logs/v1/logs_service.pb.go:311\ngo.opentelemetry.io/collector/config/configgrpc.(*ServerConfig).getGrpcServerOptions.enhanceWithClientInformation.func9\n\tgo.opentelemetry.io/collector/config/[email protected]/configgrpc.go:517\ngo.opentelemetry.io/collector/pdata/internal/data/protogen/collector/logs/v1._LogsService_Export_Handler\n\tgo.opentelemetry.io/collector/[email protected]/internal/data/protogen/collector/logs/v1/logs_service.pb.go:313\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/[email protected]/server.go:1405\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1815\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\tgoogle.golang.org/[email protected]/server.go:1035"
}
```
The error suggests I should consider the `sending_queue` feature, but that is part of the exporterhelper and not included in this collector. Other solutions suggested online, like the decouple processor or the batch processor, are also not part of this collector.
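For reference, this is roughly what enabling the queue would look like on the logs exporter, going by the upstream exporterhelper documentation (I have not been able to verify these options against this layer):

```yaml
exporters:
  otlphttp/logs:
    compression: gzip
    logs_endpoint: https://logs.eu-west-1.amazonaws.com/v1/logs
    sending_queue:
      enabled: true
      num_consumers: 2   # upstream default is 10
      queue_size: 100    # upstream default is 1000
    retry_on_failure:
      enabled: true
```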
One mention on the internet said the timeout might be caused by insufficient memory, but the Lambda currently has 1024 MB of memory and runs only very simple Python "fetch a record from DynamoDB and return it" code.
Steps to reproduce
1. Create a Lambda with a Python runtime (a minimal handler sketch follows this list).
2. Let it log a line at info level when invoked.
3. Call the Lambda multiple times.
4. Wait for the issue to arise.
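A minimal handler sketch matching the setup described above (the table name and key are hypothetical; any handler that emits one info-level log line per invocation should do):

```python
import logging

import boto3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-table")  # hypothetical table name


def handler(event, context):
    # Fetch a record from DynamoDB and return it, logging one info line.
    item = table.get_item(Key={"pk": event["pk"]}).get("Item")
    logger.info("fetched record for pk=%s", event.get("pk"))
    return item
```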
The collector side will say

```
request is cancelled or timed out failed to make an HTTP request: Post "https://logs.eu-west-1.amazonaws.com/v1/logs": EOF
```
and the Python code will say

```json
"level": "ERROR",
"message": "Failed to export logs to localhost:4317, error code: StatusCode.DEADLINE_EXCEEDED",
"logger": "opentelemetry.exporter.otlp.proto.grpc.exporter",
"requestId": "ad9c1922-719c-441e-b0e7-14315824d366",
"otelSpanID": "0",
"otelTraceID": "0",
```
and two log messages will be missing from CloudWatch but available in the stdout output from the Lambda.
What did you expect to see?
I would expect it to be possible to use logging in a way that gives me actual log records with `severityNumber` in CloudWatch, so I can effectively monitor for anything above warning level, or for records that do not have a `severityNumber` present in the log stream.
What did you see instead?
Intermittent failures when exporting logs to the OTEL log endpoints in AWS.
What version of collector/language SDK version did you use?
The `arn:aws:lambda:eu-west-1:901920570463:layer:aws-otel-collector-amd64-ver-0-117-0:1` collector layer, and the following Python libraries:
```
opentelemetry-api==1.36.0 \
opentelemetry-distro==0.57b0 \
opentelemetry-exporter-otlp==1.36.0 \
opentelemetry-exporter-otlp-proto-common==1.36.0 \
opentelemetry-exporter-otlp-proto-grpc==1.36.0 \
opentelemetry-exporter-otlp-proto-http==1.36.0 \
opentelemetry-instrumentation==0.57b0 \
opentelemetry-instrumentation-asgi==0.57b0 \
opentelemetry-instrumentation-aws-lambda==0.57b0 \
opentelemetry-instrumentation-botocore==0.57b0 \
opentelemetry-instrumentation-fastapi==0.57b0 \
opentelemetry-instrumentation-logging==0.57b0 \
opentelemetry-propagator-aws-xray==1.0.2 \
opentelemetry-propagator-b3==1.36.0 \
opentelemetry-proto==1.36.0 \
opentelemetry-sdk==1.36.0 \
opentelemetry-sdk-extension-aws==2.1.0 \
opentelemetry-semantic-conventions==0.57b0 \
opentelemetry-util-http==0.57b0
```
What language layer did you use?
Python
Additional context
I'm currently bound to aws-otel-lambda because I can't find a way to get CloudWatch metrics to work based on OpenTelemetry without using the `awsemf` exporter, and that exporter is not in the community collector layer for Lambda.