Description
I deployed the OpenTelemetry Collector sidecar on Cloud Run alongside my Go services, following a combination of https://cloud.google.com/run/docs/tutorials/custom-metrics-opentelemetry-sidecar and https://cloud.google.com/stackdriver/docs/instrumentation/opentelemetry-collector-cloud-run.
Now I'm seeing otelcol_exporter_send_failed_metric_points_total increase steadily, which correlates strongly with log messages like this one:
{"service.instance.id": "7dfc51ac-42d9-4f69-8f56-baa17c5cde63", "service.name": "otelcol-google", "service.version": "0.128.0"}, "otelcol.component.id": "googlemanagedprometheus", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: timeSeries[0-37] (example metric.type=\"prometheus.googleapis.com/otelcol_processor_outgoing_items_total/counter\", metric.labels={\"otel_scope_name\": \"go.opentelemetry.io/collector/processor/processorhelper\", \"otel_signal\": \"metrics\", \"processor\": \"memory_limiter\", \"otel_scope_version\": \"\"}): write for resource=prometheus_target{job:otelcol-google,cluster:__run__,location:us-central1,instance:0069c7a988553fb09ee223f351cf533dd701f7998934c73fbeb3508cd65e77be20da697145fab3b44c65072d0f5d6056373db26ab4e29f4c646e5da17617d44860317b6624af1c728380955e,namespace:} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric. {Metric: prometheus.googleapis.com/otel_scope_info/gauge, Timestamps: {Youngest Existing: '2025/07/01-12:49:32.018', New: '2025/07/01-12:49:32.018'}}\nerror details: name = Unknown desc = total_point_count:38 success_point_count:36 errors:{status:{code:9} point_count:2}", "dropped_items": 30}
The Youngest Existing timestamp exactly matches the New timestamp. (Sometimes the New timestamp can be a couple of milliseconds newer.)
I manually went through some of the log messages and only saw this happen for 2 metrics:
- prometheus.googleapis.com/otel_scope_info/gauge
- prometheus.googleapis.com/target_info/gauge
The main impact is that, with otelcol_exporter_send_failed_metric_points_total constantly incrementing, I'm not sure whether I'm losing any useful data points.
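Since both affected metrics are informational (target_info and otel_scope_info carry resource/scope metadata rather than measurements), one workaround I'm considering is dropping them before export with the filter processor. This is a sketch, assuming the filter processor is included in the otelcol-google distribution; the processor name filter/drop_info_metrics is my own choice:

processors:
  filter/drop_info_metrics:
    # Ignore OTTL evaluation errors rather than dropping whole batches.
    error_mode: ignore
    metrics:
      metric:
        # Drop the two metrics that trigger the duplicate-timestamp rejections.
        - 'name == "target_info"'
        - 'name == "otel_scope_info"'

The processor would then need to be added to the metrics pipeline's processors list. This only suppresses the symptom, though; it doesn't explain why the same timestamp is written twice.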
Collector config:
receivers:
  # Open two OTLP servers:
  # - On port 4317, open an OTLP GRPC server
  # - On port 4318, open an OTLP HTTP server
  #
  # Docs:
  # https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver/otlpreceiver
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        cors:
          # This effectively allows any origin
          # to make requests to the HTTP server.
          allowed_origins:
            - http://*
            - https://*
        endpoint: localhost:4318
processors:
  # The batch processor is in place to regulate both the number of requests
  # being made and the size of those requests.
  #
  # Docs:
  # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor
  batch:
    # Batch metrics before sending to reduce API usage.
    # Configured to batch telemetry requests at the Google Cloud maximum
    # number of entries per request, or at the Google Cloud minimum interval
    # of every 5 seconds (whichever comes first).
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 5s
  # The memory_limiter will check the memory usage of the collector process.
  #
  # Docs:
  # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/memorylimiterprocessor
  memory_limiter:
    # Drop metrics if memory usage gets too high.
    check_interval: 1s
    limit_percentage: 65
    spike_limit_percentage: 20
  # The resourcedetection processor is configured to detect GCP resources.
  # Resource attributes that represent the GCP resource the collector is
  # running on will be attached to all telemetry that goes through this
  # processor.
  #
  # Docs:
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor#gcp-metadata
  resourcedetection:
    # Automatically detect Cloud Run resource metadata.
    detectors: [env, gcp]
    timeout: 2s
    override: false
  # The transform/collision processor ensures that any attributes that may
  # collide with the googlemanagedprometheus exporter's monitored resource
  # construction are moved to a similar name that is not reserved.
  transform/collision:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")
  resource:
    attributes:
      # Add instance_id as a resource attribute.
      - key: service.instance.id
        from_attribute: faas.id
        action: upsert
      # Parse the service name from the K_SERVICE Cloud Run variable.
      - key: service.name
        value: ${env:K_SERVICE}
        action: insert
exporters:
  # The googlemanagedprometheus exporter will send metrics to
  # Google Managed Service for Prometheus.
  #
  # Docs:
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/googlemanagedprometheusexporter
  googlemanagedprometheus: # Note: this is intentionally left blank
extensions:
  # Opens an endpoint on 13133 that can be used to check the
  # status of the collector. Since this does not configure the
  # `path` config value, the endpoint will default to `/`.
  #
  # When running on Cloud Run, this extension is required, not optional.
  # In other environments it is recommended but may not be required for
  # operation (i.e. in Container-Optimized OS or other GCE environments).
  #
  # Docs:
  # https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions:
    - health_check
  pipelines:
    metrics/otlp:
      receivers:
        - otlp
      processors:
        - transform/collision
        - resourcedetection
        - memory_limiter
        - batch
        - resource
      exporters:
        - googlemanagedprometheus
  # Internal telemetry for the collector supports both push- and pull-based
  # telemetry data transmission. Leveraging the pre-configured OTLP receiver
  # eliminates the need for an additional port.
  #
  # Docs:
  # https://opentelemetry.io/docs/collector/internal-telemetry/
  telemetry:
    metrics:
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: grpc
                endpoint: localhost:4317
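One thing I considered tuning: if the duplicate points come from internal telemetry being flushed more often than Cloud Monitoring's minimum sampling period, the periodic reader's export interval could be made explicit. This is a sketch assuming the collector's telemetry section accepts the SDK-style interval field (in milliseconds); I haven't confirmed this distribution honors it:

  telemetry:
    metrics:
      readers:
        - periodic:
            # Assumed field: flush internal metrics at most once per 60s,
            # above Cloud Monitoring's minimum sampling period.
            interval: 60000
            exporter:
              otlp:
                protocol: grpc
                endpoint: localhost:4317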
Other information:
- Image: us-docker.pkg.dev/cloud-ops-agents-artifacts/google-cloud-opentelemetry-collector/otelcol-google:0.128.0
- CPU: 0.1
- Memory: 128MB