chen-anders commented Oct 12, 2025

addresses: #1931

This PR significantly enhances the debugging experience for OTLP exporters by:

  1. Adding rich context to export failure results
  2. Introducing comprehensive debug-level logging throughout the export pipeline
  3. Maintaining full backwards compatibility with existing exporter implementations

These changes helped me debug a particularly gnarly issue: a slightly old version of the sentry-ruby SDK, through incorrect IPv6 parsing, interfered with how the OpenTelemetry Ruby SDK surfaced errors. As a result, all of my traces were dropped with nothing but a one-line error: Unable to export X spans.

Reviewer's Note

Significant AI assistance was used in the process of getting this PR working.

Motivation

Previously, when OTLP exports failed, developers had minimal information to diagnose the root cause. The exporters simply returned a FAILURE constant without any context about:

  • What type of error occurred
  • HTTP response codes and messages
  • Response bodies from the collector
  • Retry attempts and their outcomes
  • Exception details

This made troubleshooting production issues extremely difficult, especially for:

  • Network connectivity problems
  • SSL/TLS certificate issues
  • Collector endpoint configuration errors
  • HTTP timeout scenarios
  • Server-side errors (4xx/5xx responses)

Changes

1. Enhanced Export Result Type (sdk/lib/opentelemetry/sdk/trace/export.rb)

Introduced a new ExportResult class that wraps result codes with optional error context:

class ExportResult
  attr_reader :code, :error, :message

  # Factory methods:
  #   ExportResult.success
  #   ExportResult.failure(error: nil, message: nil)
  #   ExportResult.timeout
end

Backwards Compatibility: ExportResult overrides the == operator and provides to_i so that existing code comparing results to the SUCCESS, FAILURE, or TIMEOUT constants continues to work unchanged.
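A minimal, self-contained sketch of how a result object can keep comparing equal to the legacy Integer codes. The Export module below is a stand-in, and the values 0, 1 and 2 for SUCCESS, FAILURE and TIMEOUT are assumptions; this is an illustration, not the PR's exact implementation.

```ruby
# Stand-in for OpenTelemetry::SDK::Trace::Export; the integer values are assumed.
module Export
  SUCCESS = 0
  FAILURE = 1
  TIMEOUT = 2

  # Wraps a legacy result code together with optional failure context.
  class ExportResult
    attr_reader :code, :error, :message

    def initialize(code, error: nil, message: nil)
      @code = code
      @error = error
      @message = message
    end

    def self.success
      new(SUCCESS)
    end

    def self.failure(error: nil, message: nil)
      new(FAILURE, error: error, message: message)
    end

    def self.timeout
      new(TIMEOUT)
    end

    # Equal to a bare Integer code, or to another ExportResult with the same code,
    # so checks like `result == Export::SUCCESS` keep working in existing callers.
    def ==(other)
      case other
      when Integer then code == other
      when ExportResult then code == other.code
      else false
      end
    end

    # Explicit conversion for callers that need the raw Integer code.
    def to_i
      code
    end
  end
end

result = Export::ExportResult.failure(message: 'HTTP 503 (Service Unavailable)')
puts result == Export::FAILURE # prints "true"
puts result.to_i               # prints "1"
```

Equality in this sketch deliberately looks only at code, so two failures with different messages compare equal. That is what keeps `result == FAILURE` checks working, but note that hash and eql? are not overridden, so results will not match Integers when used as Hash keys.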

2. Comprehensive Debug Logging

Added detailed debug-level logging at key points in the export pipeline:

Entry/Exit Points

  • Function entry with parameters (span count, timeout values)
  • Function exit with return values
  • Byte sizes (compressed vs uncompressed)

HTTP Request Flow

  • Request preparation and compression
  • Timeout calculations and retry counts
  • HTTP response codes and messages
  • Response bodies for error cases

Exception Handling

  • Exception type and message for all caught exceptions
  • Retry attempt tracking
  • Max retry exceeded scenarios
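The exception-handling points above can be sketched as a small retry wrapper that logs each caught exception at debug level and keeps the last exception for the failure result instead of discarding it. The names with_retries and RetryFailure, and the attempt-counting semantics, are hypothetical; backoff and overall timeout handling are deliberately left out.

```ruby
require 'logger'
require 'stringio'

# Hypothetical failure container used only by this sketch.
RetryFailure = Struct.new(:error, :message)

# Runs the block, retrying up to max_retries extra times when it raises.
# Each caught exception is logged at debug level; once retries are exhausted,
# the last exception is returned inside a RetryFailure.
def with_retries(logger, max_retries:)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    logger.debug { "send_bytes: Caught #{e.class}: #{e.message}, attempt=#{attempts}" }
    retry if attempts <= max_retries
    logger.debug { "send_bytes: Max retries exceeded for #{e.class}" }
    RetryFailure.new(e, "export failed due to #{e.class} after #{attempts} attempts: #{e.message}")
  end
end

log = StringIO.new
logger = Logger.new(log, level: Logger::DEBUG)
result = with_retries(logger, max_retries: 2) do
  raise Errno::ECONNREFUSED, 'connect(2) for "localhost" port 4318'
end
puts log.string     # three "Caught" lines, then "Max retries exceeded"
puts result.message # the failure keeps the original exception's message
```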

Examples of new debug logs:

OpenTelemetry.logger.debug("OTLP::Exporter#export: Called with #{span_data.size} spans, timeout=#{timeout.inspect}")
OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Compressed size=#{body.bytesize} bytes")
OpenTelemetry.logger.debug("OTLP::Exporter#send_bytes: Received response code=#{response.code}, message=#{response.message}")
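The byte-size logging around compression can be sketched with only the standard library. compress_with_logging is a hypothetical helper, and the repeated string stands in for an encoded export request.

```ruby
require 'logger'
require 'stringio'
require 'zlib'

# Hypothetical helper: gzip an encoded payload and log both sizes at debug level.
# The block form of Logger#debug only builds the message when DEBUG is enabled.
def compress_with_logging(bytes, logger)
  logger.debug { "send_bytes: Uncompressed size=#{bytes.bytesize} bytes" }
  body = Zlib.gzip(bytes)
  logger.debug { "send_bytes: Compressed size=#{body.bytesize} bytes" }
  body
end

payload = 'span-data ' * 100 # stand-in for an encoded request (1000 bytes)

verbose = StringIO.new
compress_with_logging(payload, Logger.new(verbose, level: Logger::DEBUG))
puts verbose.string # both size lines appear

quiet = StringIO.new
compress_with_logging(payload, Logger.new(quiet, level: Logger::INFO))
puts quiet.string.empty? # prints "true"
```

The block form avoids building the interpolated string on the export path when debug logging is off. Plain string arguments, as in the examples above, produce the same output but always pay the interpolation cost.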

3. Rich Failure Context

All failure scenarios now return detailed context via Export.failure():

HTTP Error Responses

OpenTelemetry::SDK::Trace::Export.failure(
  message: "export failed with HTTP #{response.code} (#{response.message}) after #{retry_count} retries: #{body}"
)

Network Exceptions

OpenTelemetry::SDK::Trace::Export.failure(
  error: e,
  message: "export failed due to SocketError after #{retry_count} retries: #{e.message}"
)

Timeout Scenarios

OpenTelemetry::SDK::Trace::Export.failure(
  message: 'timeout exceeded before sending request'
)

4. Enhanced BatchSpanProcessor Error Reporting

Updated BatchSpanProcessor to extract and log error context:

def report_result(result_code, span_array, error: nil, message: nil)
  if result_code == SUCCESS
    # ... metrics ...
  else
    if error
      OpenTelemetry.logger.error("BatchSpanProcessor: export failed due to #{error.class}: #{error.message}")
    elsif message
      OpenTelemetry.logger.error("BatchSpanProcessor: export failed: #{message}")
    else
      OpenTelemetry.logger.error('BatchSpanProcessor: export failed (no error details available)')
      OpenTelemetry.logger.error("BatchSpanProcessor: call stack:\n#{caller.join("\n")}")
    end
  end
end
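The processor-side change can be sketched as a small adapter that accepts either a legacy Integer code or a result object and hands the code, error and message to the reporting logic above. unpack_result, RichResult, the explicit logger parameter, the assumed code values and the success-path debug line are hypothetical; only the three failure-message formats come from the PR.

```ruby
require 'logger'
require 'stringio'

SUCCESS = 0 # assumed legacy result codes
FAILURE = 1

# Hypothetical result object carrying optional failure context.
RichResult = Struct.new(:code, :error, :message)

# Normalizes a legacy Integer code or a richer result into [code, error, message],
# so exporters that still return bare constants keep working unchanged.
def unpack_result(result)
  return [result, nil, nil] if result.is_a?(Integer)

  [result.code, result.error, result.message]
end

# Simplified report_result that takes a logger explicitly (hypothetical signature).
def report_result(logger, result, span_count)
  code, error, message = unpack_result(result)
  if code == SUCCESS
    logger.debug("BatchSpanProcessor: exported #{span_count} spans")
  elsif error
    logger.error("BatchSpanProcessor: export failed due to #{error.class}: #{error.message}")
  elsif message
    logger.error("BatchSpanProcessor: export failed: #{message}")
  else
    logger.error('BatchSpanProcessor: export failed (no error details available)')
  end
end

log = StringIO.new
logger = Logger.new(log)
report_result(logger, FAILURE, 10) # legacy exporter: no details available
report_result(logger, RichResult.new(FAILURE, nil, 'HTTP 503 (Service Unavailable)'), 10)
puts log.string
```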

5. Updated Exporters

Applied consistent changes to both:

  • OTLP default Exporter (exporter/otlp/lib/opentelemetry/exporter/otlp/exporter.rb)
  • OTLP HTTP Exporter (exporter/otlp-http/lib/opentelemetry/exporter/otlp/http/trace_exporter.rb)

Both now capture exception objects and maintain the error context through the entire export pipeline.

Example Scenarios

Before

ERROR -- : OpenTelemetry error: Unable to export 10 spans

After (with debug logging enabled)

DEBUG -- : OTLP::Exporter#export: Called with 10 spans, timeout=30.0
DEBUG -- : OTLP::Exporter#encode: Starting encode of 10 spans
DEBUG -- : OTLP::Exporter#encode: Successfully encoded to 1247 bytes
DEBUG -- : OTLP::Exporter#send_bytes: Compressed size=423 bytes
DEBUG -- : OTLP::Exporter#send_bytes: Sending HTTP request
DEBUG -- : OTLP::Exporter#send_bytes: Caught SocketError: Connection refused, retry_count=1
DEBUG -- : OTLP::Exporter#send_bytes: Max retries exceeded for SocketError
ERROR -- : BatchSpanProcessor: export failed due to SocketError: Connection refused - connect(2) for "localhost" port 4318
ERROR -- : OpenTelemetry error: Unable to export 10 spans

chen-anders force-pushed the anders/improve-debugging-ux branch from 3096a1c to 0dbb1a8 on October 12, 2025 13:38