OpenTelemetry tracing integration #26

@evanvolgas

Overview

Integrate OpenTelemetry distributed tracing for production observability.

Background

From STAFF_REVIEW.md: "Can you trace a request through all abstraction layers?"

The current stack has 5+ abstraction layers:

User Request
  └── Conduit Router
      └── PydanticAI Agent
          └── OpenAI/Anthropic SDK
              └── HTTP client
                  └── Provider API

Goals

  • Trace requests end-to-end across all layers
  • Identify latency bottlenecks in production
  • Debug issues across distributed components
  • Monitor bandit algorithm decision-making

Implementation

1. Install OpenTelemetry

pip install opentelemetry-api opentelemetry-sdk
pip install opentelemetry-instrumentation-fastapi  # if using FastAPI
pip install opentelemetry-instrumentation-httpx    # for HTTP calls
pip install opentelemetry-exporter-otlp           # for export

2. Instrument Conduit Router

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

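# Instrument the routing decision.
# `bandit` and `context` are the router's existing bandit instance and request context.
with tracer.start_as_current_span("bandit.select_arm") as span:
    arm = bandit.select_arm(context)
    # Dotted attribute names match the scheme in "Attributes to Capture"
    span.set_attribute("model.selected", arm.model_name)
    span.set_attribute("bandit.ucb_score", arm.score)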

3. Key Spans to Instrument

Routing Layer:

  • conduit.route - Overall routing decision
  • bandit.select_arm - Arm selection logic
  • bandit.update - Reward feedback update
  • embeddings.generate - Query embedding generation

Execution Layer:

  • model.execute - LLM API call
  • evaluation.score - Quality evaluation with Arbiter
  • cost.calculate - Cost tracking

Persistence Layer:

  • db.save_state - Bandit state persistence
  • db.load_state - State recovery

4. Attributes to Capture

span.set_attribute("query.text", query[:100])  # Truncate to limit exposure; this does not remove PII
span.set_attribute("query.category", category)
span.set_attribute("query.complexity", complexity)
span.set_attribute("model.selected", model_name)
span.set_attribute("model.cost", cost)
span.set_attribute("model.latency_ms", latency)
span.set_attribute("quality.score", quality)
span.set_attribute("bandit.algorithm", algo_name)
span.set_attribute("bandit.exploration", is_exploration)
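One way to keep these attributes consistent across spans is to build them in a single helper. The sketch below is an assumption about how that could look, not existing Conduit code: `build_span_attributes` and `MAX_QUERY_CHARS` are hypothetical names. It drops unset values because OpenTelemetry attribute values must be strings, booleans, numbers, or sequences of those, so `None` is not valid.

```python
from typing import Dict, Optional, Union

AttrValue = Union[str, bool, int, float]

MAX_QUERY_CHARS = 100  # hypothetical limit, matching the query[:100] truncation above


def build_span_attributes(
    query: str,
    model_name: str,
    cost: Optional[float] = None,
    latency_ms: Optional[float] = None,
    quality: Optional[float] = None,
    is_exploration: Optional[bool] = None,
) -> Dict[str, AttrValue]:
    """Collect span attributes in one place and drop anything that is unset."""
    raw = {
        "query.text": query[:MAX_QUERY_CHARS],
        "model.selected": model_name,
        "model.cost": cost,
        "model.latency_ms": latency_ms,
        "quality.score": quality,
        "bandit.exploration": is_exploration,
    }
    # None is not a valid attribute value, so unset fields are omitted.
    return {key: value for key, value in raw.items() if value is not None}


attrs = build_span_attributes("x" * 250, "model-a", cost=0.002, is_exploration=False)
print(sorted(attrs))
```

The resulting dict can be attached in one call with `span.set_attributes(attrs)`.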

5. Export to Observability Backend

Choose backend:

  • Jaeger (self-hosted, good for development)
  • Honeycomb (SaaS, excellent UX)
  • Datadog (enterprise, full APM)
  • Grafana Tempo (open-source, cost-effective)

Recommendation: Jaeger for development, Honeycomb for production.
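Switching between the development and production backends can be reduced to choosing the exporter's constructor arguments. The sketch below is illustrative only: `CONDUIT_TRACE_BACKEND` and `HONEYCOMB_API_KEY` are hypothetical environment variable names, and the endpoints (a local Jaeger accepting OTLP over gRPC on port 4317, Honeycomb at `api.honeycomb.io:443` with an `x-honeycomb-team` header) are assumptions to verify against each backend's documentation.

```python
import os
from typing import Any, Dict, Mapping


def otlp_exporter_kwargs(env: Mapping[str, str]) -> Dict[str, Any]:
    """Return keyword arguments for OTLPSpanExporter based on the chosen backend."""
    backend = env.get("CONDUIT_TRACE_BACKEND", "jaeger")  # hypothetical variable name
    if backend == "jaeger":
        # Assumes a local Jaeger instance with OTLP/gRPC ingestion enabled.
        return {"endpoint": "localhost:4317", "insecure": True}
    if backend == "honeycomb":
        api_key = env["HONEYCOMB_API_KEY"]  # hypothetical variable name
        return {
            "endpoint": "api.honeycomb.io:443",
            "headers": (("x-honeycomb-team", api_key),),
        }
    raise ValueError(f"unknown trace backend: {backend!r}")


dev_kwargs = otlp_exporter_kwargs({})
print(dev_kwargs)
print(otlp_exporter_kwargs(os.environ)["endpoint"] if "HONEYCOMB_API_KEY" in os.environ
      or os.environ.get("CONDUIT_TRACE_BACKEND", "jaeger") == "jaeger" else "honeycomb key not set")
```

The result would be passed as `OTLPSpanExporter(**otlp_exporter_kwargs(os.environ))` in the setup from step 2, which keeps the instrumentation code identical across environments.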

Success Criteria

  • OpenTelemetry instrumentation in conduit_bench/tracing.py
  • All key operations listed in section 3 instrumented
  • Trace export to Jaeger/Honeycomb working
  • Can visualize full request flow in trace UI
  • Latency breakdown by layer visible
  • Documentation in docs/OBSERVABILITY.md
  • Example traces in docs

Example Trace Visualization

Request [200ms total]
├─ conduit.route [180ms]
│  ├─ embeddings.generate [50ms]
│  ├─ bandit.select_arm [5ms]
│  │  └─ linucb.compute_ucb [4ms]
│  └─ model.execute [120ms]
│     └─ openai.chat.completions [115ms]
└─ evaluation.score [20ms]
   └─ arbiter.semantic_similarity [18ms]
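The latency breakdown by layer can be derived from such a trace by computing each span's self time: its duration minus the durations of its direct children. The sketch below uses the numbers from the example trace and assumes child spans run sequentially without overlap; `SpanNode` and `self_times` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanNode:
    name: str
    duration_ms: int
    children: List["SpanNode"] = field(default_factory=list)


def self_times(node: SpanNode, out: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Map each span name to the time spent in that span excluding its children."""
    if out is None:
        out = {}
    out[node.name] = node.duration_ms - sum(c.duration_ms for c in node.children)
    for child in node.children:
        self_times(child, out)
    return out


# The example trace above, transcribed as a tree.
trace_tree = SpanNode("Request", 200, [
    SpanNode("conduit.route", 180, [
        SpanNode("embeddings.generate", 50),
        SpanNode("bandit.select_arm", 5, [SpanNode("linucb.compute_ucb", 4)]),
        SpanNode("model.execute", 120, [SpanNode("openai.chat.completions", 115)]),
    ]),
    SpanNode("evaluation.score", 20, [SpanNode("arbiter.semantic_similarity", 18)]),
])

breakdown = self_times(trace_tree)
for name, ms in sorted(breakdown.items(), key=lambda item: -item[1]):
    print(f"{name}: {ms}ms")
```

For this trace the provider call (`openai.chat.completions`, 115ms) and embedding generation (50ms) dominate, while the routing logic itself accounts for only a few milliseconds.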

Priority

MEDIUM - Essential for production debugging, not blocking research

Difficulty

Intermediate - Requires observability platform knowledge

Labels: difficulty:intermediate, enhancement, priority:medium
