OpenTelemetry Wide Events Architecture Implementation Plan #481

@aryanranderiya

Executive Summary

Implement OpenTelemetry-based wide-event logging for the GAIA FastAPI backend, relying on automatic instrumentation so that routes need no boilerplate.

Key Innovations

  1. Native Context Propagation: on Python 3.12, asyncio tasks inherit the current contextvars context automatically, so no wrappers are needed
  2. Zero Boilerplate: wide events are exposed via request.state.wide_event, with no manual get_*() calls
  3. Auto Database Instrumentation: OpenTelemetry auto-instrumentation covers database calls, with no decorators
  4. Pattern-Based Approach: Learn principles, apply to any route
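The context propagation described in item 1 can be illustrated with the standard library alone. This is a minimal sketch, not the plan's implementation: `wide_event_var`, `load_user`, and `handle_request` are hypothetical names, and a plain dict stands in for the wide event.

```python
import asyncio
import contextvars

# Hypothetical holder for the per-request wide event (illustrative name).
wide_event_var: contextvars.ContextVar[dict] = contextvars.ContextVar("wide_event")


async def load_user(user_id: str) -> None:
    # No wrapper and no extra argument: the task inherited the caller's context.
    wide_event_var.get()["user.id"] = user_id


async def handle_request(path: str) -> dict:
    event = {"http.route": path}
    wide_event_var.set(event)
    # create_task copies the current context, so the child sees the same dict.
    await asyncio.create_task(load_user("u_123"))
    return event


async def main() -> list[dict]:
    # gather runs each request in its own task, so the events stay separate.
    return await asyncio.gather(
        handle_request("/chat"), handle_request("/payments")
    )


events = asyncio.run(main())
print(events)
```

Work handed to a thread pool through `loop.run_in_executor` does not inherit the context in the same way, so that path would likely still need explicit handling.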

Timeline

  • Foundation + Infrastructure: 2-3 days
  • Route Instrumentation: 2-3 weeks (all routes, applying patterns)
  • Testing & Refinement: 3-5 days
  • Total: 3-4 weeks with 2 developers

📋 Plan Sections

The full implementation plan includes:

  1. Trace Backend Selection - Comparison of Jaeger, Tempo, Honeycomb, DataDog, GCP Cloud Trace with cost estimates
  2. Architecture Overview - Core components, file structure, context propagation
  3. Critical Implementation Files - Detailed breakdown of observability package
  4. Business Context Guide - How to identify and log business context by domain
  5. Route Logging Patterns - 7 patterns for different route types
  6. Implementation Phases - Phase 0-3 with deliverables
  7. Migration Strategy - How to transition from 2,020 Loguru log statements
  8. Error Handling & Degradation - Circuit breakers, graceful degradation, health monitoring
  9. LangGraph Agent Instrumentation - Full callback implementation for agent tracing
  10. Security & PII Handling - Sanitization, GDPR compliance, security checklist
  11. Team Onboarding - Developer guide, code review checklist, training agenda
  12. Monitoring OTel Health - Metrics, alerting for the instrumentation itself
  13. Query Examples & Dashboards - Real queries and dashboard recommendations
  14. GCP Deployment Integration - Build args, metadata extraction, GitHub Actions
  15. Configuration - Environment variables, sampling strategy
  16. Verification & Testing - End-to-end trace verification
  17. Success Metrics - Measurable outcomes
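For item 8, a circuit breaker around the export call is one way to keep a failing trace backend from affecting requests. The sketch below is an assumption about what such a breaker could look like, not the plan's code: `ExportCircuitBreaker`, its thresholds, and the `export` callable are all hypothetical.

```python
import time
from typing import Callable, Optional


class ExportCircuitBreaker:
    """Skips export calls for a cool-down period after repeated failures."""

    def __init__(
        self,
        max_failures: int = 3,          # assumed threshold
        reset_after_s: float = 30.0,    # assumed cool-down
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, export: Callable[[dict], None], event: dict) -> bool:
        """Return True if the event was exported, False if it was dropped."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return False  # open: drop telemetry instead of blocking the request
            self.opened_at = None  # half-open: allow a single trial call
        try:
            export(event)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the breaker
            return False
        self.failures = 0  # success closes the breaker
        return True
```

Dropping events while the breaker is open trades telemetry completeness for request latency, which matches the graceful-degradation goal named in the list.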

💡 Trace Backend Recommendation

Development: Jaeger (local Docker Compose, free)

Production: Grafana Tempo + GCP Cloud Trace hybrid

  • Cost: $40-60/month (Tempo) + Free tier (Cloud Trace)
  • Rationale: High cardinality support, tail sampling, cost-effective

Alternative: Honeycomb ($500-1,000/month with aggressive sampling)

  • Benefit: Best-in-class query experience, faster debugging

🎯 Quick Wins

Week 1: Foundation

  • Deploy OTel middleware
  • Auto-instrument databases (MongoDB, PostgreSQL, Redis)
  • Wide events automatically capture environment, request, user context
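The automatic capture in the last bullet could work roughly as follows. This is a dependency-free ASGI sketch under stated assumptions: `WideEventMiddleware`, the `ENV` variable, the attribute names, and the `emit` sink are hypothetical, and `scope["state"]["wide_event"]` stands in for `request.state.wide_event`.

```python
import asyncio
import json
import os
import time


class WideEventMiddleware:
    """Pure-ASGI sketch that builds one wide event per HTTP request."""

    def __init__(self, app, emit=print):
        self.app = app
        self.emit = emit  # hypothetical sink; a real setup would feed a span

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        event = {
            "env.name": os.environ.get("ENV", "development"),  # assumed variable
            "http.method": scope["method"],
            "http.path": scope["path"],
        }
        # Handlers enrich the same dict through the request state.
        scope.setdefault("state", {})["wide_event"] = event
        started = time.perf_counter()

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                event["http.status_code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        except Exception as exc:
            event["error.type"] = type(exc).__name__
            raise
        finally:
            event["duration_ms"] = round((time.perf_counter() - started) * 1000, 2)
            self.emit(json.dumps(event))  # exactly one event per request


async def demo_app(scope, receive, send):
    scope["state"]["wide_event"]["user.id"] = "u_123"  # user/business context
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def run_demo() -> list:
    emitted = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass

    app = WideEventMiddleware(demo_app, emit=emitted.append)
    await app({"type": "http", "method": "GET", "path": "/chat"}, receive, send)
    return emitted


emitted = asyncio.run(run_demo())
print(emitted[0])
```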

Weeks 2-3: High-Impact Routes

  • Chat/conversation routes (high volume)
  • Payment routes (high business value)
  • Workflow routes (complex debugging needs)

Week 4: Validation & Rollout

  • Compare wide events vs Loguru logs
  • Verify end-to-end traces
  • Gradual rollout with sampling
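One possible shape for the sampling decision during rollout: keep every error and slow request, and keep a stable fraction of everything else. The function name, the 10% ratio, and the 1000 ms threshold below are illustrative assumptions, not values from the plan.

```python
def should_keep(
    trace_id: int,
    status_code: int,
    duration_ms: float,
    sample_ratio: float = 0.1,   # assumed rollout ratio
    slow_ms: float = 1000.0,     # assumed "slow request" threshold
) -> bool:
    """Keep all errors and slow requests; keep a stable fraction of the rest."""
    if status_code >= 500 or duration_ms >= slow_ms:
        return True
    # Compare the low 64 bits of the trace ID with a ratio-derived threshold,
    # so the same trace always gets the same decision.
    threshold = int(sample_ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < threshold
```

Raising `sample_ratio` step by step (for example 0.01, then 0.1, then 1.0) gives a gradual rollout without changing which traces were already being kept.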

πŸ“ Full Plan Location

Full plan: /home/aryan/.claude/plans/dreamy-discovering-porcupine.md

The full plan is ~1300 lines with:

  • Complete code examples for all patterns
  • Full LangGraph callback implementation
  • Sanitization functions with GDPR compliance
  • Real query examples for Honeycomb/Grafana/DataDog
  • Dashboard recommendations
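The full plan's sanitization functions are not reproduced in this issue. As a rough illustration of the idea only, the sketch below redacts values under sensitive keys and masks email addresses before data is attached to a wide event. The key list, the regex, and the placeholder strings are assumptions, and this alone does not amount to GDPR compliance.

```python
import re

# Assumed list of keys whose values must never reach the trace backend.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def sanitize(value, key: str = ""):
    """Return a copy of value that is safer to attach to a wide event."""
    if key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: sanitize(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


raw = {
    "user": {"email": "a@b.com", "password": "hunter2"},
    "note": "mail bob@example.org now",
    "attempts": 3,
}
print(sanitize(raw))
```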

✅ Next Steps

  1. Review and approve this plan
  2. Set up trace backend (Jaeger for dev, Tempo/Cloud Trace for prod)
  3. Phase 0: Implement foundation (2-3 days)
  4. Phase 1: Auth middleware enrichment (1 day)
  5. Phase 2: Apply patterns to all routes (2-3 weeks)
  6. Phase 3: Testing and refinement (3-5 days)
  7. Deploy with gradual rollout

📊 Success Metrics

  • 100% route coverage (all routes emit wide events)
  • Business context present in all domain operations
  • End-to-end traces work across async boundaries
  • Context propagates through Redis pub/sub
  • ARQ workers receive and restore trace context
  • Performance impact <5% latency increase
  • Span cardinality: fewer than 1,000 unique span names
  • Trace backend operational with <1% error rate
  • Team trained on adding business context
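The Redis pub/sub and ARQ worker metrics above depend on carrying trace context inside the message or job payload. The sketch below hand-rolls the W3C `traceparent` format to show the round trip. It is an illustration under assumptions: the `inject` and `extract` helpers here are hypothetical stand-ins, and a real setup would more likely use OpenTelemetry's propagation API than string formatting.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context: version "00", 32-hex trace ID, 16-hex parent ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(payload: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Producer side: copy the payload and attach a traceparent value."""
    flags = "01" if sampled else "00"
    return {**payload, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def extract(payload: dict) -> Optional[Tuple[str, str, bool]]:
    """Worker side: return (trace_id, parent_span_id, sampled), or None."""
    match = TRACEPARENT_RE.match(str(payload.get("traceparent", "")))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid in the spec
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
job = inject({"task": "send_email"}, trace_id, span_id)  # enqueue or publish this
restored = extract(job)           # what the worker or subscriber would recover
```

A worker that gets `None` back can start a fresh trace, so a missing or malformed header degrades to an unlinked trace rather than a failed job.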
