Open
Labels
enhancement (New feature or request)
Description
Executive Summary
Implement OpenTelemetry-based wide-event logging for the GAIA FastAPI backend, relying on automatic instrumentation so that routes need no per-call logging boilerplate.
Key Innovations
- Native Context Propagation: on Python 3.12, contextvars-based context follows each request across await boundaries and into asyncio tasks automatically - no wrappers needed
- Zero Boilerplate: Wide events via `request.state.wide_event` - no manual `get_*()` calls
- Auto Database Instrumentation: OpenTelemetry auto-instrumentation - no decorators
- Pattern-Based Approach: Learn principles, apply to any route
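
To make the context-propagation point concrete, here is a minimal standard-library sketch, not the plan's actual implementation. The names `_wide_event`, `add_context`, `load_user`, and `handle_request` are hypothetical; in the real backend the event would presumably live on `request.state.wide_event` and be set by middleware.

```python
import asyncio
import contextvars

# Hypothetical name: holds the wide-event dict for the current request.
_wide_event: contextvars.ContextVar[dict] = contextvars.ContextVar("wide_event")


def add_context(**fields) -> None:
    """Attach business fields to the current request's wide event."""
    _wide_event.get().update(fields)


async def load_user(user_id: str) -> None:
    # Runs in a child task. asyncio copies the caller's context into the task,
    # so the same event dict is reachable here without being passed explicitly.
    add_context(user_id=user_id, plan="pro")


async def handle_request(path: str) -> dict:
    event = {"path": path}
    _wide_event.set(event)  # stand-in for what a middleware would do
    await asyncio.create_task(load_user("u_123"))
    add_context(status_code=200)
    return event


async def main() -> list[dict]:
    # Two concurrent requests each run in their own task and keep separate events.
    return await asyncio.gather(handle_request("/chat"), handle_request("/pay"))


events = asyncio.run(main())
```

Each request ends up with one dictionary holding everything learned along the way, which is the shape a wide event needs before it is emitted once at the end of the request.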
Timeline
- Foundation + Infrastructure: 2-3 days
- Route Instrumentation: 2-3 weeks (all routes, applying patterns)
- Testing & Refinement: 3-5 days
- Total: 3-4 weeks with 2 developers
Plan Sections
The full implementation plan includes:
- Trace Backend Selection - Comparison of Jaeger, Tempo, Honeycomb, DataDog, GCP Cloud Trace with cost estimates
- Architecture Overview - Core components, file structure, context propagation
- Critical Implementation Files - Detailed breakdown of observability package
- Business Context Guide - How to identify and log business context by domain
- Route Logging Patterns - 7 patterns for different route types
- Implementation Phases - Phase 0-3 with deliverables
- Migration Strategy - How to transition from 2,020 Loguru log statements
- Error Handling & Degradation - Circuit breakers, graceful degradation, health monitoring
- LangGraph Agent Instrumentation - Full callback implementation for agent tracing
- Security & PII Handling - Sanitization, GDPR compliance, security checklist
- Team Onboarding - Developer guide, code review checklist, training agenda
- Monitoring OTel Health - Metrics, alerting for the instrumentation itself
- Query Examples & Dashboards - Real queries and dashboard recommendations
- GCP Deployment Integration - Build args, metadata extraction, GitHub Actions
- Configuration - Environment variables, sampling strategy
- Verification & Testing - End-to-end trace verification
- Success Metrics - Measurable outcomes
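
The "Error Handling & Degradation" item mentions circuit breakers. As a rough illustration of the idea (the full plan's version may differ), the sketch below stops calling a failing exporter after repeated errors and retries after a cool-down, so telemetry failures never break request handling. `ExportCircuitBreaker`, `export_safely`, and all thresholds are assumptions made for this example.

```python
import time


class ExportCircuitBreaker:
    """Skips telemetry export while the backend looks unhealthy."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before retrying
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let attempts through again (half-open state).
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()


def export_safely(breaker, export, event) -> bool:
    """Try to export one event; never raise into the request path."""
    if not breaker.allow():
        return False  # degraded: drop the event instead of blocking
    try:
        export(event)
    except Exception:
        breaker.record_failure()
        return False
    breaker.record_success()
    return True
```

The injectable `clock` is only there to make the behaviour easy to test; a production version would also want to count dropped events so the degradation itself is visible.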
Trace Backend Recommendation
Development: Jaeger (local Docker Compose, free)
Production: Grafana Tempo + GCP Cloud Trace hybrid
- Cost: $40-60/month (Tempo) + Free tier (Cloud Trace)
- Rationale: High cardinality support, tail sampling, cost-effective
Alternative: Honeycomb ($500-1,000/month with aggressive sampling)
- Benefit: Best-in-class query experience, faster debugging
Quick Wins
Week 1: Foundation
- Deploy OTel middleware
- Auto-instrument databases (MongoDB, PostgreSQL, Redis)
- Wide events automatically capture environment, request, user context
Week 2-3: High-Impact Routes
- Chat/conversation routes (high volume)
- Payment routes (high business value)
- Workflow routes (complex debugging needs)
Week 4: Validation & Rollout
- Compare wide events vs Loguru logs
- Verify end-to-end traces
- Gradual rollout with sampling
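
One way the sampled rollout could work is sketched below; the issue does not specify the actual strategy. The decision is deterministic per trace ID, so every service keeps or drops the same trace, and errors are always kept. The function name `should_keep` and the choice to use the low 64 bits of the trace ID are assumptions for illustration, and the decision is assumed to be made once the request outcome is known.

```python
def should_keep(trace_id_hex: str, sample_rate: float, is_error: bool = False) -> bool:
    """Decide whether to keep a trace, given its 32-hex-digit trace ID."""
    if is_error:
        return True  # never sample away failures
    # Treat the low 64 bits of the trace ID as a uniformly distributed integer
    # and keep the trace if it falls below the rate's share of that range.
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < sample_rate * 2**64


# Raising sample_rate step by step (for example 0.01, then 0.1, then 1.0)
# widens the kept set without dropping traces that were already being kept.
kept_at_half = should_keep("4bf92f3577b34da6a3ce929d0e0e4736", 0.5)
```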
Full Plan Location
Full plan: /home/aryan/.claude/plans/dreamy-discovering-porcupine.md
The full plan is ~1,300 lines and includes:
- Complete code examples for all patterns
- Full LangGraph callback implementation
- Sanitization functions with GDPR compliance
- Real query examples for Honeycomb/Grafana/DataDog
- Dashboard recommendations
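
The sanitization functions themselves are in the full plan and are not reproduced in this issue. As a hedged sketch of the general technique only, the example below redacts values under sensitive keys and masks email addresses before an event is exported. `SENSITIVE_KEYS`, `EMAIL_RE`, and `sanitize` are hypothetical names, and this alone does not amount to GDPR compliance.

```python
import re

# Assumed key names; a real list would come from a review of the event schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize(value, key=""):
    """Return a copy of value that is safe to attach to a wide event."""
    if key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: sanitize(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(v, key) for v in value]
    if isinstance(value, str):
        # Mask email addresses wherever they appear in free text.
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


clean = sanitize({
    "user": {"email": "a@b.com", "password": "x"},
    "msgs": ["mail me at jo@ex.org", 3],
})
```

Running the sanitizer at a single point just before export keeps route code free of redaction logic.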
Next Steps
- Review and approve this plan
- Set up trace backend (Jaeger for dev, Tempo/Cloud Trace for prod)
- Phase 0: Implement foundation (2-3 days)
- Phase 1: Auth middleware enrichment (1 day)
- Phase 2: Apply patterns to all routes (2-3 weeks)
- Phase 3: Testing and refinement (3-5 days)
- Deploy with gradual rollout
Success Metrics
- 100% route coverage (all routes emit wide events)
- Business context present in all domain operations
- End-to-end traces work across async boundaries
- Context propagates through Redis pub/sub
- ARQ workers receive and restore trace context
- Performance impact <5% latency increase
- Span-name cardinality <1000 unique span names
- Trace backend operational with <1% error rate
- Team trained on adding business context
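
The Redis pub/sub and ARQ items depend on carrying trace context inside the message, because contextvars do not cross process boundaries. The sketch below shows the idea with a W3C `traceparent` string in a JSON envelope. The envelope layout and the local `inject`/`extract` helpers are assumptions for illustration; in practice the `opentelemetry.propagate.inject` and `extract` functions would normally do this against a dict carrier.

```python
import json
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(payload: dict, trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Producer side: wrap the payload together with the current trace context."""
    flags = "01" if sampled else "00"
    envelope = {
        "traceparent": f"00-{trace_id}-{span_id}-{flags}",
        "payload": payload,
    }
    return json.dumps(envelope)


def extract(message: str):
    """Worker side: return (trace context or None, payload)."""
    envelope = json.loads(message)
    match = TRACEPARENT_RE.match(envelope.get("traceparent", ""))
    if match is None:
        # Missing or malformed header: the worker would start a fresh trace.
        return None, envelope.get("payload")
    trace_id, parent_span_id, flags = match.groups()
    context = {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }
    return context, envelope["payload"]


message = inject(
    {"job": "send_email"},
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
context, payload = extract(message)
```

A worker that restores this context before starting its own span makes the background job appear as a child of the originating request, which is what the end-to-end trace metric checks.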