# SPEC-12: OpenTelemetry Observability

## Why

We need comprehensive observability for basic-memory-cloud to:
- Track request flows across our multi-tenant architecture (MCP → Cloud → API services)
- Debug performance issues and errors in production
- Understand user behavior and system usage patterns
- Correlate issues to specific tenants for targeted debugging
- Monitor service health and latency across the distributed system

Currently, we only have basic logging without request correlation or distributed tracing capabilities.

## What

Implement OpenTelemetry instrumentation across all basic-memory-cloud services with:

### Core Requirements
1. **Distributed Tracing**: End-to-end request tracing from MCP gateway through to tenant API instances
2. **Tenant Correlation**: All traces tagged with tenant_id, user_id, and workos_user_id
3. **Service Identification**: Clear service naming and namespace separation
4. **Auto-instrumentation**: Automatic tracing for FastAPI, SQLAlchemy, HTTP clients
5. **Grafana Cloud Integration**: Direct OTLP export to Grafana Cloud Tempo

### Services to Instrument
- **MCP Gateway** (basic-memory-mcp): Entry point with JWT extraction
- **Cloud Service** (basic-memory-cloud): Provisioning and management operations
- **API Service** (basic-memory-api): Tenant-specific instances
- **Worker Processes** (ARQ workers): Background job processing

### Key Trace Attributes
- `tenant.id`: UUID from UserProfile.tenant_id
- `user.id`: WorkOS user identifier
- `user.email`: User email for debugging
- `service.name`: Specific service identifier
- `service.namespace`: Environment (development/production)
- `operation.type`: Business operation (provision/update/delete)
- `tenant.app_name`: Fly.io app name for tenant instances

## How

### Phase 1: Setup OpenTelemetry SDK
1. Add OpenTelemetry dependencies to each service's pyproject.toml:
   ```toml
   "opentelemetry-distro[otlp]>=1.29.0",
   "opentelemetry-instrumentation-fastapi>=0.50b0",
   "opentelemetry-instrumentation-httpx>=0.50b0",
   "opentelemetry-instrumentation-sqlalchemy>=0.50b0",
   "opentelemetry-instrumentation-logging>=0.50b0",
   ```

2. Create a shared telemetry initialization module, `apps/shared/telemetry.py` (see the sketch after this list)

3. Configure the Grafana Cloud OTLP endpoint via environment variables:
   ```bash
   OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-2.grafana.net/otlp
   OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic[token]
   OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
   ```
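
A minimal sketch of what the shared `apps/shared/telemetry.py` module could look like, assuming the endpoint, headers, and protocol come from the `OTEL_EXPORTER_OTLP_*` variables above (the exporter reads them automatically); the `init_telemetry` name and signature are illustrative:

```python
# apps/shared/telemetry.py (sketch; exact wiring may differ per service)
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def init_telemetry(app, service_name: str, namespace: str) -> None:
    """Configure tracing for one service; endpoint and headers come from OTEL_* env vars."""
    resource = Resource.create({
        "service.name": service_name,
        "service.namespace": namespace,
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
    trace.set_tracer_provider(provider)

    # Auto-instrument the web framework, outbound HTTP calls, and SQLAlchemy.
    FastAPIInstrumentor.instrument_app(app)
    HTTPXClientInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument()
```

Each FastAPI app would call something like `init_telemetry(app, "basic-memory-mcp", "production")` at startup.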

### Phase 2: Instrument MCP Gateway
1. Extract tenant context from the AuthKit JWT in middleware
2. Create a root span with tenant attributes (see the sketch after this list)
3. Propagate trace context to downstream services via headers
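
A rough sketch of steps 1 and 2, assuming the JWT has already been verified upstream; the claim names and the `request.state.jwt_claims` location are placeholders for whatever AuthKit actually provides:

```python
# Sketch only: claim names and where the verified JWT lives are assumptions.
from opentelemetry import trace
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request


class TenantContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        # The FastAPI auto-instrumentation opens the server span; enrich it
        # with tenant attributes extracted from the verified AuthKit JWT.
        claims = getattr(request.state, "jwt_claims", {}) or {}
        span = trace.get_current_span()
        span.set_attribute("tenant.id", claims.get("tenant_id", "unknown"))
        span.set_attribute("user.id", claims.get("sub", "unknown"))
        span.set_attribute("user.email", claims.get("email", "unknown"))
        # Step 3 needs no extra code here: outbound httpx calls are
        # auto-instrumented, so W3C traceparent headers are injected
        # automatically for the downstream Cloud/API services.
        return await call_next(request)
```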

### Phase 3: Instrument Cloud Service
1. Continue the trace from the MCP gateway
2. Add operation-specific attributes (provisioning events)
3. Instrument ARQ worker jobs for async operations (see the sketch after this list)
4. Track Fly.io API calls and latency
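
ARQ jobs run outside any HTTP request, so they need an explicit span; a hedged sketch follows (job name, arguments, and attribute values are illustrative):

```python
# Sketch: manual span around an ARQ job; names and arguments are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("basic-memory-cloud.worker")


async def provision_tenant(ctx, tenant_id: str, app_name: str) -> None:
    with tracer.start_as_current_span("worker.provision_tenant") as span:
        span.set_attribute("tenant.id", tenant_id)
        span.set_attribute("tenant.app_name", app_name)
        span.set_attribute("operation.type", "provision")
        # Fly.io API calls made with httpx inside this block are picked up by
        # the httpx auto-instrumentation and appear as child spans with latency.
        ...
```

Linking this span to the request that enqueued the job would additionally require passing the trace context through the job payload.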

### Phase 4: Instrument API Service
1. Extract tenant context from the JWT
2. Add machine-specific metadata (instance ID, region; see the sketch after this list)
3. Instrument database operations with SQLAlchemy
4. Track MCP protocol operations
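
Fly.io exposes machine identity through runtime environment variables, so the machine-specific metadata can be folded into the API service's resource attributes. A small sketch (the non-standard attribute key and the idea of merging this into the shared module's resource are assumptions):

```python
# Sketch: extra resource attributes for the API service, built from Fly.io env vars.
import os

from opentelemetry.sdk.resources import Resource

# These could be merged into the Resource created by apps/shared/telemetry.py.
fly_attributes = Resource.create({
    "cloud.region": os.getenv("FLY_REGION", "unknown"),      # e.g. "iad"
    "host.id": os.getenv("FLY_MACHINE_ID", "unknown"),       # Fly Machine ID
    "tenant.app_name": os.getenv("FLY_APP_NAME", "unknown"),
})
```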

### Phase 5: Configure and Deploy
1. Add OTLP configuration to `.env.example` and `.env.example.secrets`
2. Set Fly.io secrets for production deployment
3. Update Dockerfiles to use the `opentelemetry-instrument` wrapper
4. Deploy to the development environment first for testing

## How to Evaluate

### Success Criteria
1. **End-to-end traces visible in Grafana Cloud** showing complete request flow
2. **Tenant filtering works** - Can filter traces by tenant_id to see all requests for a user
3. **Service maps accurate** - Grafana shows correct service dependencies
4. **Performance overhead < 5%** - Minimal latency impact from instrumentation
5. **Error correlation** - Can trace errors back to specific tenant and operation

### Testing Checklist
- [x] Single request creates connected trace across all services
- [x] Tenant attributes present on all spans
- [x] Background jobs (ARQ) appear in traces
- [x] Database queries show in trace timeline
- [x] HTTP calls to Fly.io API tracked
- [x] Traces exported successfully to Grafana Cloud
- [x] Can search traces by tenant_id in Grafana
- [x] Service dependency graph shows correct flow

### Monitoring Success
- All services reporting traces to Grafana Cloud
- No OTLP export errors in logs
- Trace sampling working correctly (if implemented)
- Resource usage acceptable (CPU/memory)

## Dependencies
- Grafana Cloud account with OTLP endpoint configured
- OpenTelemetry Python SDK v1.29.0+
- FastAPI instrumentation compatibility
- Network access from Fly.io to Grafana Cloud

## Implementation Assignment
**Recommended Agent**: python-developer
- Requires Python/FastAPI expertise
- Needs understanding of distributed systems
- Must implement middleware and context propagation
- Should understand OpenTelemetry SDK and instrumentation

## Follow-up Tasks

### Enhanced Log Correlation
While basic trace-to-log correlation works automatically via the OpenTelemetry logging instrumentation, consider adding structured logging for improved log filtering:

1. **Structured Logging Context**: Add `logger.bind()` calls to inject tenant/user context directly into log records
2. **Custom Loguru Formatter**: Extract OpenTelemetry span attributes for better log readability
3. **Direct Log Filtering**: Enable searching logs directly by tenant_id or workflow_id without going through traces

This would complement the existing automatic trace correlation and provide better log search capabilities; a rough sketch follows.
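
A rough sketch of items 1 and 2, assuming Loguru is the logging library in use (as referenced above); the patcher and attribute names are illustrative:

```python
# Sketch: bind tenant context per call site and stamp each record with the trace id.
from loguru import logger
from opentelemetry import trace


def add_trace_context(record) -> None:
    """Loguru patcher: copy the active OpenTelemetry trace id into every record."""
    ctx = trace.get_current_span().get_span_context()
    record["extra"]["trace_id"] = f"{ctx.trace_id:032x}" if ctx.is_valid else "-"


logger.configure(patcher=add_trace_context)

# Per-request/per-job usage: downstream log lines carry tenant and workflow ids,
# so logs can be filtered directly without going through traces first.
log = logger.bind(tenant_id="example-tenant-id", workflow_id="example-workflow-id")
log.info("provisioning started")
```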

## Alternative Solution: Logfire

After implementing OpenTelemetry with Grafana Cloud, we discovered limitations in the observability experience:
- Traces work but lack useful context without correlated logs
- Setting up log correlation with Grafana is complex and requires additional infrastructure
- The developer experience for Python observability is suboptimal

### Logfire Evaluation

**Pydantic Logfire** offers a compelling alternative that addresses our specific requirements:

#### Core Requirements Match
- ✅ **User Activity Tracking**: Automatic request tracing with business context
- ✅ **Error Monitoring**: Built-in exception tracking with full context
- ✅ **Performance Metrics**: Automatic latency and performance monitoring
- ✅ **Request Tracing**: Native distributed tracing across services
- ✅ **Log Correlation**: Seamless trace-to-log correlation without setup

#### Key Advantages
1. **Python-First Design**: Built specifically for Python/FastAPI applications by the Pydantic team
2. **Simple Integration**: `pip install logfire` + `logfire.configure()` vs the complex OTLP setup (see the sketch after this list)
3. **Automatic Correlation**: Logs automatically include trace context without manual configuration
4. **Real-time SQL Interface**: Query spans and logs using SQL with auto-completion
5. **Better Developer UX**: Purpose-built observability UI vs generic Grafana dashboards
6. **Loguru Integration**: `logger.configure(handlers=[logfire.loguru_handler()])` maintains existing logging
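
As a sketch of what that integration might look like for one service (the `instrument_*` helpers and `LOGFIRE_TOKEN` handling follow Logfire's documented defaults; the service name is illustrative):

```python
# Sketch: Logfire setup for a single FastAPI service, keeping the Loguru pipeline.
import logfire
from fastapi import FastAPI
from loguru import logger

app = FastAPI()

logfire.configure(service_name="basic-memory-api")  # token read from LOGFIRE_TOKEN
logfire.instrument_fastapi(app)   # request/response spans
logfire.instrument_httpx()        # outbound calls, e.g. to the Fly.io API
# Route existing Loguru logs into Logfire so they correlate with traces.
logger.configure(handlers=[logfire.loguru_handler()])
```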

#### Pricing Assessment
- **Free Tier**: 10M spans/month (suitable for development and small production workloads)
- **Transparent Pricing**: $1 per million spans/metrics after the free tier
- **No Hidden Costs**: No per-host fees, only usage-based metering
- **Production Ready**: Recently exited beta, enterprise features available

#### Migration Path
The existing OpenTelemetry instrumentation is compatible - Logfire uses OpenTelemetry under the hood, so the current spans and attributes would work unchanged.

### Recommendation

**Consider migrating to Logfire** for the following reasons:
1. It directly addresses the problem of traces lacking useful context by pairing them with integrated logs
2. Dramatically simpler setup and maintenance compared to Grafana Cloud plus custom log correlation
3. Better ROI on the observability investment with purpose-built Python tooling
4. Free tier sufficient for current development needs with a clear scaling path

The current Grafana Cloud implementation provides a solid foundation and could remain as a backup/export target, while Logfire becomes the primary observability platform.

## Status
**Created**: 2024-01-28
**Status**: Completed (OpenTelemetry + Grafana Cloud)
**Next Phase**: Evaluate Logfire migration
**Priority**: High - Critical for production observability