Commit 53fb13b (parent bd6c834)

fix: Make project creation endpoint idempotent (#357)

Authored by phernandez and Claude
Signed-off-by: phernandez <[email protected]>
Co-authored-by: Claude <[email protected]>

13 files changed: +2180 −844 lines

specs/SPEC-10 Unified Deployment Workflow and Event Tracking.md

Lines changed: 569 additions & 0 deletions
Large diffs are not rendered by default.

specs/SPEC-11 Basic Memory API Performance Optimization.md

Lines changed: 10 additions & 69 deletions
@@ -31,8 +31,6 @@ HTTP requests to the API suffer from 350ms-2.6s latency overhead **before** any
 
 This creates compounding effects with tenant auto-start delays and increases timeout risk in cloud deployments.
 
-Github issue: https://github.com/basicmachines-co/basic-memory-cloud/issues/82
-
 ## What
 
 This optimization affects the **core basic-memory repository** components:
@@ -170,76 +168,19 @@ Validation Checklist
 - Documentation: Performance optimization documented in README
 - Cloud Integration: basic-memory-cloud sees performance benefits
 
-## Implementation Status ✅ COMPLETED
-
-**Implementation Date**: 2025-09-26
-**Branch**: `feature/spec-11-api-performance-optimization`
-**Commit**: `771f60b`
-
-### ✅ Phase 1: Database Connection Caching - IMPLEMENTED
-
-**Files Modified:**
-- `src/basic_memory/api/app.py` - Added database connection caching in app.state
-- `src/basic_memory/deps.py` - Updated get_engine_factory() to use cached connections
-- `src/basic_memory/config.py` - Added skip_initialization_sync configuration flag
-
-**Implementation Details:**
-1. **API Lifespan Caching**: Database engine and session_maker cached in app.state during startup
-2. **Dependency Injection Optimization**: get_engine_factory() now returns cached connections instead of calling get_or_create_db()
-3. **Project Reconciliation Removal**: Eliminated expensive reconcile_projects_with_config() from API startup
-4. **CLI Fallback Preserved**: Non-API contexts continue to work with fallback database initialization
-
-### ✅ Performance Validation - ACHIEVED
-
-**Live Testing Results** (2025-09-26 14:03-14:09):
-
-| Operation | Before | After | Improvement |
-|-----------|--------|-------|-------------|
-| `read_note` | 350ms-2.6s | **20ms** | **95-99% faster** |
-| `edit_note` | 350ms-2.6s | **218ms** | **75-92% faster** |
-| `search_notes` | 350ms-2.6s | **<500ms** | **Responsive** |
-| `list_memory_projects` | N/A | **<100ms** | **Fast** |
-
-**Key Achievements:**
-- **95-99% improvement** in read operations (primary workflow)
-- **75-92% improvement** in edit operations
-- **Zero overhead** for project switching
-- **Database connection overhead eliminated** (0ms vs 50-100ms)
-- **Project reconciliation delays removed** from API requests
-- **<500ms target achieved** for all operations except write (which includes file sync)
-
-### ✅ Backwards Compatibility - MAINTAINED
-
-- All existing functionality preserved
-- CLI operations unaffected
-- Fallback for non-API contexts maintained
-- No breaking changes to existing APIs
-- Optional configuration with safe defaults
-
-### ✅ Testing Validation - PASSED
-
-- Integration tests passing
-- Type checking clear
-- Linting checks passed
-- Live testing with real MCP tools successful
-- Multi-project workflows validated
-- Rapid project switching validated
-
-## Notes
+Notes
 
 Implementation Priority:
-- Phase 1 COMPLETED: Database connection caching provides 95%+ performance gains
-- Phase 2 NOT NEEDED: Project reconciliation removal achieved the goals
-- Phase 3 INCLUDED: skip_initialization_sync flag added
+- Phase 1 provides 80% of performance gains and should be implemented first
+- Phase 2 provides remaining 20% and addresses edge cases
+- Phase 3 is optional for maximum cloud optimization
 
 Risk Mitigation:
-- All changes backwards compatible implemented
-- Gradual implementation successful (Phase 1 → validation)
-- Easy rollback via configuration flags available
+- All changes backwards compatible
+- Gradual rollout possible (Phase 1 → 2 → 3)
+- Easy rollback via configuration flags
 
 Cloud Integration:
-- ✅ This optimization directly addresses basic-memory-cloud issue #82
-- ✅ Changes in core basic-memory will benefit all cloud tenants
-- ✅ No changes needed in basic-memory-cloud itself
-
-**Result**: SPEC-11 performance optimizations successfully implemented and validated. The 95-99% improvement in MCP tool response times exceeds the original 50-80% target, providing exceptional performance gains for cloud deployments and local usage.
+- This optimization directly addresses basic-memory-cloud issue #82
+- Changes in core basic-memory will benefit all cloud tenants
+- No changes needed in basic-memory-cloud itself
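The Phase 1 caching this diff refers to (engine and session_maker created once at startup, cached on app.state, with a fallback for non-API contexts) can be sketched with a stdlib stand-in. `AppState`, `lifespan_startup`, and `get_engine` are illustrative names, not the project's actual API:

```python
import sqlite3

class AppState:
    """Stand-in for FastAPI's app.state, populated once in the lifespan hook."""
    def __init__(self) -> None:
        self.db: sqlite3.Connection | None = None

def lifespan_startup(state: AppState) -> None:
    # Expensive connection setup happens once at startup, not per request.
    state.db = sqlite3.connect(":memory:")
    state.db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

def get_engine(state: AppState) -> sqlite3.Connection:
    # Request dependency: return the cached connection.
    if state.db is None:
        lifespan_startup(state)  # fallback initialization for CLI/non-API contexts
    return state.db

state = AppState()
lifespan_startup(state)
# Every "request" reuses the same cached connection instead of reconnecting.
assert get_engine(state) is get_engine(state)
```

The point of the design is that per-request work drops to a dictionary-style attribute lookup, while the fallback branch preserves behavior for contexts that never ran the startup hook.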
Lines changed: 182 additions & 0 deletions
# SPEC-12: OpenTelemetry Observability

## Why

We need comprehensive observability for basic-memory-cloud to:
- Track request flows across our multi-tenant architecture (MCP → Cloud → API services)
- Debug performance issues and errors in production
- Understand user behavior and system usage patterns
- Correlate issues to specific tenants for targeted debugging
- Monitor service health and latency across the distributed system

Currently, we have only basic logging, without request correlation or distributed tracing.

## What

Implement OpenTelemetry instrumentation across all basic-memory-cloud services with:

### Core Requirements
1. **Distributed Tracing**: End-to-end request tracing from the MCP gateway through to tenant API instances
2. **Tenant Correlation**: All traces tagged with tenant_id, user_id, and workos_user_id
3. **Service Identification**: Clear service naming and namespace separation
4. **Auto-instrumentation**: Automatic tracing for FastAPI, SQLAlchemy, and HTTP clients
5. **Grafana Cloud Integration**: Direct OTLP export to Grafana Cloud Tempo

### Services to Instrument
- **MCP Gateway** (basic-memory-mcp): Entry point with JWT extraction
- **Cloud Service** (basic-memory-cloud): Provisioning and management operations
- **API Service** (basic-memory-api): Tenant-specific instances
- **Worker Processes** (ARQ workers): Background job processing

### Key Trace Attributes
- `tenant.id`: UUID from UserProfile.tenant_id
- `user.id`: WorkOS user identifier
- `user.email`: User email for debugging
- `service.name`: Specific service identifier
- `service.namespace`: Environment (development/production)
- `operation.type`: Business operation (provision/update/delete)
- `tenant.app_name`: Fly.io app name for tenant instances

## How

### Phase 1: Set Up the OpenTelemetry SDK
1. Add OpenTelemetry dependencies to each service's pyproject.toml:
```toml
"opentelemetry-distro[otlp]>=1.29.0",
"opentelemetry-instrumentation-fastapi>=0.50b0",
"opentelemetry-instrumentation-httpx>=0.50b0",
"opentelemetry-instrumentation-sqlalchemy>=0.50b0",
"opentelemetry-instrumentation-logging>=0.50b0",
```

2. Create a shared telemetry initialization module (`apps/shared/telemetry.py`)

3. Configure the Grafana Cloud OTLP endpoint via environment variables:
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-2.grafana.net/otlp
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic[token]
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
```
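Per the OpenTelemetry specification, `OTEL_EXPORTER_OTLP_HEADERS` is a comma-separated list of `key=value` pairs. The SDK parses this internally; a stdlib sketch of the parsing (the `parse_otlp_headers` helper is illustrative, not an SDK function):

```python
import os

def parse_otlp_headers(raw: str) -> dict:
    """Parse OTEL_EXPORTER_OTLP_HEADERS: comma-separated key=value pairs."""
    headers = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        # Split on the FIRST '=' only; values like base64 tokens may contain '='.
        key, _, value = pair.partition("=")
        headers[key.strip()] = value.strip()
    return headers

# Example mirroring the Authorization header configured above:
os.environ.setdefault("OTEL_EXPORTER_OTLP_HEADERS", "Authorization=Basic <token>")
print(parse_otlp_headers(os.environ["OTEL_EXPORTER_OTLP_HEADERS"]))
# → {'Authorization': 'Basic <token>'}
```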
### Phase 2: Instrument MCP Gateway
1. Extract tenant context from the AuthKit JWT in middleware
2. Create a root span with tenant attributes
3. Propagate trace context to downstream services via headers
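Step 3 rests on W3C Trace Context: the gateway attaches a `traceparent` header (`version-traceid-spanid-flags`) to outbound requests so downstream services continue the same trace. The OTel propagators handle this automatically; a stdlib sketch of the header format:

```python
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"             # sampled bit
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header back into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}

# A downstream service parses the inbound header and reuses the trace_id,
# which is what links its spans to the gateway's root span.
header = make_traceparent()
assert parse_traceparent(header)["trace_id"] == header.split("-")[1]
```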
### Phase 3: Instrument Cloud Service
1. Continue the trace from the MCP gateway
2. Add operation-specific attributes (provisioning events)
3. Instrument ARQ worker jobs for async operations
4. Track Fly.io API calls and latency
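Conceptually, instrumenting a worker job means opening a span around the job and attaching business attributes such as `operation.type` and latency. A pure-Python sketch of that shape (`SPANS`, `traced_job`, and `provision_tenant` are illustrative stand-ins, not ARQ or OpenTelemetry APIs):

```python
import functools
import time

SPANS = []  # stand-in for the span exporter

def traced_job(operation_type: str):
    """Decorator that records a span-like record around a background job."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"   # errors stay attributable to the operation
                raise
            finally:
                SPANS.append({
                    "name": fn.__name__,
                    "operation.type": operation_type,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    "status": status,
                })
        return wrapper
    return decorator

@traced_job("provision")
def provision_tenant(tenant_id: str) -> str:
    return f"app-{tenant_id}"

provision_tenant("t-123")
print(SPANS[0]["operation.type"])  # → provision
```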
### Phase 4: Instrument API Service
1. Extract tenant context from the JWT
2. Add machine-specific metadata (instance ID, region)
3. Instrument database operations with SQLAlchemy
4. Track MCP protocol operations
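For step 1, only the token's claims are needed; signature verification has already happened in the auth middleware. A stdlib sketch of copying tenant attributes out of a JWT payload (claim names such as `tenant_id` are assumptions about the AuthKit token shape, not confirmed by the spec):

```python
import base64
import json

def jwt_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (verification
    is the auth middleware's job); used only to copy claims onto the span."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

def span_attributes(token: str) -> dict:
    claims = jwt_claims(token)
    return {
        "tenant.id": claims.get("tenant_id"),   # hypothetical claim name
        "user.id": claims.get("sub"),
        "user.email": claims.get("email"),
    }

# Build a toy unsigned token just to demonstrate the decoding:
header = base64.urlsafe_b64encode(b'{"alg":"none"}').rstrip(b"=").decode()
payload = base64.urlsafe_b64encode(
    json.dumps({"sub": "user_1", "tenant_id": "t-42", "email": "a@b.co"}).encode()
).rstrip(b"=").decode()
token = f"{header}.{payload}."
print(span_attributes(token)["tenant.id"])  # → t-42
```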
### Phase 5: Configure and Deploy
1. Add OTLP configuration to `.env.example` and `.env.example.secrets`
2. Set Fly.io secrets for production deployment
3. Update Dockerfiles to use the `opentelemetry-instrument` wrapper
4. Deploy to the development environment first for testing
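Step 3 typically means launching the server under the `opentelemetry-instrument` wrapper so auto-instrumentation is applied at startup. A sketch of what the container entrypoint might look like; the module path and port are assumptions, not taken from the repo:

```bash
# Identify the service via standard OTel env vars, then launch uvicorn
# under the auto-instrumentation wrapper from opentelemetry-distro.
export OTEL_SERVICE_NAME=basic-memory-api
export OTEL_RESOURCE_ATTRIBUTES=service.namespace=production
opentelemetry-instrument uvicorn basic_memory.api.app:app --host 0.0.0.0 --port 8000
```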
## How to Evaluate

### Success Criteria
1. **End-to-end traces visible in Grafana Cloud**, showing the complete request flow
2. **Tenant filtering works**: traces can be filtered by tenant_id to see all requests for a user
3. **Service maps accurate**: Grafana shows the correct service dependencies
4. **Performance overhead < 5%**: minimal latency impact from instrumentation
5. **Error correlation**: errors can be traced back to a specific tenant and operation

### Testing Checklist
- [x] Single request creates a connected trace across all services
- [x] Tenant attributes present on all spans
- [x] Background jobs (ARQ) appear in traces
- [x] Database queries show in the trace timeline
- [x] HTTP calls to the Fly.io API tracked
- [x] Traces exported successfully to Grafana Cloud
- [x] Can search traces by tenant_id in Grafana
- [x] Service dependency graph shows the correct flow

### Monitoring Success
- All services reporting traces to Grafana Cloud
- No OTLP export errors in logs
- Trace sampling working correctly (if implemented)
- Resource usage acceptable (CPU/memory)
## Dependencies
- Grafana Cloud account with OTLP endpoint configured
- OpenTelemetry Python SDK v1.29.0+
- FastAPI instrumentation compatibility
- Network access from Fly.io to Grafana Cloud

## Implementation Assignment
**Recommended Agent**: python-developer
- Requires Python/FastAPI expertise
- Needs understanding of distributed systems
- Must implement middleware and context propagation
- Should understand the OpenTelemetry SDK and instrumentation

## Follow-up Tasks

### Enhanced Log Correlation
While basic trace-to-log correlation works automatically via the OpenTelemetry logging instrumentation, consider adding structured logging for improved log filtering:

1. **Structured Logging Context**: Add `logger.bind()` calls to inject tenant/user context directly into log records
2. **Custom Loguru Formatter**: Extract OpenTelemetry span attributes for better log readability
3. **Direct Log Filtering**: Enable searching logs directly by tenant_id or workflow_id without going through traces

This would complement the existing automatic trace correlation and provide better log search capabilities.
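The "Direct Log Filtering" idea can be prototyped with the stdlib alone. The services use Loguru, so this contextvars-plus-Filter sketch is an analogy for what `logger.bind()` would achieve, not the proposed implementation:

```python
import contextvars
import io
import logging

# Set once per request by auth middleware; read by every log record.
tenant_id_var = contextvars.ContextVar("tenant_id", default="-")

class TenantContextFilter(logging.Filter):
    """Inject the current tenant id into every record so logs can be
    searched by tenant directly, without going through traces."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.tenant_id = tenant_id_var.get()
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s tenant=%(tenant_id)s %(message)s"))
logger = logging.getLogger("basic-memory-cloud-demo")
logger.addHandler(handler)
logger.addFilter(TenantContextFilter())
logger.setLevel(logging.INFO)

tenant_id_var.set("t-42")          # middleware would do this per request
logger.info("provisioning started")
print(buf.getvalue().strip())       # → INFO tenant=t-42 provisioning started
```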
## Alternative Solution: Logfire

After implementing OpenTelemetry with Grafana Cloud, we discovered limitations in the observability experience:
- Traces work but lack useful context without correlated logs
- Setting up log correlation with Grafana is complex and requires additional infrastructure
- The developer experience for Python observability is suboptimal

### Logfire Evaluation

**Pydantic Logfire** offers a compelling alternative that addresses our specific requirements:

#### Core Requirements Match
- **User Activity Tracking**: Automatic request tracing with business context
- **Error Monitoring**: Built-in exception tracking with full context
- **Performance Metrics**: Automatic latency and performance monitoring
- **Request Tracing**: Native distributed tracing across services
- **Log Correlation**: Seamless trace-to-log correlation without setup

#### Key Advantages
1. **Python-First Design**: Built specifically for Python/FastAPI applications by the Pydantic team
2. **Simple Integration**: `pip install logfire` plus `logfire.configure()` vs a complex OTLP setup
3. **Automatic Correlation**: Logs automatically include trace context without manual configuration
4. **Real-time SQL Interface**: Query spans and logs using SQL with auto-completion
5. **Better Developer UX**: Purpose-built observability UI vs generic Grafana dashboards
6. **Loguru Integration**: `logger.configure(handlers=[logfire.loguru_handler()])` keeps the existing logging

#### Pricing Assessment
- **Free Tier**: 10M spans/month (suitable for development and small production workloads)
- **Transparent Pricing**: $1 per million spans/metrics after the free tier
- **No Hidden Costs**: No per-host fees, only usage-based metering
- **Production Ready**: Recently exited beta, enterprise features available

#### Migration Path
The existing OpenTelemetry instrumentation is compatible: Logfire uses OpenTelemetry under the hood, so the current spans and attributes would work unchanged.

### Recommendation

**Consider migrating to Logfire** for the following reasons:
1. It directly addresses the "next to useless" traces problem by providing integrated logs
2. Dramatically simpler setup and maintenance compared to Grafana Cloud plus custom log correlation
3. Better ROI on the observability investment with purpose-built Python tooling
4. The free tier is sufficient for current development needs, with a clear scaling path

The current Grafana Cloud implementation provides a solid foundation and could remain as a backup/export target, while Logfire becomes the primary observability platform.

## Status
**Created**: 2024-01-28
**Status**: Completed (OpenTelemetry + Grafana Cloud)
**Next Phase**: Evaluate Logfire migration
**Priority**: High - Critical for production observability
