This document outlines the path from the current working prototype to a production-ready system.
The current prototype is a working single-user system with:
- SQLite database (knowledge graph + article storage)
- FastAPI web UI
- RSS feed ingestion
- SEC Form D filings via edgartools
- spaCy NER + LLM extraction
- Basic entity enrichment via web search
Problem: Entities like "SpaceX Dec 2025 a Series of Witz Ventures LLC" should resolve to "SpaceX"
Implementation:
- Create canonical entity table with aliases
- Implement ML-based entity linking (using sentence-transformers)
- Add parent-child company relationships
- Cross-reference with known company databases
Files to modify:
- src/knowledge_graph/graph.py - Add entity resolution layer
- src/extraction/ - Add entity normalization post-processing
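The canonical entity table with aliases could be sketched as below. This is a minimal illustration using the standard-library `sqlite3` module; the table names, column names, and the helpers `add_entity` and `resolve` are hypothetical and are not taken from the existing codebase. It covers only exact alias lookup; the ML-based linking step would sit behind `resolve` as a fallback for names that match no stored alias.

```python
import sqlite3

# Hypothetical schema: one row per canonical entity, many aliases per entity.
SCHEMA = """
CREATE TABLE canonical_entity (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,
    parent_id INTEGER REFERENCES canonical_entity(id)  -- parent company, if any
);
CREATE TABLE entity_alias (
    alias     TEXT PRIMARY KEY,                        -- stored lowercased
    entity_id INTEGER NOT NULL REFERENCES canonical_entity(id)
);
"""


def add_entity(conn, name, aliases=(), parent_id=None):
    """Insert a canonical entity and register its name plus aliases."""
    cur = conn.execute(
        "INSERT INTO canonical_entity (name, parent_id) VALUES (?, ?)",
        (name, parent_id),
    )
    entity_id = cur.lastrowid
    # Lowercase before de-duplicating so "SpaceX" and "spacex" do not collide.
    for alias in {a.lower() for a in (name, *aliases)}:
        conn.execute(
            "INSERT INTO entity_alias (alias, entity_id) VALUES (?, ?)",
            (alias, entity_id),
        )
    return entity_id


def resolve(conn, raw_name):
    """Return the canonical name for a raw extracted name, or None if unknown."""
    row = conn.execute(
        "SELECT e.name FROM entity_alias a "
        "JOIN canonical_entity e ON e.id = a.entity_id "
        "WHERE a.alias = ?",
        (raw_name.strip().lower(),),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_entity(
    conn,
    "SpaceX",
    aliases=["SpaceX Dec 2025 a Series of Witz Ventures LLC"],
)
```

With this in place, `resolve(conn, "SpaceX Dec 2025 a Series of Witz Ventures LLC")` returns the canonical name, and unknown names return `None` so they can be routed to the linking model or to review.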
Problem: Manual "Run Pipeline" button doesn't scale
Implementation:
- Add APScheduler or Celery for background jobs
- Cron-style scheduling (hourly for news, daily for SEC)
- Job status tracking and history
- Failure alerting
Files to create:
- src/scheduler/ - New module for job scheduling
- src/pipeline/jobs.py - Individual job definitions
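Job status tracking and failure alerting are independent of which scheduler is chosen, so they can be sketched without APScheduler or Celery. The following is a rough, standard-library-only illustration; `JobRun`, `JobRunner`, and the `on_failure` hook are hypothetical names. The chosen scheduler would call `runner.run(...)` on its hourly or daily trigger.

```python
import traceback
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class JobRun:
    """One execution of a job, kept for status tracking and history."""
    job_name: str
    started_at: datetime
    finished_at: Optional[datetime] = None
    status: str = "running"  # "running", "success", or "failed"
    error: Optional[str] = None


class JobRunner:
    def __init__(self, on_failure: Optional[Callable[[JobRun], None]] = None):
        self.history = []             # all runs, oldest first
        self.on_failure = on_failure  # alert hook, e.g. post to Slack

    def run(self, job_name, func):
        """Execute func, record the outcome, and alert on failure."""
        run = JobRun(job_name, started_at=datetime.now(timezone.utc))
        self.history.append(run)
        try:
            func()
            run.status = "success"
        except Exception:
            run.status = "failed"
            run.error = traceback.format_exc()
        finally:
            run.finished_at = datetime.now(timezone.utc)
        if run.status == "failed" and self.on_failure is not None:
            self.on_failure(run)
        return run
```

Catching the exception inside `run` keeps one failing job from taking down the scheduler process, while the stored traceback preserves enough detail for the alert.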
Problem: Hardcoded values scattered throughout
Implementation:
- Centralize all config in src/config/settings.py
- Environment-based configuration (dev/staging/prod)
- Secrets management (API keys, credentials)
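One possible shape for a centralized settings module is sketched below, using only the standard library. The environment variable names (`APP_ENV`, `DATABASE_URL`, `LLM_API_KEY`, `RSS_POLL_MINUTES`) and the `Settings` fields are assumptions made for illustration, not names from the current code. Secrets are read from the environment and never given defaults outside dev.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    env: str
    database_url: str
    llm_api_key: str
    rss_poll_minutes: int


def load_settings(environ=None):
    """Build Settings from environment variables (hypothetical names)."""
    environ = os.environ if environ is None else environ
    env = environ.get("APP_ENV", "dev")
    if env not in ("dev", "staging", "prod"):
        raise ValueError(f"unknown APP_ENV: {env!r}")

    database_url = environ.get("DATABASE_URL")
    llm_api_key = environ.get("LLM_API_KEY")
    if env == "dev":
        # Convenience defaults are allowed only for local development.
        database_url = database_url or "sqlite:///dev.db"
        llm_api_key = llm_api_key or ""

    required = (("DATABASE_URL", database_url), ("LLM_API_KEY", llm_api_key))
    missing = [name for name, value in required if value is None]
    if missing:
        raise RuntimeError(f"missing required settings for {env}: {missing}")

    return Settings(
        env=env,
        database_url=database_url,
        llm_api_key=llm_api_key,
        rss_poll_minutes=int(environ.get("RSS_POLL_MINUTES", "60")),
    )
```

Failing fast on missing values in staging and prod surfaces configuration mistakes at startup rather than in the middle of a pipeline run.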
Problem: SQLite allows only one writer at a time, so concurrent writes block each other
Implementation:
- Create SQLAlchemy models
- Migration scripts (Alembic)
- Connection pooling
- Read replicas for queries
Files to modify:
- src/storage/ - Replace SQLite with PostgreSQL
- src/knowledge_graph/kg_store.py - Update queries
Problem: Repeated expensive queries
Implementation:
- Redis for hot entity data
- Query result caching
- Cache invalidation on updates
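The caching behaviour described above (read-through lookups, expiry, and invalidation on update) can be illustrated with an in-process stand-in. The sketch below uses a plain dictionary where production would use Redis; `EntityCache` and its `loader` callback are hypothetical names, and the injectable `clock` exists only to make expiry testable.

```python
import time


class EntityCache:
    """Cache-aside lookup with a TTL; a dict stands in for Redis here."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader    # fetches the entity from the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}       # entity_id -> (expires_at, value)

    def get(self, entity_id):
        now = self._clock()
        entry = self._entries.get(entity_id)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit
        value = self._loader(entity_id)          # miss or expired: reload
        self._entries[entity_id] = (now + self._ttl, value)
        return value

    def invalidate(self, entity_id):
        """Call after any write to the entity so readers see fresh data."""
        self._entries.pop(entity_id, None)
```

The same three operations map onto Redis as a `GET`, a `SET` with an expiry, and a `DEL` issued from the write path.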
Problem: No audit trail, can't replay history
Implementation:
- Event log for all data changes
- Temporal versioning of entities
- Point-in-time queries
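A point-in-time query falls out naturally once every change is stored as an event: the state at time T is the result of replaying all events up to T. The following in-memory sketch shows the idea; `Event`, `EventLog`, and `state_at` are hypothetical names, and a real implementation would persist events in an append-only table rather than a list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    entity_id: str
    occurred_at: datetime
    changes: dict  # field name -> new value


class EventLog:
    def __init__(self):
        self._events = []  # append-only

    def append(self, entity_id, occurred_at, **changes):
        self._events.append(Event(entity_id, occurred_at, changes))

    def state_at(self, entity_id, as_of):
        """Replay events up to as_of to rebuild the entity's state."""
        state = {}
        for event in sorted(self._events, key=lambda e: e.occurred_at):
            if event.entity_id == entity_id and event.occurred_at <= as_of:
                state.update(event.changes)
        return state
```

Because events are never modified, the log doubles as the audit trail, and replaying it from the start reproduces history exactly.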
Problem: A single LLM pass can miss entities or hallucinate them
Implementation:
- Fast pre-filter with spaCy (currently only partially implemented)
- LLM extraction only for high-signal articles
- Cross-validate with multiple models
- Confidence scoring based on agreement
Files to modify:
- src/extraction/llm_extractor.py - Add ensemble logic
- src/extraction/spacy_extractor.py - Improve as pre-filter
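Confidence scoring based on agreement can be as simple as the fraction of models that produced a given extraction. The sketch below assumes each model's output has already been normalized to a set of `(entity, relation)` tuples; that representation, the function names, and the 0.6 threshold are illustrative assumptions, not existing behaviour.

```python
from collections import Counter


def score_by_agreement(extractions):
    """Score each item by the share of models that extracted it.

    extractions: one set of (entity, relation) tuples per model.
    """
    if not extractions:
        return {}
    counts = Counter(item for result in extractions for item in set(result))
    total = len(extractions)
    return {item: count / total for item, count in counts.items()}


def split_for_review(scores, threshold=0.6):
    """Accept items at or above the threshold; flag the rest for review."""
    accepted = {item for item, score in scores.items() if score >= threshold}
    flagged = set(scores) - accepted
    return accepted, flagged
```

An item reported by only one of three models scores 1/3 and lands in the flagged set, which is a natural input for a low-confidence review queue.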
Problem: No way to correct extraction errors
Implementation:
- Flag low-confidence extractions for review
- Admin UI for corrections
- Feedback loop to improve models
Problem: Model doesn't improve over time
Implementation:
- Track correction patterns
- Fine-tune extraction prompts based on feedback
- A/B test extraction strategies
| Source | Data Type | Priority |
|---|---|---|
| LinkedIn Jobs API | Hiring signals | High |
| Crunchbase API | Funding, company data | High |
| PitchBook API | Detailed funding rounds | Medium |
| Twitter/X API | Real-time signals | Medium |
| Company career pages | Direct hiring data | Medium |
| GitHub | Engineering team activity | Low |
- Track accuracy by source
- Weight confidence by source reliability
- Detect and handle source-specific biases
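One way to weight confidence by source reliability is sketched below. It tracks per-source accuracy from reviewed extractions and combines observations with a noisy-OR rule, which assumes the sources err independently; that assumption, the Laplace-smoothed reliability estimate, and the names `SourceStats` and `combined_confidence` are all choices made for this illustration rather than a settled design.

```python
class SourceStats:
    """Running accuracy for one data source, from reviewed extractions."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def record(self, was_correct):
        self.total += 1
        self.correct += int(was_correct)

    @property
    def reliability(self):
        # Laplace smoothing: a source with no history starts at 0.5.
        return (self.correct + 1) / (self.total + 2)


def combined_confidence(observations, stats):
    """Noisy-OR over (source_name, extraction_confidence) pairs.

    Each observation is discounted by its source's reliability; the fact
    is considered wrong only if every source independently missed.
    """
    miss = 1.0
    for source, confidence in observations:
        miss *= 1.0 - stats[source].reliability * confidence
    return 1.0 - miss
```

Corroboration from a second source raises the combined score, but a weak source contributes less than a strong one, which is the intended effect of source weighting.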
Implementation:
- FastAPI endpoints for programmatic access
- GraphQL for flexible entity queries
- API versioning
- Rate limiting
Files to create:
- src/api/ - New API module
- src/api/v1/ - Versioned endpoints
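Rate limiting is commonly implemented with a token bucket, which permits short bursts while capping the sustained request rate. The sketch below is framework-independent and uses only the standard library; `TokenBucket` is a hypothetical helper, and in the API it would be kept per API key and checked before each request is handled.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A request that gets `False` would typically be answered with HTTP 429. With several API workers the bucket state would need to live in shared storage such as Redis rather than in process memory.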
- OAuth 2.0 / API keys
- Role-based access control
- Audit logging
- Real-time notifications on signals
- Configurable alert rules
- Slack/Email integrations
- Salesforce connector
- HubSpot connector
- Export to CSV/Excel
- Dockerfile for web app
- docker-compose for local dev
- Kubernetes manifests for production
- Structured logging (already using structlog)
- Metrics (Prometheus)
- Distributed tracing
- Error tracking (Sentry)
- GitHub Actions for tests
- Automated deployments
- Database migrations in pipeline
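The configurable alert rules and Slack/Email integrations listed above could share a small rule-matching core like the one sketched here. The `AlertRule` fields, the signal dictionary keys (`type`, `confidence`, `summary`), and the `senders` mapping are hypothetical; real senders would wrap the Slack and email clients.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    signal_type: str       # e.g. "funding" or "hiring"
    min_confidence: float
    channel: str           # key into the senders mapping, e.g. "slack"


def dispatch(signal, rules, senders):
    """Send the signal to every channel whose rule it satisfies.

    signal:  dict with "type", "confidence", and "summary" keys.
    senders: mapping of channel name -> callable taking a message string.
    Returns the names of the rules that fired.
    """
    fired = []
    for rule in rules:
        matches_type = signal["type"] == rule.signal_type
        if matches_type and signal["confidence"] >= rule.min_confidence:
            senders[rule.channel](f"[{rule.name}] {signal['summary']}")
            fired.append(rule.name)
    return fired
```

Keeping rules as plain data means they can be stored in the database and edited from the UI without a deploy.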
┌─────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│ SEC EDGAR│ RSS/News │ LinkedIn │ Crunchbase│ Company Sites │
│ Form D │ Feeds │ API │ API │ (Careers) │
└────┬─────┴────┬─────┴────┬─────┴────┬──────┴───────┬────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ INGESTION LAYER (Kafka/SQS) │
│ • Rate limiting • Deduplication • Schema validation │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ EXTRACTION LAYER │
│ • spaCy NER (fast, cheap) │
│ • LLM extraction (high-value articles only) │
│ • Confidence scoring │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ ENTITY RESOLUTION │
│ • ML-based entity linking │
│ • Canonical entity database │
│ • Cross-source validation │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ KNOWLEDGE GRAPH (Neo4j/Neptune) │
│ • Entities + Relationships + Provenance │
│ • Temporal versioning │
│ • Confidence decay over time │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ API LAYER │
│ • GraphQL for flexible queries │
│ • Webhooks for real-time signals │
│ • CRM integrations │
└─────────────────────────────────────────────────────────────┘
- Scheduled pipeline - Add APScheduler to run hourly
- Better entity names - Strip "a Series of..." patterns from company names
- Deduplication - Content hash to avoid re-processing
- Export to CSV - Download companies/candidates lists
- Email alerts - Daily digest of high-signal events
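Two of the items above, name cleanup and content-hash deduplication, need only the standard library. The sketch below is one possible approach: the regular expressions are assumptions based on the single example name in this roadmap, and the trailing month-and-year rule goes slightly beyond the "a Series of..." pattern, so both should be checked against real Form D data before use.

```python
import hashlib
import re

# Matches ", a Series of <fund name>" style suffixes seen in Form D issuers.
_SERIES_SUFFIX = re.compile(r",?\s+a\s+series\s+of\s+.+$", re.IGNORECASE)
# Assumption: a trailing "<month> <year>" is a fund vintage, not part of the name.
_TRAILING_DATE = re.compile(
    r"\s+(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?\s+\d{4}$",
    re.IGNORECASE,
)


def clean_company_name(raw):
    """Strip series-fund suffixes and trailing vintage dates."""
    name = _SERIES_SUFFIX.sub("", raw.strip())
    name = _TRAILING_DATE.sub("", name)
    return name.strip()


def content_hash(text):
    """Hash whitespace- and case-normalized text for duplicate detection."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True the first time a piece of content is seen."""
        digest = content_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

For example, `clean_company_name("SpaceX Dec 2025 a Series of Witz Ventures LLC")` yields `"SpaceX"`. In the pipeline the set of seen hashes would be persisted, for instance as a unique column on the article table, so that deduplication survives restarts.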
| Component | Current | Recommended |
|---|---|---|
| Database | SQLite | PostgreSQL + TimescaleDB |
| Cache | None | Redis |
| Queue | None | Celery + Redis / Temporal |
| Search | SQL LIKE | Elasticsearch |
| Graph DB | Custom SQLite | Neo4j (optional) |
| Monitoring | Logs only | Prometheus + Grafana |
| Deployment | Manual | Docker + Kubernetes |
When working on production features:
- Create a feature branch from main
- Reference this roadmap in PR descriptions
- Update checkboxes as features are completed
- Add tests for new functionality