feat: Comprehensive Database Architecture & Schema Design (Research-3) #143
base: develop
# Motivation

The **Codegen on OSS** package provides a pipeline that:

- **Collects repository URLs** from different sources (e.g., CSV files or GitHub searches).
- **Parses repositories** using the codegen tool.
- **Profiles performance** and logs metrics for each parsing run.
- **Logs errors** to help pinpoint parsing failures or performance bottlenecks.

# Content

see [codegen-on-oss/README.md](https://github.com/codegen-sh/codegen-sdk/blob/acfe3dc07b65670af33b977fa1e7bc8627fd714e/codegen-on-oss/README.md)

# Testing

`uv run modal run modal_run.py`

No unit tests yet 😿

# Please check the following before marking your PR as ready for review

- [ ] I have added tests for my changes
- [x] I have updated the documentation or added new documentation as needed
Original commit by Tawsif Kamal: Revert "Revert "Adding Schema for Tool Outputs"" (codegen-sh#894) Reverts codegen-sh#892 --------- Co-authored-by: Rushil Patel <[email protected]> Co-authored-by: rushilpatel0 <[email protected]>
Original commit by Ellen Agarwal: fix: Workaround for relace not adding newlines (codegen-sh#907)
…-enhanced-visualization-features
…oyment-scripts
- Complete 25+ page database architecture document
- 10 production-ready SQL schema files covering:
  * Task management and execution tracking
  * Codebase analysis and code relationships
  * Multi-platform event tracking (ClickHouse)
  * Project and workflow management
  * Evaluation and effectiveness analysis
  * Analytics and performance metrics
  * Inter-entity relationship mapping
  * Caching and optimization
  * Audit trails and compliance
  * Advanced indexing strategies
- Database initialization and migration system
- Python database abstraction layer interface
- Hybrid PostgreSQL + ClickHouse architecture
- Support for Graph-Sitter, Codegen SDK, and Contexten integration
- Comprehensive documentation and setup guides

Addresses ZAM-1017: Research-3 database architecture requirements
Reviewer's Guide
This PR establishes a full-scale hybrid PostgreSQL/ClickHouse database architecture by delivering a comprehensive design document and README, production-ready SQL schema files for all core modules, an initialization/migration framework, and a unified Python abstraction layer.
Sequence Diagram: ETL Data Flow from PostgreSQL to ClickHouse
sequenceDiagram
participant PG as PostgreSQL
participant ETL as ETL Process
participant CH as ClickHouse
Note over PG, CH: Initial data written to PostgreSQL (OLTP)
PG ->>+ ETL: Data changes / new data available (e.g., from event_staging)
ETL ->> ETL: Extract relevant data
ETL ->> ETL: Transform data for analytical workloads
ETL ->>+ CH: Load transformed data into OLAP tables (e.g., events, analytics_aggregates)
CH -->>- ETL: Acknowledge data load
ETL -->>- PG: Update staging tables (e.g., mark as processed)
Note over PG, CH: Analytical queries now use ClickHouse (OLAP)
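The extract, transform, load, and mark-as-processed loop in the sequence diagram above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for PostgreSQL, a plain Python list stands in for the ClickHouse `events` table, and the `event_staging` columns (`platform`, `event_type`, `processed`) and the function name `run_etl_batch` are hypothetical, not taken from the PR's schema files.

```python
import sqlite3


def run_etl_batch(pg, olap_rows):
    """Move unprocessed staging rows into the OLAP stand-in; return rows loaded."""
    # Extract: read only rows that have not been marked as processed yet.
    rows = pg.execute(
        "SELECT id, platform, event_type FROM event_staging "
        "WHERE processed = 0 ORDER BY id"
    ).fetchall()
    if not rows:
        return 0

    # Transform: reshape rows for analytical use (here: normalize the platform name).
    transformed = [
        {"source_id": r[0], "platform": r[1].lower(), "event_type": r[2]}
        for r in rows
    ]

    # Load: append to the OLAP stand-in (a real pipeline would insert into ClickHouse).
    olap_rows.extend(transformed)

    # Acknowledge: mark staging rows as processed only after the load succeeded.
    pg.executemany(
        "UPDATE event_staging SET processed = 1 WHERE id = ?",
        [(r[0],) for r in rows],
    )
    pg.commit()
    return len(rows)


# Hypothetical staging table; column names are assumptions for this sketch.
pg = sqlite3.connect(":memory:")
pg.execute(
    "CREATE TABLE event_staging ("
    "id INTEGER PRIMARY KEY, platform TEXT, event_type TEXT, "
    "processed INTEGER DEFAULT 0)"
)
pg.executemany(
    "INSERT INTO event_staging (platform, event_type) VALUES (?, ?)",
    [("GitHub", "push"), ("Slack", "message")],
)
pg.commit()

clickhouse_events = []
first_batch = run_etl_batch(pg, clickhouse_events)
second_batch = run_etl_batch(pg, clickhouse_events)
```

Marking rows as processed after the load, rather than before, means a failed load leaves them eligible for the next batch, at the cost of possible duplicates if the acknowledgement itself fails.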
ER Diagram for Projects Schema (projects_schema.sql)
erDiagram
projects {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR name
VARCHAR status
BIGINT owner_id "FK to users"
BIGINT created_by "FK to users"
}
project_teams {
BIGSERIAL id PK
BIGINT project_id FK
BIGINT user_id "FK to users"
VARCHAR role
BIGINT added_by "FK to users"
}
project_milestones {
BIGSERIAL id PK
BIGINT project_id FK
VARCHAR name
VARCHAR status
BIGINT created_by "FK to users"
}
workflows {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
BIGINT project_id FK "nullable"
VARCHAR name
JSONB definition
VARCHAR status
BIGINT created_by "FK to users"
}
workflow_executions {
BIGSERIAL id PK
BIGINT workflow_id FK
BIGINT organization_id "FK to organizations"
BIGINT project_id "nullable FK to projects"
BIGINT triggered_by_user_id "nullable FK to users"
VARCHAR status
}
workflow_step_executions {
BIGSERIAL id PK
BIGINT execution_id FK
VARCHAR step_name
VARCHAR status
}
project_metrics {
BIGSERIAL id PK
BIGINT project_id FK
DATE metric_date
INT total_tasks
}
project_reports {
BIGSERIAL id PK
BIGINT project_id FK
VARCHAR report_name
BIGINT generated_by "FK to users"
}
project_templates {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR name
JSONB template_data
BIGINT created_by "FK to users"
}
projects ||--o{ project_teams : "has"
projects ||--o{ project_milestones : "has"
projects |o--o{ workflows : "defines"
projects ||--o{ project_metrics : "tracks"
projects ||--o{ project_reports : "generates"
workflows ||--o{ workflow_executions : "runs"
workflow_executions ||--o{ workflow_step_executions : "contains_steps"
workflow_executions }o--o| projects : "executes_for"
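The `workflow_executions` to `workflow_step_executions` relationship suggests that an execution's overall state can be derived from its steps. The sketch below models that idea with dataclasses. The status values (`pending`, `running`, `completed`, `failed`) and the roll-up rule are assumptions for illustration; the PR does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkflowStepExecution:
    execution_id: int
    step_name: str
    status: str  # assumed values: "pending", "running", "completed", "failed"


@dataclass
class WorkflowExecution:
    id: int
    workflow_id: int
    steps: List[WorkflowStepExecution] = field(default_factory=list)

    def derived_status(self) -> str:
        """Roll step statuses up into a single execution status (assumed rule)."""
        statuses = [step.status for step in self.steps]
        if not statuses or all(s == "pending" for s in statuses):
            return "pending"
        if "failed" in statuses:
            return "failed"
        if all(s == "completed" for s in statuses):
            return "completed"
        return "running"


execution = WorkflowExecution(id=1, workflow_id=10)
execution.steps.append(WorkflowStepExecution(1, "checkout", "completed"))
execution.steps.append(WorkflowStepExecution(1, "analyze", "running"))
status_in_progress = execution.derived_status()

execution.steps[1].status = "completed"
status_done = execution.derived_status()
```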
ER Diagram for Analytics Schema (analytics_schema.sql)
erDiagram
daily_analytics {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
DATE metric_date
INT tasks_created
}
weekly_analytics {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
DATE week_start_date
INT total_tasks_created
}
monthly_analytics {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR month_year
INT objectives_completed
}
realtime_metrics {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR metric_name
DECIMAL metric_value
}
performance_metrics {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR service_name
DECIMAL response_time_ms
BIGINT user_id "nullable FK to users"
}
dashboards {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR name
JSONB layout_config
BIGINT created_by "FK to users"
}
dashboard_widgets {
BIGSERIAL id PK
BIGINT dashboard_id FK
VARCHAR widget_name
JSONB config
}
scheduled_reports {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR name
JSONB report_config
BIGINT created_by "FK to users"
}
report_executions {
BIGSERIAL id PK
BIGINT scheduled_report_id FK
VARCHAR execution_status
}
metric_calculations {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR calculation_name
BIGINT created_by "FK to users"
}
trend_analysis {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR metric_name
VARCHAR trend_direction
}
correlation_analysis {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR metric_a
VARCHAR metric_b
DECIMAL correlation_coefficient
}
dashboards ||--o{ dashboard_widgets : "contains"
scheduled_reports ||--o{ report_executions : "has"
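To make the `correlation_analysis` table concrete, here is a minimal sketch of how a `correlation_coefficient` for `metric_a` and `metric_b` might be computed. It assumes the coefficient is a Pearson correlation, which the PR does not state, and the metric names and sample values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the per-series sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily series for two metrics.
tasks_created = [1.0, 2.0, 3.0, 4.0]
response_time_ms = [2.0, 4.0, 6.0, 8.0]

row = {
    "metric_a": "tasks_created",
    "metric_b": "response_time_ms",
    "correlation_coefficient": pearson(tasks_created, response_time_ms),
}
```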
ER Diagram for Codebases Schema (codebases_schema.sql)
erDiagram
codebases {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
BIGINT project_id "nullable FK to projects"
VARCHAR name
VARCHAR status
BIGINT created_by "FK to users"
}
code_files {
BIGSERIAL id PK
BIGINT codebase_id FK
VARCHAR file_path
VARCHAR language
DECIMAL complexity_score
}
code_symbols {
BIGSERIAL id PK
BIGINT file_id FK
BIGINT codebase_id FK
VARCHAR name
VARCHAR symbol_type
BIGINT parent_symbol_id "nullable FK to code_symbols"
}
code_relationships {
BIGSERIAL id PK
BIGINT codebase_id FK
VARCHAR source_type
BIGINT source_file_id "nullable FK to code_files"
VARCHAR target_type
BIGINT target_file_id "nullable FK to code_files"
VARCHAR relationship_type
}
codebase_analysis_sessions {
BIGSERIAL id PK
BIGINT codebase_id FK
UUID session_id
VARCHAR status
BIGINT triggered_by "nullable FK to users"
}
codebase_quality_metrics {
BIGSERIAL id PK
BIGINT codebase_id FK
DATE metric_date
DECIMAL average_complexity
}
code_hotspots {
BIGSERIAL id PK
BIGINT codebase_id FK
BIGINT file_id FK
VARCHAR risk_level
}
external_dependencies {
BIGSERIAL id PK
BIGINT codebase_id FK
VARCHAR package_name
BOOLEAN has_vulnerabilities
}
codebases ||--o{ code_files : "contains"
codebases ||--o{ code_symbols : "defines"
codebases ||--o{ code_relationships : "has_defined"
codebases ||--o{ codebase_analysis_sessions : "undergoes"
codebases ||--o{ codebase_quality_metrics : "has"
codebases ||--o{ code_hotspots : "identifies"
codebases ||--o{ external_dependencies : "uses"
code_files ||--o{ code_symbols : "contains"
code_files |o--o{ code_relationships : "source_in"
code_files |o--o{ code_relationships : "target_in"
code_files ||--o{ code_hotspots : "can_be"
code_symbols }o--o| code_symbols : "parent_of"
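The self-referential `parent_symbol_id` column on `code_symbols` forms a tree of nested symbols. One common use of such a column is building a fully qualified name by walking from a symbol up to its root. The sketch below shows that walk over in-memory rows; the sample symbols and the `qualified_name` helper are hypothetical.

```python
# Rows keyed by id, mirroring the name / symbol_type / parent_symbol_id columns.
symbols = {
    1: {"name": "Parser", "symbol_type": "class", "parent_symbol_id": None},
    2: {"name": "parse", "symbol_type": "method", "parent_symbol_id": 1},
    3: {"name": "helper", "symbol_type": "function", "parent_symbol_id": 2},
}


def qualified_name(symbol_id, rows):
    """Join symbol names from the root ancestor down to the given symbol."""
    parts = []
    seen = set()
    current = symbol_id
    while current is not None:
        if current in seen:
            # Guard against corrupt data where a symbol is its own ancestor.
            raise ValueError(f"cycle detected at symbol {current}")
        seen.add(current)
        row = rows[current]
        parts.append(row["name"])
        current = row["parent_symbol_id"]
    return ".".join(reversed(parts))


nested = qualified_name(3, symbols)
top_level = qualified_name(1, symbols)
```

In a database, the same walk would typically be expressed as a recursive query rather than repeated lookups.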
ER Diagram for Relationships Schema (relationships_schema.sql)
erDiagram
entity_relationships {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR source_entity_type
BIGINT source_entity_id
VARCHAR target_entity_type
BIGINT target_entity_id
VARCHAR relationship_type
BOOLEAN is_inferred
}
relationship_types {
BIGSERIAL id PK
BIGINT organization_id "nullable FK to organizations"
VARCHAR type_name UK
JSONB valid_source_types
JSONB valid_target_types
BIGINT created_by "nullable FK to users"
}
task_relationships {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
BIGINT source_task_id "FK to tasks"
BIGINT target_task_id "FK to tasks"
VARCHAR relationship_type
}
code_relationships_extended {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
BIGINT codebase_id "FK to codebases"
BIGINT source_file_id "nullable FK to code_files"
BIGINT target_file_id "nullable FK to code_files"
VARCHAR relationship_type
}
user_relationships {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
BIGINT source_user_id "FK to users"
BIGINT target_user_id "FK to users"
VARCHAR relationship_type
}
relationship_graphs {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR graph_name
JSONB nodes
JSONB edges
}
relationship_patterns {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
VARCHAR pattern_name
JSONB pattern_structure
}
relationship_metrics {
BIGSERIAL id PK
BIGINT organization_id "FK to organizations"
DATE metric_date
INT total_relationships
}
entity_relationships }o--|| relationship_types : "type_governed_by (via type_name)"
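The `valid_source_types` and `valid_target_types` columns on `relationship_types` imply that an `entity_relationships` row can be checked against its governing type. A minimal sketch of such a check follows. The type names (`blocks`, `implements`) and entity types are made up, and whether the real system enforces this in application code or in the database is not stated in the PR.

```python
# Hypothetical relationship_types rows, keyed by type_name.
relationship_types = {
    "blocks": {
        "valid_source_types": ["task"],
        "valid_target_types": ["task"],
    },
    "implements": {
        "valid_source_types": ["code_file", "code_symbol"],
        "valid_target_types": ["task"],
    },
}


def is_valid_relationship(rel, types):
    """Check a relationship row against the allowed source and target types."""
    rel_type = types.get(rel["relationship_type"])
    if rel_type is None:
        return False  # unknown type_name
    return (
        rel["source_entity_type"] in rel_type["valid_source_types"]
        and rel["target_entity_type"] in rel_type["valid_target_types"]
    )


good = {
    "source_entity_type": "code_file",
    "source_entity_id": 7,
    "target_entity_type": "task",
    "target_entity_id": 42,
    "relationship_type": "implements",
    "is_inferred": False,
}
bad = dict(good, source_entity_type="user")

good_ok = is_valid_relationship(good, relationship_types)
bad_ok = is_valid_relationship(bad, relationship_types)
```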
ER Diagram for Cache Schema (cache_schema.sql)
erDiagram
cache_configurations {
BIGSERIAL id PK
VARCHAR cache_key UK
INT ttl_seconds
VARCHAR invalidation_pattern
INT max_size_mb
}
cached_results {
BIGSERIAL id PK
VARCHAR cache_key "Refers to cache_configurations.cache_key"
VARCHAR query_hash
JSONB result_data
INT result_size_bytes
TIMESTAMP expires_at
TIMESTAMP last_accessed_at
}
cache_statistics {
BIGSERIAL id PK
VARCHAR cache_key "Refers to cache_configurations.cache_key"
DATE date
INT hit_count
INT miss_count
DECIMAL total_size_mb
}
cache_configurations ||--o{ cached_results : "defines_behavior_for"
cache_configurations ||--o{ cache_statistics : "tracks_stats_for"
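The cache tables above combine a TTL (`ttl_seconds`), a per-query key (`query_hash`), an expiry timestamp (`expires_at`), and hit/miss counters. The following in-memory sketch shows how those pieces could fit together. The `ResultCache` class, the SHA-256 hashing of the query plus parameters, and the injected clock are assumptions chosen for illustration, not the PR's implementation.

```python
import hashlib
import json


class ResultCache:
    """In-memory stand-in for cached_results plus cache_statistics counters."""

    def __init__(self, ttl_seconds, clock):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # callable returning the current time in seconds
        self.entries = {}   # query_hash -> (result_data, expires_at)
        self.hit_count = 0
        self.miss_count = 0

    @staticmethod
    def query_hash(query, params):
        # Hash the query text together with its parameters (assumed scheme).
        payload = json.dumps([query, params], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, query, params):
        key = self.query_hash(query, params)
        entry = self.entries.get(key)
        if entry is not None and entry[1] > self.clock():
            self.hit_count += 1
            return entry[0]
        # Missing or expired: drop any stale entry and record a miss.
        self.entries.pop(key, None)
        self.miss_count += 1
        return None

    def put(self, query, params, result_data):
        expires_at = self.clock() + self.ttl_seconds
        self.entries[self.query_hash(query, params)] = (result_data, expires_at)


# A controllable clock keeps the example deterministic.
now = [0]
cache = ResultCache(ttl_seconds=60, clock=lambda: now[0])

query = "SELECT count(*) FROM tasks WHERE organization_id = %s"
before_put = cache.get(query, [1])   # miss: nothing cached yet
cache.put(query, [1], {"count": 12})
after_put = cache.get(query, [1])    # hit: within the TTL
now[0] = 61
after_expiry = cache.get(query, [1]) # miss: entry has expired
```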
Class Diagram: DatabaseInterface (database_interface.py)
classDiagram
class DatabaseInterface {
<<Interface>>
+async create_task(task_data: TaskCreate) Task
+async get_task(task_id: int, organization_id: int) Optional~Task~
+async update_task(task_id: int, updates: TaskUpdate) Task
+async delete_task(task_id: int, organization_id: int) bool
+async search_tasks(filters: TaskFilters) List~Task~
}
note for DatabaseInterface "Defines a unified Python abstraction layer for database interactions."
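The class diagram above lists five async task methods. The sketch below expresses them as a Python abstract base class with a small in-memory implementation, to show how a caller might use the interface. The method names and signatures follow the diagram; the fields of `Task`, `TaskCreate`, `TaskUpdate`, and `TaskFilters`, and the `InMemoryDatabase` class, are assumptions and are not taken from `database_interface.py`.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class Task:  # fields are assumed for illustration
    id: int
    organization_id: int
    title: str
    status: str = "pending"


@dataclass
class TaskCreate:
    organization_id: int
    title: str


@dataclass
class TaskUpdate:
    title: Optional[str] = None
    status: Optional[str] = None


@dataclass
class TaskFilters:
    organization_id: int
    status: Optional[str] = None


class DatabaseInterface(ABC):
    @abstractmethod
    async def create_task(self, task_data: TaskCreate) -> Task: ...

    @abstractmethod
    async def get_task(self, task_id: int, organization_id: int) -> Optional[Task]: ...

    @abstractmethod
    async def update_task(self, task_id: int, updates: TaskUpdate) -> Task: ...

    @abstractmethod
    async def delete_task(self, task_id: int, organization_id: int) -> bool: ...

    @abstractmethod
    async def search_tasks(self, filters: TaskFilters) -> List[Task]: ...


class InMemoryDatabase(DatabaseInterface):
    """Hypothetical backend that keeps tasks in a dict, for tests and demos."""

    def __init__(self) -> None:
        self._tasks: Dict[int, Task] = {}
        self._next_id = 1

    async def create_task(self, task_data: TaskCreate) -> Task:
        task = Task(self._next_id, task_data.organization_id, task_data.title)
        self._tasks[task.id] = task
        self._next_id += 1
        return task

    async def get_task(self, task_id: int, organization_id: int) -> Optional[Task]:
        task = self._tasks.get(task_id)
        # Scope lookups to the caller's organization.
        if task is None or task.organization_id != organization_id:
            return None
        return task

    async def update_task(self, task_id: int, updates: TaskUpdate) -> Task:
        changes = {k: v for k, v in vars(updates).items() if v is not None}
        self._tasks[task_id] = replace(self._tasks[task_id], **changes)
        return self._tasks[task_id]

    async def delete_task(self, task_id: int, organization_id: int) -> bool:
        if await self.get_task(task_id, organization_id) is None:
            return False
        del self._tasks[task_id]
        return True

    async def search_tasks(self, filters: TaskFilters) -> List[Task]:
        return [
            t for t in self._tasks.values()
            if t.organization_id == filters.organization_id
            and (filters.status is None or t.status == filters.status)
        ]


async def demo():
    db = InMemoryDatabase()
    task = await db.create_task(TaskCreate(organization_id=1, title="Design schema"))
    await db.update_task(task.id, TaskUpdate(status="completed"))
    found = await db.search_tasks(TaskFilters(organization_id=1, status="completed"))
    other_org = await db.get_task(task.id, organization_id=2)
    return found, other_org


found, other_org = asyncio.run(demo())
```

Keeping callers dependent only on `DatabaseInterface` is what would let a PostgreSQL-backed implementation replace the in-memory one without changing calling code.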
🎯 Research Objective Completed
This PR delivers the comprehensive database architecture design for supporting Graph-Sitter code analysis, Codegen SDK task management, and Contexten event orchestration with advanced analytics and evaluation capabilities.
📋 Deliverables Completed
✅ 1. Database Architecture Document (25+ pages)
- research/database-architecture/docs/comprehensive-database-architecture.md
✅ 2. Production-Ready SQL Schema Files (10 schemas)
Core Schemas
- tasks_schema.sql - Task management and execution tracking with automation
- codebases_schema.sql - Repository and code analysis data with Graph-Sitter integration
- events_schema.sql - Multi-platform event tracking (ClickHouse) for Linear, Slack, GitHub, deployments
- projects_schema.sql - Project and workflow management with team collaboration
- evaluations_schema.sql - Effectiveness and outcome analysis with AI agent performance tracking
Supporting Schemas
- analytics_schema.sql - Performance metrics, dashboards, and real-time analytics
- relationships_schema.sql - Inter-entity relationship mapping with graph analysis
- cache_schema.sql - Query optimization and result caching
- audit_schema.sql - Change tracking, audit trails, and compliance
- indexes_schema.sql - Advanced indexing strategies for performance
✅ 3. Database Initialization System
- migrations/001_initial_setup.sql - Database setup and migration framework
✅ 4. Integration Interfaces
- interfaces/database_interface.py - Complete Python database abstraction layer
🏗️ Architecture Highlights
Hybrid Database Strategy
Key Features
Integration Support
📊 Success Criteria Met
🔧 Implementation Quality
Database Design Principles
Performance Optimization
Security & Compliance
🚀 Next Steps
This database architecture is ready for implementation in Core-6 (Database Implementation). The design provides:
📁 Files Changed
Total: 14 new files, 5,818+ lines of production-ready code and documentation
🎯 This completes Research-3 requirements and provides the foundation for all data storage and analytics in the integrated system.
Summary by Sourcery
Add a complete, production-ready database architecture for supporting tasks, projects, code analysis, events, analytics, relationships, caching, auditing, and evaluation with a hybrid PostgreSQL and ClickHouse strategy.
New Features:
Enhancements:
Documentation: