As-is
1. Purpose
1. Standardizing Data Entities
Context: Based on Issue #226, the Kāiaulu architecture was partially planned and partially built as features were added incrementally without following a centralized plan. This resulted in an "accidental architecture" where identical data entities are represented by inconsistent column names and values across different tables.
Impact: This inconsistency makes it difficult to perform inner joins across tables. Without standardized keys, users must guess from each notebook's content which columns in its tables can be joined against other tables.
Resolution: Establish a standardized naming convention and data representation for all tables relevant to Kāiaulu users, ensuring every primary and foreign key is uniform across the entire environment.
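The value of uniform keys can be illustrated with a small sketch in R (Kāiaulu uses data.table). The table and column names below are hypothetical, chosen only to show the before/after contrast, not the final agreed schema:

```r
library(data.table)

# Before standardization, the same entity can appear under different names,
# e.g. "author_email" in one table but "email" in another, so the user must
# discover the mapping by reading notebook code:
gitlog <- data.table(commit_hash = c("a1", "b2"),
                     author_email = c("x@y.z", "q@r.s"))
mail   <- data.table(email = "x@y.z", subject = "Re: patch")
merged_old <- merge(gitlog, mail, by.x = "author_email", by.y = "email")

# After standardization, both tables spell the shared key identically,
# so the join is mechanical rather than guesswork:
mail_std   <- data.table(author_email = "x@y.z", subject = "Re: patch")
merged_new <- merge(gitlog, mail_std, by = "author_email")
```

Both joins return the same single matching row; the difference is that the second one requires no per-notebook detective work to find the key.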
2. Establishing a Formal Schema
Context: Due to Kāiaulu’s organic growth, there was no clear vision for how data saved by one notebook links to another, causing technical debt to accrue in the schema.
Impact: Users are forced to guess relationships or read through entire notebooks to understand how datasets connect, leaving the data model opaque and unindexed.
Resolution: Represent the tables and columns currently residing in Notebooks as formal entities within MySQL Workbench (.mwb). This provides an explicit, self-documenting data model where related columns share the exact same name and can be joined reliably.
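What the .mwb model would make explicit can be sketched in DDL form. The table and column names below are illustrative placeholders, not the final agreed schema; the point is only that related tables spell the shared key identically, so the relationship is self-documenting:

```sql
-- Hypothetical fragment: both tables use the same spelling for the
-- shared key ("author_email"), making the join relationship explicit.
CREATE TABLE gitlog (
  commit_hash  VARCHAR(40)  NOT NULL,
  author_email VARCHAR(255) NOT NULL,
  PRIMARY KEY (commit_hash)
);

CREATE TABLE mail (
  message_id   VARCHAR(255) NOT NULL,
  author_email VARCHAR(255) NOT NULL,
  PRIMARY KEY (message_id)
);
```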
3. Enforcing Future Consistency
Context: Even with a formal schema in place, new developers may inadvertently introduce inconsistent naming or data structures when creating new notebooks, which would eventually undo these architectural improvements.
Impact: Without a "guardrail," the effort to clean up the architecture will deteriorate over time, leading back to a fragmented and unmaintainable system.
Resolution: Develop a set of R functions within R/io.R to define a strict code API for data storage. This programmatic layer ensures that any data saved by a notebook adheres to the defined schema, preventing "accidental" naming deviations in the future.
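A minimal sketch of what such a guardrail could look like, assuming a single schema definition that the writer validates against. The function name, schema list, and column names are hypothetical; the actual API will be designed in R/io.R:

```r
# Hypothetical single source of truth for table definitions.
kaiaulu_schema <- list(
  gitlog = c("commit_hash", "author_email", "author_datetimetz", "file_pathname")
)

# Refuse to save any table whose columns deviate from the agreed schema,
# so naming drift fails loudly at write time instead of surfacing later
# as an impossible join.
io_write_table <- function(dt, entity, save_path) {
  expected <- kaiaulu_schema[[entity]]
  if (is.null(expected)) {
    stop("Unknown entity '", entity, "': it is not part of the schema.")
  }
  missing_cols <- setdiff(expected, colnames(dt))
  extra_cols   <- setdiff(colnames(dt), expected)
  if (length(missing_cols) > 0 || length(extra_cols) > 0) {
    stop("Columns do not match the schema for '", entity, "'. Missing: ",
         paste(missing_cols, collapse = ", "), ". Unexpected: ",
         paste(extra_cols, collapse = ", "), ".")
  }
  # Normalize column order so every saved file has an identical layout.
  data.table::fwrite(dt[, expected, with = FALSE], save_path)
}
```

With this design, a notebook that renames a column (or invents a new one) cannot silently write a non-conforming table; the error message names the deviation directly.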
2. Process
The process will use the previous, incomplete efforts from Issue #226 as a baseline, but will prioritize the current state of the code as the source of truth. The team will analyze one notebook at a time to discuss its specific purpose. I will examine the code, existing tables, and schema to identify which entities belong in the universal schema, and propose these to @carlosparadis, explaining the reasoning for each inclusion. I will also maintain ongoing communication with Michelle to discuss "valuable data" findings and to ensure that overlapping functions from the notebooks and .yml files share identical names for future-proofing.
Once the final tables are agreed upon, I will implement the three objectives:
- Propose a set of tables within the MySQL Workbench schema file (.mwb).
- Normalize column names within the notebooks to match the new schema.
- Implement corresponding R export functions in R/io.R to serve as guardrails for new developers.
3. Architect
Update inconsistent function designs, parameters, and interfaces to tools. This involves refactoring legacy logic to adhere to the newly defined naming conventions and ensuring all future features continue this pattern.
This issue serves as an umbrella for the following sub-issues:
- Refactor Gitlog Notebook Data Schema Architecture #358
- Refactor Depends/Scitools and Kaiaulu Architecture Notebook Data Schema Architecture #372
- Refactor JIRA Notebook Data Schema Architecture #373
- Refactor Mail Notebook Data Schema Architecture #374
- Refactor Bug Count Notebook Data Schema Architecture #386
- Refactor Kaiaulu Architecture Notebook Data Schema Architecture #388