
Refactor Full Data Schema #371

@john-a-flinn

Description


As-is


1. Purpose

1. Standardizing Data Entities

Context: Based on Issue #226, the Kāiaulu architecture was partially planned and partially built as features were added incrementally without following a centralized plan. This resulted in an "accidental architecture" where identical data entities are represented by inconsistent column names and values across different tables.

Impact: This inconsistency makes it difficult to perform inner joins across tables. Without standardized keys, users must guess from notebook content which columns in each notebook can be inner joined to other tables.

Resolution: Establish a standardized naming convention and data representation for all tables relevant to Kāiaulu users, ensuring every primary and foreign key is uniform across the entire environment.
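As a hypothetical illustration of what standardized keys buy us (the table and column names below are invented for this example, not the actual Kāiaulu schema), once two tables share the exact same key name, an inner join is unambiguous:

```r
# Two toy tables that share the standardized key "commit_hash".
commits <- data.frame(
  commit_hash = c("a1", "b2"),
  author      = c("alice", "bob")
)
file_changes <- data.frame(
  commit_hash = c("a1", "a1", "b2"),
  file        = c("R/io.R", "R/parser.R", "README.md")
)

# base::merge performs an inner join by default on the shared key.
joined <- merge(commits, file_changes, by = "commit_hash")
nrow(joined)  # 3 rows: every file change matched to its commit
```

With inconsistent names (e.g. `commit_hash` in one table and `sha` in another), the user must first discover the correspondence by reading notebook code; with uniform keys, the join is self-evident.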

2. Establishing a Formal Schema

Context: Due to Kāiaulu’s organic growth, there was no clear vision for how data saved by one notebook links to another, causing technical debt to accrue in the schema.

Impact: Users are forced to guess relationships or read through entire notebooks to understand how datasets connect, resulting in a lack of transparency and unindexed data.

Resolution: Represent the tables and columns currently residing in notebooks as formal entities within a MySQL Workbench model (.mwb file). This provides an explicit, self-documenting data model where related columns share the exact same name and can be joined reliably.

3. Enforcing Future Consistency

Context: Even with a formal schema in place, new developers may inadvertently introduce inconsistent naming or data structures when creating new notebooks, which would eventually undo these architectural improvements.

Impact: Without a "guardrail," the effort to clean up the architecture will deteriorate over time, leading back to a fragmented and unmaintainable system.

Resolution: Develop a set of R functions within R/io.R to define a strict code API for data storage. This programmatic layer ensures that any data saved by a notebook adheres to the defined schema, preventing "accidental" naming deviations in the future.
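A minimal sketch of what such a guardrail function in R/io.R could look like. The function name, schema vector, and CSV output below are assumptions for illustration, not the actual Kaiaulu API: the point is that the writer rejects any table whose columns deviate from the defined schema.

```r
# Hypothetical canonical column set for a "commit" table.
commit_table_schema <- c("commit_hash", "author_name", "author_datetime")

# Hypothetical guardrail writer: refuses to save a table whose
# columns do not exactly match the schema, so naming deviations
# fail loudly at save time instead of silently entering the data.
write_commit_table <- function(dt, out_path) {
  missing <- setdiff(commit_table_schema, names(dt))
  extra   <- setdiff(names(dt), commit_table_schema)
  if (length(missing) > 0 || length(extra) > 0) {
    stop("Table does not match commit schema. Missing: ",
         paste(missing, collapse = ", "),
         "; Extra: ", paste(extra, collapse = ", "))
  }
  # Enforce canonical column order before saving.
  write.csv(dt[, commit_table_schema], out_path, row.names = FALSE)
}
```

A notebook would then call `write_commit_table()` instead of writing files directly, so the schema is enforced programmatically rather than by convention.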

2. Process

The process will use the previous, incomplete efforts from Issue #226 as a baseline, but will prioritize the current state of the code as the source of truth. The team will analyze one notebook at a time to discuss its specific purpose. I will examine the code, existing tables, and schema to identify which entities belong in the universal schema, then propose these to @carlosparadis, explaining the reasoning for each inclusion. I will also maintain ongoing communication with Michelle to discuss "valuable data" findings and to ensure that overlapping functions from the notebooks and .yml files share identical names for future-proofing.

Once the final tables are agreed upon, I will implement the three objectives:

  1. Propose a set of tables within the MySQL Workbench schema file.

  2. Normalize column names within the notebooks to match the new schema.

  3. Implement corresponding R export functions in R/io.R to serve as guardrails for new developers.

3. Architect

Update inconsistent function designs, parameters, and interfaces to tools. This involves refactoring legacy logic to adhere to the newly defined naming conventions and ensuring all future features continue this pattern.

This serves as an umbrella issue for the related refactoring sub-issues.
