Skip to content

Implement (migrate) constraints and validation from postgres betydb #14

@dlebauer

Description

@dlebauer

Background

The Postgres implementation of betydb had a number of data constraints. These did a great job of enforcing data integrity (value ranges, foreign keys, uniqueness, etc). When migrating to a CSV based dataset, these database constraints are no longer automatically enforced.

Constraints are

Scope

It is not necessary to replicate all constraints, which would be a lot of work with diminishing returns (there are a lot!).

  • translate PostgreSQL constraints to validation within R, and run by GH Actions
  • focus on useful constraints only

⚠️ Do not use the original design and implementation as a checklist.
⚠️ Don't overthink it. the original set of constraints took a lot of time to develop.

Approach

  1. Prioritize constraints
  2. Discuss implementation approach here. Some options:
    1. validation functions, e.g. in data-raw/validation.R, called by data-raw/make-data.R
    2. testthat tests, e.g. tests/testthat/test-data-constraints.R
    3. combination of above
    4. Other

Large chunks of this work, especially translating constraints to R, may be well suited to LLMs because they are formally defined and the most important ones are implemented in SQL.

Priorities

General Approach

Prioritize constraints that:

  • prevent real data corruption
  • avoid complex cross-table logic
  • are easy to understand and maintain

Value constraints

  • numeric ranges (e.g., percentages between 0–100, precipitation ≥0)
  • positive counts (n ≥ 0)
  • sanity bounds for variables (min, max from variables table)

Uniqueness constraints

  • natural keys that prevent duplicate rows

Non-NULL constraints

On critical fields required to interpret measurements; natural keys.

Standardization

  • whitespace normalization
  • canonical values for units or categorical variables

Deliverables

  • R validation functions implementing key constraints
  • automated tests ensuring they run in CI
  • documentation describing which constraints are enforced

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions