Description
Background
The PostgreSQL implementation of BETYdb had a number of data constraints. These did a great job of enforcing data integrity (value ranges, foreign keys, uniqueness, etc.). After migrating to a CSV-based dataset, these database constraints are no longer enforced automatically.
Constraints are:
- described and documented in Constraints for BETYdb
- enumerated in the Constraints Spreadsheet
- implemented in the PostgreSQL schema (db/structure.sql in the bety repository), although not all documented constraints were implemented there
Scope
It is not necessary to replicate all constraints, which would be a lot of work with diminishing returns (there are a lot!).
- translate PostgreSQL constraints into validation checks written in R and run by GitHub Actions
- focus on useful constraints only
Approach
- Prioritize constraints
- Discuss implementation approach here. Some options:
  - validation functions, e.g. in data-raw/validation.R, called by data-raw/make-data.R
  - testthat tests, e.g. tests/testthat/test-data-constraints.R
  - combination of the above
  - other
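To make the options above concrete, here is a minimal sketch of the "combination" approach: one validator that the build script can call and that a testthat test can reuse. The function name `validate_required_columns`, the `traits` table, and its columns are hypothetical and only illustrate the shape; nothing here is taken from the existing repository.

```r
# Hypothetical validator: returns a character vector of problems
# (empty when the table passes). Names are illustrative assumptions.
validate_required_columns <- function(df, required, table_name) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) == 0) {
    return(character(0))
  }
  sprintf("%s: missing required column '%s'", table_name, missing_cols)
}

# Sketch of use inside a build script such as data-raw/make-data.R:
# stop the build if any problem is reported.
traits <- data.frame(id = 1:3, mean = c(1.2, 3.4, 5.6))
problems <- validate_required_columns(traits, c("id", "mean"), "traits")
if (length(problems) > 0) {
  stop(paste(problems, collapse = "\n"))
}

# The same function could back a testthat test, e.g. in
# tests/testthat/test-data-constraints.R:
#   test_that("traits has required columns", {
#     expect_length(
#       validate_required_columns(traits, c("id", "mean"), "traits"), 0
#     )
#   })
```

Returning a vector of messages rather than stopping immediately lets one run report every violation at once, which tends to be friendlier in CI logs.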
Large chunks of this work, especially translating constraints to R, may be well suited to LLMs because the constraints are formally defined and the most important ones are already implemented in SQL.
Priorities
General Approach
Prioritize constraints that:
- prevent real data corruption
- avoid complex cross-table logic
- are easy to understand and maintain
Value constraints
- numeric ranges (e.g., percentages between 0–100, precipitation ≥0)
- non-negative counts (n ≥ 0)
- sanity bounds for variables (min, max from variables table)
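As a rough illustration of the value constraints listed above, the following base R sketch checks a fixed range and per-variable bounds. The column names (`variable_id`, `mean`, `min`, `max`) and the assumption that bounds are stored as numbers are guesses for the example, not a description of the actual CSV files.

```r
# Return the row indices of values outside [lower, upper].
# NA values are skipped here; missingness is a separate check.
check_range <- function(x, lower = -Inf, upper = Inf) {
  which(!is.na(x) & (x < lower | x > upper))
}

# Look up each row's bounds in a variables table and flag violations.
# Assumes hypothetical numeric columns `min` and `max` in `variables`.
check_variable_bounds <- function(data, variables) {
  idx <- match(data$variable_id, variables$variable_id)
  lower <- variables$min[idx]
  upper <- variables$max[idx]
  # which() drops NA comparisons, so rows without usable bounds are not flagged
  which(data$mean < lower | data$mean > upper)
}

# Made-up example data
percent_cover <- c(0, 50, 101, NA, -5)
bad_percent <- check_range(percent_cover, lower = 0, upper = 100)

variables <- data.frame(variable_id = c(1, 2), min = c(0, 0), max = c(100, Inf))
traits <- data.frame(variable_id = c(1, 1, 2, 2), mean = c(50, 120, -1, 10))
bad_rows <- check_variable_bounds(traits, variables)
```

Here `bad_percent` holds the positions of 101 and -5, and `bad_rows` holds the two rows of `traits` that fall outside their variable's bounds.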
Uniqueness constraints
- natural keys that prevent duplicate rows
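A uniqueness constraint on a natural key can be approximated with `duplicated()`. This is a sketch only: the `species` table and the choice of `genus` plus `species` as the key are hypothetical examples, and the real keys would come from the constraints documentation.

```r
# Return every row whose natural key appears more than once,
# including the first occurrence, so all conflicting rows are visible.
find_duplicate_keys <- function(df, key_cols) {
  keys <- df[, key_cols, drop = FALSE]
  dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  df[dup, , drop = FALSE]
}

# Made-up example data
species <- data.frame(
  id = 1:3,
  genus = c("Acer", "Acer", "Quercus"),
  species = c("rubrum", "rubrum", "alba")
)
dups <- find_duplicate_keys(species, c("genus", "species"))
```

In this example `dups` contains the two "Acer rubrum" rows; an empty result would mean the key is unique.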
Non-NULL constraints
- critical fields required to interpret measurements
- natural key columns
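In CSV files a SQL NULL usually shows up as `NA` or an empty string after reading, so a non-NULL check may need to cover both. The sketch below assumes that convention; the function name and the `site_id` and `date` columns are hypothetical.

```r
# Count missing entries (NA, or empty/whitespace-only strings) per required
# column, and return only the columns that have at least one.
find_missing_required <- function(df, required_cols) {
  counts <- vapply(required_cols, function(col) {
    values <- df[[col]]
    blank <- is.character(values) & trimws(as.character(values)) == ""
    sum(is.na(values) | blank)
  }, integer(1))
  counts[counts > 0]
}

# Made-up example data
measurements <- data.frame(
  site_id = c(1, NA, 3),
  date = c("2020-01-01", "", "  "),
  notes = c(NA, NA, "x")
)
missing_counts <- find_missing_required(measurements, c("site_id", "date"))
```

The result is a named integer vector, so a CI message can state which column failed and how many rows are affected.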
Standardization
- whitespace normalization
- canonical values for units or categorical variables
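The two standardization items above could look roughly like this in base R. The list of allowed unit strings is invented for the example; the real canonical values would need to be defined by the project.

```r
# Collapse runs of whitespace to a single space and trim both ends.
normalize_whitespace <- function(x) {
  trimws(gsub("\\s+", " ", x))
}

# Return the distinct non-missing values that are not in the allowed set.
find_noncanonical <- function(x, allowed) {
  unique(x[!is.na(x) & !(x %in% allowed)])
}

# Made-up example data and a hypothetical allowed set
units <- c("kg", " kg ", "Kg", "g", NA)
clean_units <- normalize_whitespace(units)
unexpected <- find_noncanonical(clean_units, allowed = c("kg", "g"))
```

Normalizing before checking avoids flagging values that differ only by stray spaces, while case differences such as "Kg" are still reported for a human to resolve.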
Deliverables
- R validation functions implementing key constraints
- automated tests ensuring they run in CI
- documentation describing which constraints are enforced