Remove database schemas from Specification repo #578

@mrshll1001

I think we should remove the database schemas from the specification repository, along with the GitHub Actions workflow and behaviour which generates them.

This would mean the following files get removed:

  • build_database_mysql.sh
  • build_database_postgresql.sh
  • database/
    • database_mysql.sql
    • database_postgresql.sql
  • datapackage.json
  • requirements_build_database.in
  • schema
  • validate_examples_json_schema.sh
  • .github/workflows/build_database.yml

General high-level motivation: the specification repo should be scoped to the schemas and documentation. Storage and implementation of HSDS should be abstracted away. Having these files in the repo risks implying that there is a normative way to represent HSDS internally in a program, and risks them drifting from the schema definitions when there are build failures.

Some practical matters:

We've seen that there are indeed problems stemming from maintaining these database schema files, and I'm not sure we are seeing any benefit from them. It's unclear what the state of the builds is (#554, #357), and the builds cause constant failures when working with branches for governance of the specification.

This stems from how they're integrated into the repository, which conflicts with the rules we have for governance. We also have the problem that, according to #554 and #357, the database_mysql.sql file does not get updated, meaning that it's currently out of date compared to the Specification.

The build errors stem from the tooling's attempts to generate databases and commit them back after every push. This means that governing the spec requires an additional step to override checks when merging PRs because there has been a failure. This is bad practice: training community members to override checks in order to enact governance could lead to security issues. Also, it's mildly irritating to take these additional steps and then have GitHub inform you of build errors after every push to every branch.

Even taken by themselves, I worry about the usefulness of these files. They are designed to be used as a one-off jumping-off point for bootstrapping a database schema from the HSDS Schemas, but their position in the build process and repository might imply otherwise to some people. They cannot be used for migrations. I feel that someone who would use these to bootstrap a database might also believe they can be used to migrate that database schema to newer versions as HSDS is upgraded. The fact that these files are inside the schema repository instead of elsewhere adds weight to the impression that they are normative and useful.

I think the best way forward is to remove them from the specification repo. This improves our ability to maintain and govern HSDS and reduces ambiguity. For people who want or need these kinds of auto-generated database schema files, we can take one of the following approaches:

  • Migrate the builds and files to a dedicated repo. Add a README which clarifies the purpose and build process (e.g. don't use these for migrations). This repo can be set to track the default branch of HSDS, and could auto-build the databases from the schemas after each release.
  • Write a tutorial or how-to on bootstrapping a database schema from the HSDS Schemas. Basically, use the scripts/tools which are in the current build process, but imbue them with the context and learning that people might need on their journey.
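
To give a rough idea of what the tutorial option might walk through, here is a minimal sketch of deriving a one-off `CREATE TABLE` statement from a JSON-Schema-style definition. Everything in it is an assumption made for illustration: the `EXAMPLE_SCHEMA` dict is a simplified stand-in rather than a real HSDS schema file, the `SQL_TYPES` mapping and the `create_table_sql` helper are hypothetical, and it is not how the current build scripts work.

```python
# Simplified, hypothetical stand-in for a schema definition.
# Real HSDS schemas have more fields, formats, and relationships.
EXAMPLE_SCHEMA = {
    "name": "service",
    "required": ["id", "name"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "minimum_age": {"type": "number"},
    },
}

# Assumed mapping from JSON Schema types to generic SQL column types.
SQL_TYPES = {
    "string": "TEXT",
    "number": "NUMERIC",
    "integer": "INTEGER",
    "boolean": "BOOLEAN",
}


def create_table_sql(schema: dict) -> str:
    """Build a CREATE TABLE statement for bootstrapping only, not for migrations."""
    required = set(schema.get("required", []))
    columns = []
    for field, spec in schema["properties"].items():
        # Fall back to TEXT for any type the mapping does not cover.
        sql_type = SQL_TYPES.get(spec.get("type"), "TEXT")
        not_null = " NOT NULL" if field in required else ""
        columns.append(f"    {field} {sql_type}{not_null}")
    body = ",\n".join(columns)
    return f"CREATE TABLE {schema['name']} (\n{body}\n);"


if __name__ == "__main__":
    print(create_table_sql(EXAMPLE_SCHEMA))
```

A tutorial built around something like this could make the limits explicit: the output is a starting point for a fresh database, and says nothing about how to move an existing database between HSDS versions.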
