The Data Commons Model Source Generator Project (Plaster)

Note

The code in this repository has been made public as-is for informational purposes. The repository may use private resources for the building and execution of the code. For example, private registries may be used for dependency resolution.

The documentation may refer to restricted URLs.

GDC internship project for generating data model source code.

Purpose

This project is a drop-in replacement for the project https://github.com/NCI-GDC/gdcdatamodel, without the challenges and obscurity associated with using gdcdatamodel. The resulting code will be readable, pass static and linting checks, and completely remove the delays caused by dictionary load times.

Goal

Given any compliant gdcdictionary, generate source code that can replace the code gdcdatamodel generates at runtime.

Data Commons Models

The data commons are a collection of data structures representing concepts within a subject area. These data structures usually form a graph, with edges representing their relationships to one another. The data structures and relationships are defined as JSON schema in yaml files that are distributed via a git repository. These definitions are called Dictionaries for short. The gdcdictionary is one example of a data commons, with a primary focus on cancer. Dictionaries are updated and released frequently, with each release adding or removing nodes, edges, or properties.

These data structures are converted to Python classes at runtime by the gdcdatamodel project. For example, the case yaml file will autogenerate the models.Case Python class with properties and methods matching those defined in the yaml file. The generated classes are sqlalchemy database entities that map to tables in the database.
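To illustrate what runtime generation means, here is a minimal sketch in plain Python. It is not the gdcdatamodel implementation and does not use sqlalchemy; the schema contents, the `build_model_class` helper, and the `__label__` attribute are assumptions made up for this example.

```python
# Hypothetical, heavily simplified stand-in for a dictionary node definition.
# Real dictionaries are JSON schema stored in yaml files.
CASE_SCHEMA = {
    "id": "case",
    "properties": {
        "submitter_id": {"type": "string"},
        "days_to_index": {"type": "integer"},
    },
}


def build_model_class(schema):
    """Create a class at runtime from a schema (illustrative only)."""
    class_name = schema["id"].title()  # "case" -> "Case"
    # Every property starts out as None on the generated class.
    attributes = {name: None for name in schema["properties"]}
    attributes["__label__"] = schema["id"]
    return type(class_name, (object,), attributes)


# The class only exists once this line has run; no source file defines it.
Case = build_model_class(CASE_SCHEMA)
case = Case()
case.submitter_id = "example-case-1"
```

Because `Case` is produced by a call to `type()`, there is no class body on disk for a reviewer, linter, or type checker to read.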

The psqlgraph project makes querying using these entities more uniform across different use cases, by exposing common modules, classes and functions that are useful for manipulating data stored using sqlalchemy.

Problems:

  • Runtime-generated code cannot be peer reviewed or inspected. This forces developers to switch between dictionary definitions and code to understand what a particular piece of code is doing. Most projects within the center have this problem since they all rely on gdcdatamodel for the database entities.
  • Runtime-generated code also means no type checking and little chance of running linting or static analysis tools like flake8.
  • Runtime model code generation takes a few seconds, and possibly as long as a few minutes, to complete. This means that any project that makes use of gdcdatamodel must pay for this in one way or another. The most common cost is start-up time.

In summary, most projects within the center suffer simply because they rely on gdcdatamodel for database entities. The major goal of this project is to eliminate runtime code generation in gdcdatamodel, thereby eliminating the problems listed above.

Project Details

Requirements

  • Python >= 3.8
  • No direct dependency on any dictionary versions
  • Must expose scripts that can be invoked to generate source code
  • Must include unit and integration tests with over 80% code coverage
  • Must provide typings and pass mypy checks

Features

  • Dictionary selection and loading
  • Template management
  • Code generation
  • Scripts

Dictionary selection and loading

This module will be responsible for loading a dictionary given the necessary parameters. These parameters will include:

  • A git URL
  • A target version, tag, commit or branch name
  • A label used for referencing the dictionary later
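The three parameters above could be grouped into a small value object, roughly as sketched below. `DictionarySpec`, `REGISTRY`, and `register` are hypothetical names for illustration and are not Plaster's actual API; the URL and version are example values only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionarySpec:
    """Hypothetical container for the dictionary loading parameters."""

    git_url: str  # where the dictionary repository lives
    version: str  # tag, commit, or branch name to check out
    label: str    # short name used to refer to this dictionary later


# Hypothetical in-memory registry keyed by label.
REGISTRY = {}


def register(spec):
    """Store a spec under its label, rejecting duplicate labels."""
    if spec.label in REGISTRY:
        raise ValueError(f"label already registered: {spec.label}")
    REGISTRY[spec.label] = spec
    return spec


gdc = register(
    DictionarySpec(
        git_url="https://github.com/NCI-GDC/gdcdictionary.git",
        version="2.6.3",  # example value only
        label="gdcdictionary",
    )
)
```

Keeping the label separate from the URL and version is one way to let later steps refer to a dictionary by name without depending on a specific release.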

Template Management

This module will be responsible for the templates used to generate the final source code.
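As a rough idea of template-driven generation, the sketch below renders a class definition as text using only the standard library. The template, the `TYPE_MAP` mapping, and the `render_class` helper are assumptions for illustration; the templates Plaster actually ships may look quite different.

```python
from string import Template

# Hypothetical template for one generated model class.
CLASS_TEMPLATE = Template(
    "class $class_name:\n"
    '    """Generated from the \'$label\' dictionary node."""\n'
    "\n"
    "$fields\n"
)

# Hypothetical mapping from JSON schema types to Python annotations.
TYPE_MAP = {"string": "str", "integer": "int", "boolean": "bool", "number": "float"}


def render_class(label, properties):
    """Render Python source text for one node definition."""
    lines = [
        f"    {name}: {TYPE_MAP.get(spec.get('type'), 'object')}"
        for name, spec in sorted(properties.items())
    ]
    # An empty class body still needs a statement to be valid Python.
    fields = "\n".join(lines) if lines else "    pass"
    return CLASS_TEMPLATE.substitute(
        class_name=label.title().replace("_", ""),
        label=label,
        fields=fields,
    )


source = render_class("case", {"submitter_id": {"type": "string"}})
print(source)
```

Unlike a class built at runtime, the rendered text can be written to a file, committed, reviewed, and checked by linters and mypy.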

How to use

Install plaster

pip install .

Generate gdcdictionary

plaster generate -p gdcdictionary -o "example/gdcdictionary"

Generate biodictionary

plaster generate -p biodictionary -o "example/biodictionary"

Known Issues

If you encounter a stack trace similar to this:

  File "/<path_to_gdcdatamodel2>/gdcdatamodel2/venv_plaster/lib/python3.8/site-packages/dulwich/refs.py", line 337, in __getitem__
    raise KeyError(name)
KeyError: b'refs/tags/2.6.3'

Delete the ~/.gml/ directory. This forces psqlgml to clone a fresh copy of gdcdictionary.
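If you prefer to clear the cache from Python, a minimal sketch follows. `clear_dictionary_cache` is a hypothetical helper, not part of Plaster or psqlgml; it assumes the cache lives in ~/.gml/ as described above.

```python
import shutil
from pathlib import Path


def clear_dictionary_cache(cache_dir=None):
    """Remove the cache directory; return True if something was deleted."""
    # Default to ~/.gml, the location named in this section.
    target = Path(cache_dir) if cache_dir is not None else Path.home() / ".gml"
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Calling `clear_dictionary_cache()` with no arguments removes ~/.gml/; pass a path to target a different directory.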

Associated Projects

Repo Visualizer

Visualization of this repo

CI-CD

rebuild_gdcdatamodel2 Workflow

The rebuild_gdcdatamodel2 pipeline is designed to verify that Plaster can successfully integrate with and build the GDC dictionary. This job requires a manual trigger to execute.

It is safe to trigger this job manually. It does not affect production because it skips the push_datamodels_tag_to_github stage, ensuring no new version tags are created.

Unlike standard releases, which publish to pypi-releases, builds triggered from Plaster publish to the snapshot repository, pypi-snapshots.