
Coordinated schema / data collection expansions to support useful retrieval provenance #335

@mbrush


A set of related, interdependent proposals for expanding the RIG and Biolink schemas to support additional data collection during the ingest process. The common thread is more granular ingest / retrieval provenance to support debugging and QA efforts.

1. Assign an id to each Edge Type in a RIG

  • Utility:
    • Ability to retrieve other info attached to each Edge Type in a RIG
      • e.g. UI can retrieve explanations
      • e.g. KGX Summary Report code can retrieve source_files used to create each edge type (super helpful for QA process - see below)
      • e.g. UI can point to autogenerated documentation pages created for each Edge Type populated with info from a RIG.
      • . . .
  • Process:
    • RIG author assigns this id when they create a RIG, using the pattern "[source infores]:#", e.g. "id: ctd:001" (the content of these objects may evolve over time, so do not use a hash)
    • Then ingest code adds this to each edge when it is created.
  • Modeling:
    • add an 'id' property to the EdgeType class in the RIG schema
    • add an edge_type_id edge property to the Biolink schema
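
A minimal sketch of the process above, assuming edges are handled as plain dicts during ingest. The regex is one possible reading of the "[source infores]:#" pattern, and the helper names (is_valid_edge_type_id, stamp_edge) are hypothetical, not part of any existing ingest code:

```python
import re

# Assumed reading of "[source infores]:#": a lowercase source prefix,
# a colon, then digits (e.g. "ctd:001"). Adjust if the RIG schema differs.
EDGE_TYPE_ID = re.compile(r"[a-z0-9][a-z0-9._-]*:\d+")


def is_valid_edge_type_id(value: str) -> bool:
    """Check that an id follows the assumed "<source>:<number>" pattern."""
    return EDGE_TYPE_ID.fullmatch(value) is not None


def stamp_edge(edge: dict, edge_type_id: str) -> dict:
    """Return a copy of the edge carrying the id of the RIG Edge Type that produced it."""
    if not is_valid_edge_type_id(edge_type_id):
        raise ValueError(f"unexpected edge type id: {edge_type_id!r}")
    return {**edge, "edge_type_id": edge_type_id}


# Illustrative edge; the subject/predicate/object values are placeholders.
edge = stamp_edge(
    {"subject": "EX:1", "predicate": "biolink:related_to", "object": "EX:2"},
    "ctd:001",
)
```

Because the id is assigned by the RIG author rather than derived from a hash, it stays stable when the Edge Type's content is edited later.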

2. Ability to indicate explicitly the source from which info used to create an edge was ingested

  • Utility:

    • There is a useful distinction between the primary source of an edge and the source from which its info was ingested. These are usually, but not always, the same (e.g. we ingest data from sources like drugmatrix and pdsp-ki via drug-central as an aggregator)
    • Explicitly capturing this info would help provide more granular provenance - useful for debugging, internal reporting, maintenance, and showing full provenance to users
  • Process:

  • Modeling Proposals:

    • include in the RetrievalSource object, as a new 'resource_role' value - per proposals in the PR here. No new properties required.
    • hang directly from the RetrievalSource object, or the edge - requires defining a new property
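
To make the two modeling options concrete, here is a rough sketch of how each might look on an edge, with edges as plain dicts. The role value "ingest_source", the property name "ingested_from", and the helper get_ingest_source are all placeholders invented for illustration; the actual names are whatever the PR defines. The infores identifiers are illustrative as well:

```python
from typing import Optional

# Option 1: the ingest source is one more RetrievalSource entry,
# distinguished by a new resource_role value (placeholder: "ingest_source").
edge_option_1 = {
    "sources": [
        {"resource_id": "infores:drugmatrix", "resource_role": "primary_knowledge_source"},
        {"resource_id": "infores:drugcentral", "resource_role": "ingest_source"},
    ]
}

# Option 2: a new property (placeholder: "ingested_from"), shown here on the edge.
edge_option_2 = {
    "sources": [
        {"resource_id": "infores:drugmatrix", "resource_role": "primary_knowledge_source"},
    ],
    "ingested_from": "infores:drugcentral",
}


def get_ingest_source(edge: dict) -> Optional[str]:
    """Return the source the edge was ingested from, under either option.

    Falls back to the primary knowledge source, since the two are usually the same.
    """
    if "ingested_from" in edge:
        return edge["ingested_from"]
    by_role = {s["resource_role"]: s["resource_id"] for s in edge.get("sources", [])}
    return by_role.get("ingest_source") or by_role.get("primary_knowledge_source")
```

Under either option, downstream code (reports, UI) can ask one question, "where was this ingested from?", without caring which modeling choice was made.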

3. Ability to indicate explicitly which files from a source provided the info used to create an edge

  • Utility:
    • more granular provenance for debugging.
    • would really help with QA - to organize edge types by source, and help determine source of any questions/issues.
    • identifying edges that may need to be updated/reviewed if a source updates its data/files
  • Process:
    • this info is already included in RIGs for each edge type.
      • if we also assign an EdgeType id to each edge type object in a RIG (per the proposal above), and capture this in edge metadata, we would have a connection to pull the source file(s). There could be a generic script that does this as a post-processing step for all KGX files.
  • Modeling Proposals:
    • Add a source_files property to the RetrievalSource object - see proposal in PR here
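
The generic post-processing step described above could look roughly like this. It assumes a RIG has already been loaded into a dict with an "edge_types" list whose entries carry "id" and "source_files"; those key names, the function names, and the file names are assumptions for illustration, not the actual RIG layout. For brevity the sketch attaches source_files at the top level of each edge rather than inside a RetrievalSource object:

```python
from typing import Dict, Iterable, Iterator, List


def index_source_files(rig: dict) -> Dict[str, List[str]]:
    """Map each Edge Type id in a RIG to the source files listed for it."""
    return {
        edge_type["id"]: list(edge_type.get("source_files", []))
        for edge_type in rig.get("edge_types", [])
    }


def attach_source_files(edges: Iterable[dict], rig: dict) -> Iterator[dict]:
    """Yield copies of edges with source_files looked up via edge_type_id.

    Edges with a missing or unknown edge_type_id get an empty list.
    """
    index = index_source_files(rig)
    for edge in edges:
        yield {**edge, "source_files": index.get(edge.get("edge_type_id"), [])}


# Placeholder RIG content and edges; the file names are made up.
rig = {
    "edge_types": [
        {"id": "ctd:001", "source_files": ["file_a.tsv"]},
        {"id": "ctd:002", "source_files": ["file_a.tsv", "file_b.tsv"]},
    ]
}
edges = [{"edge_type_id": "ctd:002"}, {"edge_type_id": "ctd:999"}]
annotated = list(attach_source_files(edges, rig))
```

The same index can be inverted to answer the maintenance question: given a file that a source has updated, which Edge Types (and therefore which edges) may need review.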
