Dipper Concept Maps
The data model for each Dipper source is currently defined and documented using concept map diagrams (cmaps). These cmaps use a graphical notation and syntax to fully specify the RDF to be generated for one or more exemplar records in the source data. This DIPper cmap notation can be considered a diagrammatic serialization format for RDF data, as it allows complete representation of any RDF graph.
We use the CmapTools application to generate concept maps, but other concept mapping or diagramming tools should be able to reproduce the features of the Dipper notation and syntax.
The figure below illustrates the base graphical notation and syntax, and provides an example of how a simple Dipper cmap diagram (two nodes and an edge) compactly represents an RDF graph comprising seven triples.

- Allow ontologists/data modelers to rapidly prototype and evaluate data models, and share them with Dipper developers
  - The cmaps use a compact syntax for specifying RDF that makes models easier to visualize and understand.
- Provide a target data model specification for a given source
  - The cmaps do this using one or more exemplar records, for which they fully specify the RDF output desired from Dipper (as opposed to defining a formal, computable schema).
  - If there is irregularity across records from the source (e.g. variability in structure or content), multiple examples are diagrammed to account for the spectrum of variation.
  - The cmap notations also indicate which data should be derived from the ingested source, versus data that should come from existing ontologies or data already in SciGraph (e.g. labels of class IRIs used to indicate the type of instances in the data are shown in cmaps for human readability, but 'commented out' so coders know not to pull class labels from the source).
- Optionally provide an informal 'transform specification' for a given source
  - In addition to their informal target model specification function, cmaps can also provide informal transform specifications that describe how to translate source data (CSV, XML, JSON, etc.) to the desired target RDF model (e.g. here).
  - This is currently captured in structured 'annotations' on each node in the cmap, which provide the following types of information:
    - Node Type: The 'meta' type of the node - one of [ named individual | anonymous individual | punned class ]
    - IRI Source: The location in the source data where the identifier needed to build an IRI is found (e.g. which column, element, attribute, etc.).
    - IRI Structure: How to construct an IRI for the entity from a namespace and a unique identifier pulled from the data.
    - IRI Label: How to construct a label for the entity.
    - IRI Type: The source and rules for determining the rdf:type of the entity represented by the IRI.
    - Notes: Additional issues or considerations.
- Provide a 'gold standard' for unit tests to validate Dipper output for a given source (in progress)
  - We are developing a script to convert the cmap to RDF (ttl) for the diagrammed exemplar record, using the text-based export format for cmaps.
  - For example, see the SGD cmap and its derived ttl file here.
  - This ttl can be passed to a SPARQL query and used as a unit test to QA generated data (i.e. to check whether Dipper output contains the exact triples specified in the cmap diagram).
- Provide persistent, granular documentation of the data
  - Useful for developers and users to understand what we pull from each data source, and how it is structured.
    - e.g. Monarch developers writing SPARQL to index data
    - e.g. external developers and users querying via Cypher or SPARQL endpoints, etc.
    - e.g. ...
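As a minimal sketch, the structured node annotations described in the list above (Node Type, IRI Source, IRI Structure, IRI Label, IRI Type, Notes) could be modeled as a small Python data structure. All class names, field names, the `{id}` template convention, and the example values here are hypothetical and are not part of Dipper or of the cmap notation; they only illustrate how an IRI might be assembled from a namespace and an identifier pulled from a source record.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    """The three 'meta' node types named in the annotation list."""
    NAMED_INDIVIDUAL = "named individual"
    ANONYMOUS_INDIVIDUAL = "anonymous individual"
    PUNNED_CLASS = "punned class"


@dataclass
class NodeAnnotation:
    """Hypothetical container for the annotations attached to one cmap node."""
    node_type: NodeType
    iri_source: str     # where the identifier is found (here: a column name)
    iri_structure: str  # template combining a namespace and the identifier
    iri_label: str      # free-text rule for constructing the label
    iri_type: str       # free-text rule for determining rdf:type
    notes: str = ""

    def build_iri(self, record: dict) -> str:
        """Fill the IRI template with the identifier from one source record."""
        return self.iri_structure.format(id=record[self.iri_source])


# Invented example: a gene node whose identifier lives in a 'gene_id' column.
gene_node = NodeAnnotation(
    node_type=NodeType.NAMED_INDIVIDUAL,
    iri_source="gene_id",
    iri_structure="http://example.org/gene/{id}",
    iri_label="value of the 'symbol' column",
    iri_type="fixed class IRI chosen by the data modeler",
)

record = {"gene_id": "G0001", "symbol": "abc1"}
print(gene_node.build_iri(record))
```

Keeping the label and type rules as free text mirrors the informal nature of the transform specification; a real implementation would likely need something more structured.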
Note that cmaps do not provide a computable schema that can be used to automatically and comprehensively validate Dipper output. As noted above, however, they can provide gold standard RDF (ttl) for one or more records to support validation using SPARQL-based unit tests. If the exemplar record(s) for a source cover the full spectrum of variability in the source data, this should in theory provide a reasonably thorough set of unit tests to validate Dipper output.
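The kind of check described above can be sketched in plain Python, assuming the gold standard triples and the Dipper output have already been parsed into (subject, predicate, object) tuples of SPARQL-ready terms. The function names and the example triples are hypothetical, and the cmap-to-ttl conversion script itself is not shown. The sketch also assumes every node is a named IRI or literal, since blank nodes in a SPARQL pattern behave like variables rather than exact terms.

```python
from typing import Iterable, Set, Tuple

# A triple of already-serialized terms, e.g. "<http://...>" or '"a literal"'.
Triple = Tuple[str, str, str]


def build_ask_query(expected: Iterable[Triple]) -> str:
    """Build a SPARQL ASK query that is true only if every expected triple is present."""
    patterns = "\n".join(f"  {s} {p} {o} ." for s, p, o in expected)
    return "ASK {\n" + patterns + "\n}"


def missing_triples(expected: Iterable[Triple], actual: Iterable[Triple]) -> Set[Triple]:
    """Return the gold standard triples that are absent from the generated output."""
    return set(expected) - set(actual)


RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Invented gold standard for one exemplar record.
expected = [
    ("<http://example.org/gene/G0001>", RDFS_LABEL, '"abc1"'),
    ("<http://example.org/gene/G0001>", RDF_TYPE, "<http://example.org/Gene>"),
]

# Invented output that is missing the rdf:type triple.
actual = expected[:1]

print(build_ask_query(expected))
print(missing_triples(expected, actual))
```

The ASK query gives a single pass/fail answer when run against an endpoint, while the set difference reports exactly which expected triples are missing, which tends to be more useful when a test fails.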