Import RDF data into a Knowledge Graph
This web service implements a controlled workflow to import RDF data into the triple store of a knowledge graph. The service is provided as a Docker image but can also be run from source for development and testing.
Development is being funded as part of NFDI4Objects to build the NFDI4Objects Knowledge Graph.
- Usage
- Configuration
- API
- General endpoints
- Terminologies
- GET /terminology
- GET /terminology/{id}
- PUT /terminology/{id}
- DELETE /terminology/{id}
- PUT /terminology/
- GET /terminology/{id}/stage/
- POST /terminology/{id}/receive
- GET /terminology/{id}/receive
- POST /terminology/{id}/load
- GET /terminology/{id}/load
- POST /terminology/{id}/remove
- GET /terminology/namespaces.json
- GET /terminology/skosmos.ttl
- Collections
- GET /collection/
- GET /collection/schema.json
- PUT /collection/
- POST /collection/
- GET /collection/{id}
- PUT /collection/{id}
- DELETE /collection/{id}
- GET /collection/{id}/stage/
- POST /collection/{id}/receive
- GET /collection/{id}/receive
- POST /collection/{id}/load
- GET /collection/{id}/load
- POST /collection/{id}/remove
- Mappings
- GET /mappings/
- GET /mappings/schema.json
- GET /mappings/properties.json
- PUT /mappings/
- POST /mappings/
- GET /mappings/{id}
- PUT /mappings/{id}
- DELETE /mappings/{id}
- POST /mappings/{id}/append
- POST /mappings/{id}/detach
- GET /mappings/{id}/stage/
- POST /mappings/{id}/receive
- GET /mappings/{id}/receive
- POST /mappings/{id}/load
- GET /mappings/{id}/load
- POST /mappings/{id}/remove
- Development
- License
Three kinds of data can be imported separately:
- terminologies such as ontologies and controlled vocabularies (as listed in BARTOC)
- collections of arbitrary RDF data from open research data repositories
- mappings between resources from terminologies
Each terminology, each collection, and each mapping source is imported into an individual named graph.
Importing is controlled via an HTTP API in three steps:
- register: metadata is retrieved, collected in a registry and written to the triple store
- receive: data is retrieved into a stage directory where it is validated, filtered, and a report is generated
- load: processed data is loaded into the triple store
Registration can be undone with the additional step delete. Load and receive can be undone with the step remove. Mappings can also be ingested into and withdrawn from the triple store directly via append/detach to support non-durable live updates.
```mermaid
flowchart LR
  END[ ]:::hidden
  START[ ]:::hidden
  C[ ]:::hidden
  R["**register**"]
  S("**stage**"):::data
  T("**triple store**"):::data
  START -- register --> R
  R -- receive --> S
  S -- load --> T
  T -- remove --> R
  R -- delete --> END
  END ~~~ R
  C ~~~ R
  C <-. append/detach .-> T
  classDef hidden display: none;
  classDef data stroke:#4d8dd1, fill:#D4E6F9;
```
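As an illustration, the import workflow can be written out as plain HTTP method/path pairs. The base URL `http://localhost:5020` and the collection id `1` are assumptions for a local deployment, not fixed by the service:

```python
# Sketch of the import workflow as HTTP method/path pairs.
# BASE and the item id are assumptions for a local test deployment.
BASE = "http://localhost:5020"

def import_workflow(kind: str, item_id: str) -> list[tuple[str, str]]:
    """Return the request sequence for register -> receive -> load."""
    item = f"{BASE}/{kind}/{item_id}"
    return [
        ("PUT", item),                # register: collect metadata into the registry
        ("POST", f"{item}/receive"),  # receive: retrieve data into the stage directory
        ("POST", f"{item}/load"),     # load: move processed data into the triple store
    ]

for method, url in import_workflow("collection", "1"):
    print(method, url)
```

The same sequence applies to terminologies and mapping sources, with `terminology` or `mappings` as path segment.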
The application does not include any methods of authentication. It is meant to be deployed together with the components described in the n4o-graph repository. In particular:
- n4o-fuseki: RDF triple store
- n4o-graph-apis: web interface and public SPARQL endpoint
The knowledge graph is organized in individual named graphs. URIs of most of these graphs are based on a namespace prefix, like `http://example.org/`. The prefix base can be changed by configuration; it is currently set to `https://graph.nfdi4objects.net/` by default.
- metadata about all terminologies is in the graph with URI `http://example.org/terminology/`
- metadata about all collections is in the graph with URI `http://example.org/collection/`
- metadata about all mapping sources is in the graph with URI `http://example.org/mappings/`
- each terminology is imported into a graph named by its BARTOC URI
- each collection is imported into a graph of URI namespace `http://example.org/collection/`, followed by a numeric identifier
- mappings are grouped into mapping sources, each imported into a graph of URI namespace `http://example.org/mappings/`, followed by a numeric identifier
```mermaid
flowchart TD
  collectionMeta -- describe --> collection
  terminologyMeta -- describe --> terminology
  mappingsMeta -- describe --> mappings
  collectionMeta["collection/"]
  collection["collection/{id}"]
  terminology["terminology/{id}"]
  terminologyMeta["terminology/"]
  mappings["mappings/{id}"]
  mappingsMeta["mappings/"]
```
Received data in RDF or JSKOS format must be syntactically valid. Additional constraints are being implemented. Validation errors are returned in Data Validation Error Format. For instance an error like this is emitted when JSON field `url` is not a valid URL:

```json
{
  "message": "'http:/example.org/' does not match '^https?://'",
  "position": {
    "jsonpointer": "/url"
  }
}
```

JSON metadata describing collections and mapping sources is validated with JSON Schemas collection-schema.json and mappings-schema.json, respectively.
JSKOS data (for terminologies and mappings) is not validated yet (see open issue).
RDF data is allowed to contain any absolute IRI references matching the regular expression ``^[a-z][a-z0-9+.-]*:[^<>"{}|^`\\\x00-\x20]*$``. This includes some IRI references invalid in theory but supported by most RDF software in practice. Additional constraints on RDF data do not result in validation errors but malformed triples are filtered out (as described in the following) and collected as part of report files.
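The regular expression can be tried out directly, for instance in Python:

```python
import re

# IRI reference filter from above: a lowercase scheme, a colon, then any
# characters except angle brackets, quotes, braces, pipe, caret, backtick,
# backslash, space, and control characters
ABSOLUTE_IRI = re.compile(r'^[a-z][a-z0-9+.-]*:[^<>"{}|^`\\\x00-\x20]*$')

assert ABSOLUTE_IRI.match("http://example.org/resource")
assert ABSOLUTE_IRI.match("urn:isbn:0451450523")
assert not ABSOLUTE_IRI.match("../relative/reference")   # no scheme
assert not ABSOLUTE_IRI.match("http://example.org/a b")  # contains a space
```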
RDF data is filtered depending on the kind of data and configuration. Triples (aka RDF statements) matching the following criteria are filtered out:

- Triples with relative IRI references and local `file:` URIs
- Statements about registered terminologies in collection data
- ... (RDF filtering is still being worked on and not fully documented yet)
Filtered RDF triples are collected as part of reports.
Reports are generated on receiving and loading data. The final format of reports has not been specified yet.
Reports can be accessed with the following API methods:
- GET /terminology/{id}/receive
- GET /terminology/{id}/load
- GET /collection/{id}/receive
- GET /collection/{id}/load
- GET /mappings/{id}/receive
- GET /mappings/{id}/load
Receiving data generates two additional files in the stage directory (replace `{id}` with the selected numeric identifier):

- `terminology-{id}.nt` / `collection-{id}.nt` / `mappings-{id}.nt`: validated and filtered RDF triples to be imported
- `terminology-{id}-removed.nt` / `collection-{id}-removed.nt` / `mappings-{id}-removed.nt`: triples removed on filtering
The web service and its Docker image can be configured via environment variables:
- `TITLE`: title of the application. Default: `N4O Graph Importer`
- `BASE`: base URI of named graphs. Default: `https://graph.nfdi4objects.net/` (this will be changed to `http://example.org/`)
- `SPARQL`: API endpoint of SPARQL Query protocol, SPARQL Update protocol and SPARQL Graph Store protocol. An in-memory triple store is used as fallback if `SPARQL` is not set.
- `STAGE`: writeable stage directory. Default: `stage`
- `DATA`: local data directory for file import
- `FRONTEND`: URL of an n4o-graph-apis instance. This is included as field `frontend` in /status.json and shown in the HTML interface for convenience. Default is the value of `BASE`.
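How the service might assemble its configuration from these variables can be sketched as follows. The dictionary and defaults mirror the list above, but the code is illustrative, not the actual implementation:

```python
import os

def read_config(env=os.environ) -> dict:
    """Collect configuration from environment variables with documented defaults."""
    base = env.get("BASE", "https://graph.nfdi4objects.net/")
    return {
        "title": env.get("TITLE", "N4O Graph Importer"),
        "base": base,
        "sparql": env.get("SPARQL"),  # None => in-memory triple store fallback
        "stage": env.get("STAGE", "stage"),
        "data": env.get("DATA"),
        "frontend": env.get("FRONTEND", base),  # defaults to the value of BASE
    }

config = read_config({"SPARQL": "http://localhost:3030/n4o"})
print(config["title"], config["sparql"])
```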
If the data directory contains a file bartoc.json with an array of JSKOS records from BARTOC, this file is used as source of terminology metadata instead of BARTOC API. Script update-terminologies in this repository can be used to get a subset from BARTOC, including all terminologies listed in NFDI4Objects.
There is a minimal HTML interface at root path (GET /) to try out the API. This is more useful than an interface generated automatically, for instance with Swagger. The API is not meant to be publicly available (there is no authentication), so there is no need for an OpenAPI document anyway.
An HTTP 400 error with response body in Data Validation Error Format is returned if collection metadata or mapping source metadata does not conform to its corresponding JSON Schema.
Get current information about the application as JSON object. This includes the configuration with lowercase field names and a field `connected` indicating whether the SPARQL API endpoint can be accessed.
List and get files from local data directory.
Terminologies are identified by their BARTOC identifier. Terminologies should be registered before receiving collection data so that use of terminologies in collections can be detected.
Return the list of registered terminologies.
Return metadata of a registered terminology.
Register a terminology or update its metadata from BARTOC. The metadata is directly added to the triple store. Updates may lead to erroneous terminology descriptions because removal of statements is limited to simple triples with the terminology URI as subject!
Unregister a terminology and remove it from stage directory and triple store. This implies DELETE /terminology/{id}/remove.
Replace the list of terminologies by unregistering all and registering a new list. The request body is expected to be a JSON array of objects having key `uri` with the BARTOC URI, like this:
```json
[
  { "uri": "http://bartoc.org/en/node/18274" },
  { "uri": "http://bartoc.org/en/node/20533" }
]
```

Other fields are ignored, so the return value of GET /terminology/ can be used as payload.
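Since other fields are ignored, a previously fetched terminology list can be reduced to a minimal payload like this (a sketch assuming the list shape shown above; the extra `prefLabel` field is only for illustration):

```python
import json

# e.g. the JSON body returned by GET /terminology/
terminologies = [
    {"uri": "http://bartoc.org/en/node/18274", "prefLabel": {"en": "SKOS"}},
    {"uri": "http://bartoc.org/en/node/20533"},
]

# keep only the "uri" key; everything else is ignored by PUT /terminology/
payload = [{"uri": t["uri"]} for t in terminologies]
print(json.dumps(payload, indent=2))
```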
List and get files of the stage directory of a terminology.
Receive terminology data. The location of the data will eventually be extracted from terminology metadata from BARTOC, but this has not been implemented yet. For now, pass query parameter `from` instead with a URL or the name of a file in the data directory. File format can be:
- RDF/Turtle for file extension `.ttl` or `.nt`
- RDF/XML for file extension `.rdf` or `.xml`
- JSKOS as newline delimited JSON for file extension `.ndjson`
- A ZIP archive containing RDF files for file extension `.zip`
Get latest receive report of a terminology.
Load received terminology data into the triple store. If the terminology is in SKOS format (possibly converted from JSKOS) and has title and language information, a configuration file `skosmos.ttl` to be used with Skosmos is also generated in the stage area of the terminology.
Get latest load report of a terminology.
Remove terminology data from the triple store and from the staging area. The terminology will still be registered and its metadata is not removed from the triple store.
Return registered URI namespaces forbidden to be used in RDF subjects. The result is a JSON object with terminology URIs as keys and namespaces as values. For instance the SKOS (http://bartoc.org/en/node/18274) namespace is http://www.w3.org/2004/02/skos/core# so RDF triples with subjects in this namespace can only be added to the knowledge graph via /terminology/18274.
Return Skosmos vocabulary configuration for imported (J)SKOS vocabularies.
Collections are described in a custom JSON format defined by JSON Schema collection-schema.json. This JSON data is enriched with field `id` and internally converted to RDF for import into the knowledge graph. In its simplest form, a collection should contain a name, a URL, and a license:

```json
{
  "name": "test collection",
  "url": "https://example.org/",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/"
}
```

When registered, the collection is assigned an id and a corresponding URI.
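A minimal client-side sanity check before registration could look like this. The three fields follow the example above; the authoritative rules are in collection-schema.json, and the helper function is hypothetical:

```python
def missing_fields(collection: dict) -> list[str]:
    """Return the minimal descriptive fields that are absent."""
    return [key for key in ("name", "url", "license") if key not in collection]

collection = {
    "name": "test collection",
    "url": "https://example.org/",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}
print(missing_fields(collection))              # []
print(missing_fields({"name": "incomplete"}))  # ['url', 'license']
```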
Return the list of registered collections (metadata only).
Return the JSON Schema used to validate collection metadata. See file collection-schema.json. Collection field `id` is required by the schema but it gets assigned automatically in most cases.
Replace the list of collections by unregistering all and registering a new list of collections.
Register a new collection or update metadata of a registered collection.
Return metadata of a specific registered collection.
Update metadata of a specific registered collection or register a new collection.
Unregister a collection and remove it from the triple store and staging area. This implies DELETE /collection/{id}/remove.
List and get files of the stage directory of a collection.
Receive and process collection data. The location of the data is taken from collection metadata field `access` if present. The location can be overridden with optional query parameter `from` with a URL or a file name from the local data directory. File format can be:
- RDF/Turtle for file extension `.ttl` or `.nt`
- RDF/XML for file extension `.rdf` or `.xml`
- A ZIP archive containing RDF files for file extension `.zip`
Get latest receive report of a collection.
Load received and processed collection data into the triple store.
Get latest load report of a collection.
Remove collection data from the triple store and from the staging area. The collection will still be registered and its metadata is not removed from the triple store.
Mappings are grouped in mapping sources, which correspond to concordances or lists of mappings.
Return the list of registered mapping sources.
Return the mapping sources schema mappings-schema.json used to validate mapping sources.
Get a list of supported mapping properties. By default this is the list of SKOS Mapping properties plus owl:sameAs, owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, and rdfs:subPropertyOf.
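The default set can be represented as property URIs for a quick membership check. The five SKOS *Match properties plus the OWL and RDFS properties named above are listed here; consult GET /mappings/properties.json for the authoritative list:

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

MAPPING_PROPERTIES = {
    SKOS + "closeMatch", SKOS + "exactMatch", SKOS + "broadMatch",
    SKOS + "narrowMatch", SKOS + "relatedMatch",
    OWL + "sameAs", OWL + "equivalentClass", OWL + "equivalentProperty",
    RDFS + "subClassOf", RDFS + "subPropertyOf",
}

print(OWL + "sameAs" in MAPPING_PROPERTIES)      # True
print(SKOS + "prefLabel" in MAPPING_PROPERTIES)  # False (not a mapping property)
```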
Register a list of mapping sources. All existing mapping sources and mappings will be deleted.
Register a new mapping source or update metadata of a mapping source.
Return a specific mapping source.
Update metadata of a specific mapping source.
Unregister a mapping source and remove it from the triple store and staging area. This implies DELETE /mappings/{id}/remove.
Directly add mappings to the triple store, bypassing the receive/load workflow. Directly added triples are not stored in the staging area, so they will not survive a load operation of the selected mapping source. Expects a JSON array with mappings in JSKOS format.
Directly remove mappings from the triple store. This operation is not reflected in the staging area, so it will not survive a load operation of the selected mapping source. Expects a JSON array with mappings in JSKOS format. Non-existing mappings are ignored.
List and get files of the stage directory of a mapping source.
Receive and process mappings from a mapping source. The location of the data is taken from mapping source field `access` if present. The location can be overridden with optional query parameter `from` with a URL or a file name from the local data directory. The file format is derived from the file name extension, unless explicitly specified in metadata field `access.format`. Mappings can be given as:
- plain RDF triples in Turtle syntax (extension `.nt` or `.ttl`)
- plain RDF triples in RDF/XML syntax (extension `.rdf` or `.xml`)
- newline delimited JSON with JSKOS Concept Mappings (extension `.ndjson`). Only 1-to-1 mappings are included.

Mapping metadata such as date of creation and annotations is ignored.
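The restriction to 1-to-1 mappings can be sketched as follows, assuming JSKOS Concept Mappings with `from`/`to` concept bundles using `memberSet` (the helper function is illustrative, not the actual implementation):

```python
def is_one_to_one(mapping: dict) -> bool:
    """True if the JSKOS mapping relates exactly one concept to one concept."""
    return all(
        len(mapping.get(side, {}).get("memberSet", [])) == 1
        for side in ("from", "to")
    )

mapping = {
    "from": {"memberSet": [{"uri": "http://example.org/a"}]},
    "to": {"memberSet": [{"uri": "http://example.org/b"}]},
}
print(is_one_to_one(mapping))  # True
print(is_one_to_one({"from": {"memberSet": []}, "to": {"memberSet": [{}]}}))  # False
```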
Get latest receive report of a mapping source.
Load received and processed mappings into the triple store.
Get latest load report of a mapping source.
Remove mappings of a specific mapping source from the triple store and from the staging area. The mapping source will still be registered and its metadata is not removed from the triple store.
Requires a basic development toolchain (`sudo apt install build-essential`) and Python 3 with module `venv` to be installed.
- `make deps` installs Python dependencies in a virtual environment in directory `.venv`
- `make test` runs a test instance of the service with a temporary triple store
- `make start` runs the service without restarting
- `make api` runs the service with automatic restarting (requires installing Node module `nodemon` with `npm install`)
- `make lint` checks coding style
- `make fix` cleans up some coding style violations
Best use the Docker image n4o-fuseki to start a triple store configured to be used with the importer:

```shell
docker run --rm -p 3030:3030 ghcr.io/nfdi4objects/n4o-fuseki:main
```

To also inspect the content of the triple store, use n4o-graph-apis.
TODO: add description how to run both together
The Docker image of n4o-graph-importer is automatically built on GitHub. To locally build and run the image for testing:

```shell
docker image build -t grimpo .
IMPORTER_IMAGE=grimpo docker compose run --rm -p 5020:5020 importer
```

TODO: add description how to also run triple store and apis
Licensed under Apache License 2.0.