R2RLint is a tool for assessing the quality of RDB2RDF mappings and the resulting RDF data. It currently comes with 43 implemented metrics that can be switched on individually and configured with a threshold to customize your RDB2RDF quality assessment.
To install R2RLint, run the following commands:

- Get the source code from GitHub:

  ```
  git clone https://github.com/AKSW/R2RLint.git
  ```

- Go to the Git repository directory and run `install.sh`:

  ```
  cd R2RLint
  ./install.sh
  ```
Before running the assessment, two configuration steps need to be taken: first, the assessment environment needs to be set up; afterwards, the metrics to run are configured.
To configure the assessment environment, the file `etc/environment.properties` needs to be edited. The main configuration options are the following.
These options contain settings for the relational database that is mapped to RDF. The options are:

- `rdb.host`: the hostname or IP address of the host the database management system is running on
- `rdb.port`: the TCP port the database management system is listening on
- `rdb.dbName`: the name of the database
- `rdb.user`: a user to access the database
- `rdb.password`: the password of the database user
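Put together, the database section of `etc/environment.properties` might look like the following sketch (host, port, database name, and credentials are placeholder values you must replace with your own):

```properties
# Connection settings for the relational database that is mapped to RDF
# (all values below are example placeholders)
rdb.host = localhost
rdb.port = 5432
rdb.dbName = mydb
rdb.user = r2rlint
rdb.password = secret
```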
To run the assessment, an N-Triples dump of the generated RDF has to be provided. Additionally, the URL of a SPARQL endpoint can be provided, which allows SPARQL queries to be run against the dataset without it having to fit into your main memory. The configuration options are:
- `dataset.dumpFilePath`: the path to a file containing N-Triples of the generated RDF
- `dataset.serviceUri` (optional): the URI pointing to the SPARQL endpoint to use, e.g. `http://dbpedia.org/sparql`
- `dataset.graphIri` (optional): the graph to use for the assessment (note: the consideration of single graphs is not implemented yet)
- `dataset.usedPrefixes`: since some metrics have to refer to the vocabularies used in the dataset, their prefixes have to be configured explicitly; the prefixes have to be given as comma-separated values
- `dataset.prefixes`: some metrics need to know whether a certain resource is local or external; thus the set of local prefixes is required, given as comma-separated values
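A dataset section could then look like this sketch (the file path, endpoint URL, and prefix URIs are illustrative examples, not defaults shipped with R2RLint):

```properties
# Dataset settings (example values)
dataset.dumpFilePath = /tmp/dump.nt
dataset.serviceUri = http://localhost:8890/sparql
dataset.usedPrefixes = http://xmlns.com/foaf/0.1/,http://www.w3.org/2000/01/rdf-schema#
dataset.prefixes = http://example.org/resource/
```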
The following options point to files needed to run an assessment:

- `views.viewDefsFilePath`: the file containing the SML mapping definitions to assess
- `metrics.settingsFilePath`: the path to the properties file containing the metrics settings introduced below
- `views.typeAliasFilePath`: a file containing type mappings (usually no changes to the provided `type-map.h2.tsv` file are needed)
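For example (the mapping file path is a hypothetical placeholder; the other two paths follow the file names mentioned in this document):

```properties
# File locations needed for an assessment run
views.viewDefsFilePath = ./mappings/views.sml
metrics.settingsFilePath = etc/metrics.properties
views.typeAliasFilePath = etc/type-map.h2.tsv
```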
An assessment sink is the actual target the evaluated quality scores with respect to a given metric are written to. Due to the modularity of the R2RLint framework, sinks can be added without much wiring effort and without the need to know the framework internals. The sinks currently implemented are introduced in dedicated sections below. The configuration options are sink-specific and thus discussed in the corresponding sink section.
The settings regarding the actual metrics to run can be found in the file `etc/metrics.properties`. R2RLint provides metrics for the following quality dimensions:
- availability
- completeness
- conciseness
- consistency
- interlinking
- interoperability
- interpretability
- performance
- relevancy
- representational conciseness
- semantic accuracy
- syntactic validity
- understandability
To enable a certain dimension for the assessment, its value in the `etc/metrics.properties` file has to be switched to `yes`, e.g.

```
semantic_accuracy = yes
```
Only if a dimension is enabled are its metrics considered. To also activate certain metrics, their values have to be switched to `yes`, too, e.g.

```
semantic_accuracy.preservedFkeyConstraint = yes
```
Accordingly, in an assessment run only those metrics are applied that

- belong to an activated quality dimension
- are activated themselves
Besides the activation, a threshold can be set per metric. If a threshold is set, only those quality scores (and the corresponding metadata) whose score is lower than the configured threshold are written to the sink. To configure a threshold value, a property named `<dimension>.<metric>.threshold` has to be added, e.g.:

```
consistency.homogeneousDatatypes.threshold = 0.95
```
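Putting the pieces together, a minimal `etc/metrics.properties` fragment could look like the following sketch. The dimension and metric names are taken from the examples above; activating `consistency.homogeneousDatatypes` alongside its threshold is an assumption, following the activation rules just described:

```properties
# Enable two dimensions and one metric each, plus a reporting threshold
semantic_accuracy = yes
semantic_accuracy.preservedFkeyConstraint = yes

consistency = yes
consistency.homogeneousDatatypes = yes
consistency.homogeneousDatatypes.threshold = 0.95
```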
In this section the sink implementations provided by R2RLint are introduced.
The RDB sink is a measure data sink that writes the actual quality scores and metadata to a relational database. To set up such a sink, the following configuration options have to be provided in the `etc/environment.properties` file:
- `rdbSink.host`: the hostname or IP address of the host the database management system is running on
- `rdbSink.port`: the TCP port the database management system is listening on
- `rdbSink.dbName`: the name of the database
- `rdbSink.user`: a user to access the database
- `rdbSink.password`: the password of the database user
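A corresponding fragment of `etc/environment.properties` might look like this (all values are example placeholders; the sink database can of course differ from the assessed source database):

```properties
# RDB sink: where quality scores and metadata are written
# (all values below are example placeholders)
rdbSink.host = localhost
rdbSink.port = 5432
rdbSink.dbName = r2rlint_results
rdbSink.user = r2rlint
rdbSink.password = secret
```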
The sink reflects the class structure of the R2RLint framework and creates the following tables:

The `measure_datum` table holds the actual quality scores and corresponding metadata like the ID of the assessment run, the metric name, a timestamp etc.
| column | type |
|---|---|
| id | bigint PRIMARY KEY |
| dimension | varchar(400) |
| metric | varchar(400) |
| value | real NOT NULL |
| assessment_id | bigint NOT NULL |
| timestamp | timestamp default current_timestamp |
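The column definitions above correspond to a `CREATE TABLE` statement roughly like the following sketch; the exact DDL used by R2RLint may differ:

```sql
-- Sketch of the measure_datum table as created by the RDB sink
CREATE TABLE measure_datum (
    id            bigint PRIMARY KEY,
    dimension     varchar(400),
    metric        varchar(400),
    value         real NOT NULL,
    assessment_id bigint NOT NULL,
    timestamp     timestamp DEFAULT current_timestamp
);
```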
The `node` table holds nodes (i.e. resources or literals) that were reported in the assessment.

| column | type |
|---|---|
| id | bigint PRIMARY KEY |
| name | varchar(300) NOT NULL |
N:M table linking between `measure_datum` and `node`:

| column | type |
|---|---|
| measure_datum_id | bigint REFERENCES measure_datum(id) |
| node_id | bigint REFERENCES node(id) |
| assessment_id | bigint |
The `triple` table holds triples that were reported in the assessment.

| column | type |
|---|---|
| id | bigint PRIMARY KEY |
| subject | varchar(500) |
| predicate | varchar(500) |
| object | varchar(3000) |
N:M table linking between `measure_datum` and `triple`:

| column | type |
|---|---|
| measure_datum_id | bigint REFERENCES measure_datum(id) |
| triple_id | bigint REFERENCES triple(id) |
| assessment_id | bigint |
Sometimes, triples are reported in the assessment because of a certain characteristic of their subject, predicate or object. In these cases, the triple can be reported to the `node_triple` table with an additional hint indicating which triple position caused the quality violation.

| column | type |
|---|---|
| id | bigint PRIMARY KEY |
| position | varchar(20) |
| subject | varchar(300) |
| predicate | varchar(300) |
| object | varchar(3000) |
N:M table linking between `measure_datum` and `node_triple`:

| column | type |
|---|---|
| measure_datum_id | bigint REFERENCES measure_datum(id) |
| node_triple_id | bigint REFERENCES node_triple(id) |
| assessment_id | bigint |
The `view_definition` table holds all view definitions that were reported in an assessment.

| column | type |
|---|---|
| id | bigint PRIMARY KEY |
| name | varchar(200) |
| mapping_sql_op | varchar(3000) |
| mapping_definitions | varchar(3000) |
N:M table linking between `measure_datum` and `view_definition`:

| column | type |
|---|---|
| measure_datum_id | bigint REFERENCES measure_datum(id) |
| view_definition_id | bigint REFERENCES view_definition(id) |
| assessment_id | bigint |
The `quad` table contains view definitions' quads that are reported in an assessment run.

| column | type |
|---|---|
| id | bigint PRIMARY KEY |
| graph | varchar(300) |
| subject | varchar(300) |
| predicate | varchar(300) |
| object | varchar(3000) |
There are also N:M tables linking between `measure_datum` and `quad`, between `triple` and `quad`, as well as between `view_definition` and `quad`. These follow the scheme of the N:M tables introduced so far.
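Given this schema, the reported triples for a particular metric could be retrieved with a join like the following sketch. Note that the name of the N:M linking table (`measure_datum__triple` here) is an assumption for illustration; check the actual table names the sink creates in your database:

```sql
-- Retrieve reported triples together with their quality scores for one metric
-- (the linking table name measure_datum__triple is hypothetical)
SELECT md.dimension, md.metric, md.value,
       t.subject, t.predicate, t.object
FROM measure_datum md
JOIN measure_datum__triple mdt ON mdt.measure_datum_id = md.id
JOIN triple t ON t.id = mdt.triple_id
WHERE md.metric = 'homogeneousDatatypes';
```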