Meta data
The meta data is stored in a MongoDB database. It is divided into two parts:
- a specification of our internal schema
- the set of used sources
Our internal schema is based on the multigraph model: we have a set of entities (books, authors, publishers, ...). Every entity is an instance of a specific type (book, author, ...) and can have a set of properties (title, name, numbers). Entities can be linked by relations (writtenBy, publishedBy, ...), which are directed edges. Properties on relations are not supported, and neither is inheritance.
The specification of our schema is saved as a set of object type definitions and is deliberately minimal. For example, a definition could have the following form:
[
  {
    "_id": <MongoDB id>,
    "name": "Book",
    "properties": ["title", "year", "genre", "isbn"],
    "links": ["writtenBy", "publishedBy"],
    "displayAs": ["$title"],
    "equality": [
      "or", [
        ["=", "isbn"],
        ["and", [["=", "title"], ["=", "year"]]]
      ]
    ]
  },
  {
    "_id": <MongoDB id>,
    "name": "Person",
    "properties": ["birthday", "firstname", "surname"],
    "displayAs": ["$surname", ", ", "$firstname"],
    "links": ["wrote", "published"]
  }
]
Every object type is identified by its MongoDB id, _id. The name is not used as an identifier, because renaming a type would then have to be propagated to every other entry in the meta data repository that links to it. name is a human-readable name used only for display purposes. properties enumerates the available properties, and links enumerates the available relations in the same way. displayAs specifies how the title of an object is formed. This title is shown, for example, as the link text when other entities link to this object.
No assumptions about the type or cardinality of entities/relations have been made so far, and we do not plan to add support for this, since it is not among the objectives of this project. Nevertheless, the example above contains a draft of how this kind of meta data could be added.
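As an illustration of how a displayAs entry could be turned into a title, here is a minimal Python sketch. It assumes that entries starting with "$" refer to a property of the object and that all other entries are literal text; this reading is inferred from the example above and is not a confirmed part of the implementation. The function name render_display and the handling of missing properties are hypothetical.

```python
def render_display(type_def, obj):
    """Build the display title of obj from the displayAs list of its type."""
    parts = []
    for token in type_def["displayAs"]:
        if token.startswith("$"):
            # "$surname" is read as a reference to the property "surname".
            # Assumption: a missing property renders as an empty string.
            parts.append(str(obj.get(token[1:], "")))
        else:
            # Everything else is literal text, e.g. the ", " separator.
            parts.append(token)
    return "".join(parts)


person_type = {
    "name": "Person",
    "properties": ["birthday", "firstname", "surname"],
    "displayAs": ["$surname", ", ", "$firstname"],
}
author = {"firstname": "Joanne", "surname": "Rowling"}

print(render_display(person_type, author))  # Rowling, Joanne
```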
In addition, a definition of the equality relation is provided. This definition is used to merge duplicates. In the example above, two books are defined to be equal if they have the same isbn, or the same title and year. So far, only exact matches are supported. Most likely we will later switch to a fuzzy equality based on the Levenshtein distance. Our current approach for defining identity has two obvious drawbacks:
- Identity based on relations is not supported. This problem cannot be resolved easily, since defining the equality of an entity in terms of the equality of related entities can easily lead to infinite recursion. So far we do not have a good approach for solving this issue.
- Determining the equality of field values might be non-trivial. For example, the two author names "Rowling, J.K." and "Joanne K. Rowling" refer to the same person, but a simple check for string equality will not reveal this. Even the Levenshtein distance does not solve this problem.
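The exact-match equality definition described above can be evaluated recursively. The following Python sketch shows one possible way to do this for the rule "same isbn, or same title and year". The function name is_equal, the representation of entities as plain dictionaries, and the rule that a missing property never matches are assumptions made for this illustration.

```python
def is_equal(expr, a, b):
    """Evaluate an equality expression of the form [operator, argument]."""
    op, arg = expr
    if op == "=":
        # Exact match on one property.
        # Assumption: a property missing on either side never matches.
        return arg in a and arg in b and a[arg] == b[arg]
    if op == "and":
        return all(is_equal(sub, a, b) for sub in arg)
    if op == "or":
        return any(is_equal(sub, a, b) for sub in arg)
    raise ValueError(f"unknown operator: {op!r}")


# Same isbn, or same title and year.
book_equality = ["or", [
    ["=", "isbn"],
    ["and", [["=", "title"], ["=", "year"]]],
]]

# Placeholder values, invented for this example.
book_a = {"title": "Emma", "year": 1815, "isbn": "isbn-1"}
book_b = {"title": "Emma", "year": 1815}
book_c = {"title": "Emma", "year": 1996, "isbn": "isbn-1"}
book_d = {"title": "Emma", "year": 1996}

print(is_equal(book_equality, book_a, book_b))  # True (title and year match)
print(is_equal(book_equality, book_a, book_c))  # True (isbn matches)
print(is_equal(book_equality, book_a, book_d))  # False
```

Because the expression only refers to properties of the two entities being compared, the recursion is bounded by the depth of the expression itself. The infinite recursion mentioned above would only arise once relations to other entities are taken into account.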
For every source, an adapter is used to access the information it contains. Additional configuration can be passed to the adapter. Hence the same adapter (e.g. a SPARQL adapter) can be reused for multiple sources by adjusting the configuration.
In addition, the mapping from the source's schema to the global schema is saved for each source. Saving the mapping together with the source ensures that the mapping is deleted as soon as the source is removed.
Example:
[
  {
    "_id": <MongoDB id>,
    "name": "DBpedia",
    "adapter": {
      "name": "sparql-client",
      "config": {
        "endpoint": "http://dbpedia.org/sparql"
      }
    },
    "mapping": {
      "<MongoDB id of book entity type>": {
        "local_type": "http://schema.org/Book",
        "fields": {
          "title": "http://dbpedia.org/fields/bookTitle",
          "isbn": "http://dbpedia.org/fields/isbnNumber"
        },
        "links": {
          "writtenBy": "http://dbpedia.org/relations/isAuthor"
        }
      }
    }
  }
]
(The DBpedia URLs are invented and only serve this explanation.)
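To illustrate how such a mapping entry might be applied, the following Python sketch renames the fields and links of a single source record to their names in the global schema. The shape of the source record (a dictionary keyed by the source's field names), the shape of the result, the function name to_global and the placeholder type id are assumptions for this example and do not describe the actual adapter interface.

```python
def to_global(mapping, type_id, record):
    """Rename the fields and links of one source record to the global schema."""
    type_map = mapping[type_id]
    result = {"type": type_id, "properties": {}, "links": {}}
    for global_name, local_name in type_map.get("fields", {}).items():
        if local_name in record:
            result["properties"][global_name] = record[local_name]
    for global_name, local_name in type_map.get("links", {}).items():
        if local_name in record:
            result["links"][global_name] = record[local_name]
    # Source fields that do not appear in the mapping are dropped.
    return result


BOOK_TYPE_ID = "book-type-id"  # stands in for the MongoDB id of the book type

mapping = {
    BOOK_TYPE_ID: {
        "local_type": "http://schema.org/Book",
        "fields": {
            "title": "http://dbpedia.org/fields/bookTitle",
            "isbn": "http://dbpedia.org/fields/isbnNumber",
        },
        "links": {"writtenBy": "http://dbpedia.org/relations/isAuthor"},
    }
}

# Hypothetical record as an adapter might return it.
record = {
    "http://dbpedia.org/fields/bookTitle": "Emma",
    "http://dbpedia.org/relations/isAuthor": "source-id-of-author",
    "http://dbpedia.org/fields/unmapped": "ignored",
}

print(to_global(mapping, BOOK_TYPE_ID, record))
```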
The meaning of _id and name should be obvious. adapter.name specifies the name of the adapter for this source. This name is mapped directly to the file name of the NodeJS module which contains the adapter. adapter.config contains additional configuration, which is stored in an adapter-specific schema and passed directly to the adapter.
mapping contains information about the mapping from the local schema to the global schema. So far, only the renaming of fields is supported. The following problems still need to be solved:
- Fields with different granularities: for example, some sources split the author's name into first name and surname, while others only provide a single field containing both. We still need a way to change the granularity of entries, either by joining multiple source fields into one field of the global schema or by splitting one source field into multiple fields of the global schema.
- Differences in the graph structure: some sources store the author's name as a property of a book, while others add a link to an author and store the name as a property of this linked node.
- The inverse case must be taken into account as well. For example, we want to store keywords for every book. While our XML source provides these keywords as a property of a book (as intended), DBpedia links to an object representing the topic. We have to extract the title of this linked topic and add it as a property value to our book.
We will most likely ignore the first issue. A draft of a solution for the second and third issues can be found below, although we do not consider this approach mature yet.
The following snippet is just a draft...
"mapping": {
  "<MongoDB id of book object type>": {
    "local_type": "http://dbpedia.org/ontology/book",
    "fields": {
      "p1": "http://dbpedia.org/fields/bookTitle",
      "p4": "http://dbpedia.org/fields/isbnNumber"
    },
    "links": {
      "1": "http://dbpedia.org/relations/isAuthor"
    },
    "artificial_links": [
      {
        "link": "<id of a link of the book type>",
        "linked_type": "<id of linked object type>",
        "fields": {
          "<id of a property of the linked object>": some expression... no idea...,
          <list of fields of properties>
        }
      }
    ],
    "artificial_fields": {
      "<id of field 'genre'>": ["follow:http://dbpedia.org/relations/topic", "property:http://dbpedia.org/fields/name"],
      <additional artificial fields>
    }
  }