02. Adapting the data model

Joe Raad edited this page Aug 27, 2021 · 11 revisions

This wiki guides you through the steps to set up your data to link person observations. After these steps you should be able to use burgerLinker. The required output is an RDF file, describing your data according to the Civil Registries Schema (CIV).

1. Data model description

In this section, we use the following prefixes: civ: for https://iisg.amsterdam/id/civ/ and schema: for http://schema.org/.

To use burgerLinker, you must represent your data in RDF according to the CIV model (shown in the figure below).

This data model is composed of three parts:

  • Person (blue): This part consists only of the class schema:Person, representing the individuals described in the civil registries. An instance of this class must have a unique identifier (civ:personID), a first name (schema:givenName), and a last name (schema:familyName). All three properties are required for linking persons. In addition, to improve the accuracy and speed of linking, it is recommended to add the gender (schema:gender) of every individual.

  • Events (green): We distinguish three types of events: civ:Birth, civ:Marriage, and civ:Death. All three are sub-types of the general class civ:Event, which means that they inherit its properties: each instance of civ:Birth, civ:Marriage, or civ:Death can have the five relations associated with civ:Event. Of these five relations, only two are required for linking: a unique event/registration identifier (civ:registrationID) and the date of the event (civ:eventDate). The remaining three optional relations indicate the date of registration (civ:registrationDate), the registration location (civ:registrationLocation), and the event location (civ:eventLocation). The model distinguishes between the date/location of an event and the date/location of its registration in the civil registries, as a civil registration can be produced on a different date and at a different location from where the life event actually happened.

    In addition to these five general relations, each of the three event types has its own relations:

    • civ:Birth: An instance of this class can have three properties: civ:newborn, civ:mother, and civ:father. For linking, all information regarding the newborn must be present in a birth event, in addition to at least one of their parents.

    • civ:Marriage: An instance of this class can have six properties: civ:bride, civ:motherBride, civ:fatherBride, civ:groom, civ:motherGroom, and civ:fatherGroom. For linking, all information regarding the bride and groom must be present in a marriage event, in addition to at least one parent for each of the bride and groom.

    • civ:Death: An instance of this class can have four properties: civ:deceased, civ:partner, civ:mother, and civ:father. For linking, all information regarding the deceased must be present in a death event, in addition to at least one of their parents.

  • Location (yellow): The final part describes the location where each life event happened and the location where it was registered. This part can include the municipality, the province, the region, and the country. It is completely optional, as none of the information regarding the locations of the events and their registrations is used for linking.

Civil Registries Schema (CIV)
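
The linking requirements described above can be checked before converting any data. The following is a minimal sketch in Python, not part of burgerLinker: the dictionary layout of an event (the keys type, persons, and so on) and the function name missing_for_linking are assumptions made for illustration, and "all information" about a person is interpreted here as the three required person properties.

```python
# Person properties that the Person part lists as required for linking.
REQUIRED_PERSON_FIELDS = ("personID", "givenName", "familyName")

# Per event type: roles that must be present, and groups of roles
# of which at least one must be present (the "at least one parent" rule).
EVENT_RULES = {
    "Birth": {"required": ["newborn"], "one_of": [["mother", "father"]]},
    "Marriage": {
        "required": ["bride", "groom"],
        "one_of": [["motherBride", "fatherBride"], ["motherGroom", "fatherGroom"]],
    },
    "Death": {"required": ["deceased"], "one_of": [["mother", "father"]]},
}


def missing_for_linking(event):
    """Return a list of problems; an empty list means the event looks linkable."""
    problems = []
    # Only these two of the five general event relations are required.
    for key in ("registrationID", "eventDate"):
        if event.get(key) in (None, ""):
            problems.append(f"missing {key}")
    rules = EVENT_RULES.get(event.get("type"))
    if rules is None:
        problems.append(f"unknown event type: {event.get('type')!r}")
        return problems
    persons = event.get("persons", {})
    for role in rules["required"]:
        if role not in persons:
            problems.append(f"missing {role}")
    for group in rules["one_of"]:
        if not any(role in persons for role in group):
            problems.append("need at least one of: " + ", ".join(group))
    # Every person that is present needs the required person properties.
    for role, person in persons.items():
        for field in REQUIRED_PERSON_FIELDS:
            if person.get(field) in (None, ""):
                problems.append(f"{role} is missing {field}")
    return problems


# Hypothetical in-memory representation of a birth event.
birth = {
    "type": "Birth",
    "registrationID": "1",
    "eventDate": "1835-10-28",
    "persons": {
        "newborn": {"personID": "1", "givenName": "paulina", "familyName": "boven"},
        "mother": {"personID": "2", "givenName": "johanna pieternella", "familyName": "vermeulen"},
    },
}
print(missing_for_linking(birth))
```

Gender and locations are deliberately not checked, since the model treats them as recommended or optional.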

2. Data model example

In this section, we rely on the following example dataset for showing how such a tabular dataset can be represented as an RDF graph according to the CIV data model. This tabular dataset is composed of five rows, describing two different events:

  • Event-1: The first three rows represent the birth event of the newborn "Paulina Boven", born on October 28, 1835. Both her mother "Johanna Pieternella Vermeulen" and her father "Petrus Boven" are present in this birth registration.

  • Event-2: The bottom two rows represent the death event of the deceased "Paulina Maria Bovem", deceased on February 15, 1900. Only her mother "Joanna Vermeulen" is present in this death registration.

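| row | registrationID | registrationType | eventDate | personID | role | givenName | lastName | gender |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | Birth | 1835-10-28 | 1 | newborn | paulina | boven | f |
| 2 | 1 | Birth | 1835-10-28 | 2 | mother | johanna pieternella | vermeulen | f |
| 3 | 1 | Birth | 1835-10-28 | 3 | father | petrus | boven | m |
| 4 | 2 | Death | 1900-02-15 | 4 | deceased | paulina maria | bovem | f |
| 5 | 2 | Death | 1900-02-15 | 5 | mother | joanna | vermeulen | f |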

In the following, we show how such a dataset can be represented as an RDF graph according to the CIV model described above.

General guidelines:

  • The mydata prefix refers to an example namespace. When creating your RDF data, it is recommended to replace it with a namespace from your institution (e.g. @prefix mydata: <https://myinstitution.com/dataset/> .)
  • Every event, regardless of its type, must have a unique identifier (i.e. a unique IRI and a unique value for the property civ:registrationID)
  • Every person must have a unique identifier (i.e. a unique IRI and a unique value for the property civ:personID)
  • Dates in RDF are expressed using the following format: YYYY-MM-DD (e.g. 1835-10-28)

@prefix civ: <https://iisg.amsterdam/id/civ/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .
@prefix mydata: <https://mydata.com/example/dataset/> .

mydata:event-1 rdf:type civ:Birth ;
               civ:registrationID "1"^^xsd:integer ;
               civ:eventDate "1835-10-28"^^xsd:date ;
               civ:newborn mydata:person-1 ;
               civ:mother mydata:person-2 ;
               civ:father mydata:person-3 .            

mydata:event-2 rdf:type civ:Death ;
               civ:registrationID "2"^^xsd:integer ;
               civ:eventDate "1900-02-15"^^xsd:date ;
               civ:deceased mydata:person-4 ;
               civ:mother mydata:person-5 .

mydata:person-1 rdf:type schema:Person ;
               civ:personID "1"^^xsd:integer ;
               schema:givenName "paulina"^^xsd:string ;
               schema:familyName "boven"^^xsd:string ;
               schema:gender "f"^^xsd:string . 

mydata:person-2 rdf:type schema:Person ;
               civ:personID "2"^^xsd:integer ;
               schema:givenName "johanna pieternella"^^xsd:string ;
               schema:familyName "vermeulen"^^xsd:string ;
               schema:gender "f"^^xsd:string .

mydata:person-3 rdf:type schema:Person ;
               civ:personID "3"^^xsd:integer ;
               schema:givenName "petrus"^^xsd:string ;
               schema:familyName "boven"^^xsd:string ;
               schema:gender "m"^^xsd:string .

mydata:person-4 rdf:type schema:Person ;
               civ:personID "4"^^xsd:integer ;
               schema:givenName "paulina maria"^^xsd:string ;
               schema:familyName "bovem"^^xsd:string ;
               schema:gender "f"^^xsd:string .

mydata:person-5 rdf:type schema:Person ;
               civ:personID "5"^^xsd:integer ;
               schema:givenName "joanna"^^xsd:string ;
               schema:familyName "vermeulen"^^xsd:string ;
               schema:gender "f"^^xsd:string .

We used the Turtle syntax (.ttl file) to describe these example civil registries, but other RDF syntaxes can also be used. This article provides a nice introduction to the different available RDF syntaxes.

For example, here's how event-1 can be described using the N-Triples syntax, which is probably more convenient when manually writing CSV-to-RDF conversion scripts:

<https://mydata.com/example/dataset/event-1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://iisg.amsterdam/id/civ/Birth> .
<https://mydata.com/example/dataset/event-1> <https://iisg.amsterdam/id/civ/registrationID> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<https://mydata.com/example/dataset/event-1> <https://iisg.amsterdam/id/civ/eventDate> "1835-10-28"^^<http://www.w3.org/2001/XMLSchema#date> .
<https://mydata.com/example/dataset/event-1> <https://iisg.amsterdam/id/civ/newborn> <https://mydata.com/example/dataset/person-1> .
<https://mydata.com/example/dataset/event-1> <https://iisg.amsterdam/id/civ/mother> <https://mydata.com/example/dataset/person-2> .
<https://mydata.com/example/dataset/event-1> <https://iisg.amsterdam/id/civ/father> <https://mydata.com/example/dataset/person-3> .
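
Because every N-Triples line is a complete statement, a conversion script can emit triples row by row. The sketch below is an illustration only and is not the CLARIAH conversion script: it assumes a comma-separated file with the column names of the example table, and the helper names (iri, literal, rows_to_ntriples) are hypothetical.

```python
import csv
import io

CIV = "https://iisg.amsterdam/id/civ/"
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "https://mydata.com/example/dataset/"  # example namespace, replace with your own


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype):
    """Build a typed literal, escaping the characters N-Triples reserves."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"^^<{XSD}{datatype}>'


def rows_to_ntriples(rows):
    """Yield one N-Triples line per statement; event-level triples are emitted once per registration."""
    seen_events = set()
    for row in rows:
        event = iri(f"{BASE}event-{row['registrationID']}")
        person = iri(f"{BASE}person-{row['personID']}")
        if row["registrationID"] not in seen_events:
            seen_events.add(row["registrationID"])
            yield f"{event} {iri(RDF_TYPE)} {iri(CIV + row['registrationType'])} ."
            yield f"{event} {iri(CIV + 'registrationID')} {literal(row['registrationID'], 'integer')} ."
            yield f"{event} {iri(CIV + 'eventDate')} {literal(row['eventDate'], 'date')} ."
        # The role column (newborn, mother, father, deceased) names the civ property.
        yield f"{event} {iri(CIV + row['role'])} {person} ."
        yield f"{person} {iri(RDF_TYPE)} {iri(SCHEMA + 'Person')} ."
        yield f"{person} {iri(CIV + 'personID')} {literal(row['personID'], 'integer')} ."
        yield f"{person} {iri(SCHEMA + 'givenName')} {literal(row['givenName'], 'string')} ."
        yield f"{person} {iri(SCHEMA + 'familyName')} {literal(row['lastName'], 'string')} ."
        yield f"{person} {iri(SCHEMA + 'gender')} {literal(row['gender'], 'string')} ."


# The three rows of Event-1 from the example table, as comma-separated text.
SAMPLE = """registrationID,registrationType,eventDate,personID,role,givenName,lastName,gender
1,Birth,1835-10-28,1,newborn,paulina,boven,f
1,Birth,1835-10-28,2,mother,johanna pieternella,vermeulen,f
1,Birth,1835-10-28,3,father,petrus,boven,m
"""

triples = list(rows_to_ntriples(csv.DictReader(io.StringIO(SAMPLE))))
print("\n".join(triples))
```

The statements come out in a different order than in the listing above, which does not matter in RDF. For a real file, replace the io.StringIO(SAMPLE) argument with an opened CSV file.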

3. Conversion to RDF

Depending on the format, structure, and size of your dataset, different tools and scripts can be used to convert your data to RDF.

In the following, we describe two ways for converting a CSV dataset to RDF according to the CIV model described above. The CSV dataset of Dutch civil registries that we use in CLARIAH has a similar structure to the dataset described in the example above.

A working version of Python, preferably close to version 3.7, is needed to convert the data to RDF using either our script or our RDF converter tool COW. We provide a simple tutorial for installing Python on the three major operating systems: Linux, OS X, and Windows.

- Option 1: Python script (recommended for large datasets, e.g. > 1 million civil registries)

Our script for converting a large part of the Dutch civil registries dataset from CSV to RDF is publicly available. The complete dataset (more than 16 million civil registries) is not publicly available for privacy reasons, but can be made available upon request. Instead, we provide in the same directory an example dataset containing three civil registries: one birth, one marriage, and one death.

This provided Python script takes as input one CSV file describing the registrations (registrations.csv) and one CSV file describing all individuals that are involved in the registrations (persons.csv). It returns as output two RDF files in N-Quads syntax (.nq files).

Given that burgerLinker requires a single input RDF file, these two RDF files can be merged into one larger file by running the following command on the command line:

cat registrations.nq persons.nq > merged-dataset.nq
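
On systems where cat is not available (for example a plain Windows command prompt), the same merge can be done in Python. This is a minimal sketch, with merge_files as a hypothetical helper name. Unlike cat, it also adds a line break when an input file does not end with one, so that the last statement of one file cannot run into the first statement of the next.

```python
import os
import shutil
import tempfile


def merge_files(inputs, output):
    """Concatenate the input files into output, in the given order."""
    with open(output, "wb") as merged:
        for path in inputs:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, merged)
                # If the file is non-empty and lacks a final newline, add one.
                if part.tell() > 0:
                    part.seek(-1, os.SEEK_END)
                    if part.read(1) != b"\n":
                        merged.write(b"\n")


# Demonstration on throwaway files with placeholder statements.
with tempfile.TemporaryDirectory() as tmp:
    registrations = os.path.join(tmp, "registrations.nq")
    persons = os.path.join(tmp, "persons.nq")
    merged_path = os.path.join(tmp, "merged-dataset.nq")
    with open(registrations, "w", encoding="utf-8") as handle:
        handle.write("<urn:e1> <urn:p> <urn:o1> <urn:g> .\n")
    with open(persons, "w", encoding="utf-8") as handle:
        handle.write("<urn:p1> <urn:p> <urn:o2> <urn:g> .")  # no final newline
    merge_files([registrations, persons], merged_path)
    with open(merged_path, encoding="utf-8") as handle:
        merged_text = handle.read()

print(merged_text)
```

With the real output of the conversion script, the call would be merge_files(["registrations.nq", "persons.nq"], "merged-dataset.nq").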

- Option 2: COW (recommended for relatively small datasets)

If you have a standardized dataset, saved as CSV, that you want to link to other person observations, such as the Civil Registry/LINKS, the first step is to generate a corresponding metadata file with COW. Users not familiar with COW should read the documentation.

Running the build step in COW should have created a file with the same name as the input CSV, but ending with .metadata.json. This file needs to be adapted for burgerLinker to work, since burgerLinker uses a predefined schema to connect person and event observations (e.g. a newborn in a birth event). This Civil Registries Schema can be found here. Basically the data model is <Registration> <Event> <Person>, and each of these needs to follow the schema and have the correct URIs.

Next, follow these guidelines to adapt your metadata.json:

  • Change the @base URL in the metadata file to https://iisg.amsterdam/id/civ/
  • Registration, Event, and Person URIs need to have an ID as their last fragment. These IDs can be defined by the user.
    E.g. https://iisg.amsterdam/id/civ/event/b-1000
  • IDs should be unique within Registration, Event, and Person.
  • For newborns, gender needs to be defined with schema.org/gender as propertyUrl. The value can follow schema.org/gender or can be a lowercase string ('m' or 'f'). Note that in the latter case you need to use csvw:value and not valueUrl for the object. For other person types (mother, father, bride, groom, etc.) gender is derived from the role (e.g. https://iisg.amsterdam/id/civ/vocab/father).
  • The Registration step is not mandatory for burgerLinker to work, but can be used to separate a registration from the event.
  • Attributes can still be anything you like. For instance, you can add variables of persons (occupations, address, etc.) under https://iisg.amsterdam/id/civ/person/b-356. The same goes for Registration and Event. Note that some of these are defined in the Civil Registries Schema as well, such as age or occupation.

An example of a working code block in a metadata file:

"virtual": true,
"aboutUrl":"event/b-{id}",
"propertyUrl": "https://iisg.amsterdam/id/civ/vocab/newborn",
"valueUrl": "person/b-{{(id|int * 3) + 1}}"

If the 'id' variable from the CSV is 5, this block creates the triple <https://iisg.amsterdam/id/civ/event/b-5> <https://iisg.amsterdam/id/civ/vocab/newborn> <https://iisg.amsterdam/id/civ/person/b-16>. In other words, event b-5 has a newborn with id b-16.

In this case, persons had no unique id number in the CSV, so these are created with a simple function that takes the 'id' variable (referring to the event) as input. Fathers and mothers need a different function to make sure that ids don't overlap (id * 3 + 2 and id * 3 + 3 respectively).
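
The arithmetic behind these identifiers can be verified outside COW. The following Python sketch mirrors the three formulas given above; the names ROLE_OFFSETS, person_id, and person_iri are hypothetical and are not part of COW or burgerLinker.

```python
# Offsets taken from the formulas above: newborn id*3+1, father id*3+2, mother id*3+3.
ROLE_OFFSETS = {"newborn": 1, "father": 2, "mother": 3}

BASE = "https://iisg.amsterdam/id/civ/"


def person_id(event_id, role):
    """Derive a person id from the event id, giving each role its own slot."""
    return int(event_id) * 3 + ROLE_OFFSETS[role]


def person_iri(event_id, role):
    """Build the person IRI in the same shape as the valueUrl template."""
    return f"{BASE}person/b-{person_id(event_id, role)}"


print(person_iri(5, "newborn"))

# Check that no two (event, role) pairs share an id for the first 1000 event ids.
ids = [person_id(event, role) for event in range(1000) for role in ROLE_OFFSETS]
print(len(ids) == len(set(ids)))
```

Multiplying by 3 reserves a block of three consecutive numbers per event, so the scheme stays collision-free only as long as there are at most three roles per event. An event type with more roles, such as a marriage with six, would need a larger multiplier.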

Examples of a correctly modeled dataset can be found at the examples page.

Use COW to convert the CSV using the adapted metadata.json file. To make sure the output follows the Civil Registries Schema, it is advised to upload the triples (.nq file) to druid.datalegend.net and compare them to the data model of the LINKS Zeeland dataset. If you don't have access to this dataset, send us a message.

If you want to link entities across two or more of your own datasets, repeat the first three steps for the other dataset(s). See the next paragraph if you want to link your dataset to the Civil Registry.