lobid-organisations is a web app implemented with Play to serve the JSON-LD context as application/ld+json with CORS support. This is required to use the JSON-LD from third party clients, e.g. the JSON-LD Playground. It also provides proxy routes for Elasticsearch queries via HTTP (see index page of the web app for details).
Data transformation workflows, web API and UI for the lobid-organisations data set based on Metafacture and Play Framework.
Transforms two data sets from Pica-XML and CSV to JSON-LD for Elasticsearch indexing. Two or more entries are merged if their DBS IDs (INR) are identical. The resulting data is used to build an index in Elasticsearch.
The resulting JSON is JSON-LD and therefore provides machine-readable Linked Data. The context file lists all used RDF properties and classes: http://lobid.org/organisations/context.jsonld
This repo replaces the lobid-organisations part of https://github.com/lobid/lodmill.
For information about the Lobid architecture and development process, see http://hbz.github.io/#lobid.
This section contains information about building and deploying the repo, running tests, and setting up Eclipse.
Prerequisites: Java 11, Maven 3 (verify with mvn -version), and sbt.
Create and change into a folder where you want to store the projects:
mkdir ~/git ; cd ~/git
Get lobid-organisations, set up the Play application, and run the tests:
git clone https://github.com/hbz/lobid-organisations.git
cd lobid-organisations
sbt clean
sbt test
See the .github/workflows/build.yml file for details on the CI config used by GitHub Actions.
The Elasticsearch tests are defined in test/controllers.
The Metafacture tests are defined in test/transformation/TestTransformAll.java and are based on transforming sample test files from the ISIL dump and the DBS export.
Short instructions for a clean deployment, including hbz-internal steps that won't work outside the hbz network. Detailed developer documentation can be found further below.
After the build steps above, edit conf/application.conf as required (e.g. ports to be used by the embedded Elasticsearch).
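For illustration, such settings might look like the following excerpt (the key names and values here are hypothetical; check conf/application.conf for the actual ones):

```hocon
# Hypothetical excerpt: real key names and values are in conf/application.conf
index.es.port.http = 7211   # HTTP port used by the embedded Elasticsearch
index.es.port.tcp  = 7210   # TCP transport port
```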
On startup, the web app looks for data in /tmp/lobid-organisations. If data of a minimum size is already present, it is used to build an index. The minimum size is checked to prevent building an index that contains only part of the available data, or no data at all (e.g. if something goes wrong during the transformation, the result may be an empty file). This minimum size threshold is specified in conf/application.conf.
If there is no data or not of minimum size the source data is transformed and the result is saved to /tmp/lobid-organisations. This data is used to build an Elasticsearch index.
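The startup decision described above can be sketched as follows (the file name and threshold value are assumptions for this sketch; the actual threshold is read from conf/application.conf):

```shell
# Sketch of the startup check: reuse existing data if it meets the
# minimum size, otherwise transform the sources first.
MIN_SIZE=1000000   # assumed threshold in bytes; the real value is configured

usable() {
  # true if the file exists and has at least MIN_SIZE bytes
  [ -f "$1" ] && [ "$(stat -c%s "$1")" -ge "$MIN_SIZE" ]
}

# "output.json" is a hypothetical file name for this illustration
if usable /tmp/lobid-organisations/output.json; then
  echo "reuse existing data for indexing"
else
  echo "transform source data, then index the result"
fi
```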
These steps can be triggered separately via HTTP POST while the application is up and running; see below under Data-Trigger... for how to do this.
To get the Wikidata lookup table conf/wikidataLookup.tsv:
bash getWikidataLookupTableViaSparql.sh
The dbs.csv is not open data. It is provided by "bibliotheksstatistik.de" via scp to /home/dbs/ once a week, and copied from there to app/transformation/input/.
The sigel.dat is not open data. It's provided by "Zeitschriftendatenbank.de".
To get it internally, see checkIfBaseDataShouldBeUpdated.sh.
Updates of the Sigel data can be fetched. The date of the base dump (e.g. 2013-06-01) is set in conf/application.conf; updates will be downloaded from this date until today.
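Assuming GNU date, the update range covered by such a fetch can be computed like this (the base date is the example value from above):

```shell
# Compute how many days of Sigel updates would be fetched, from the
# base dump date (example value) until today. Requires GNU date (-d).
BASE_DATE=2013-06-01
TODAY=$(date +%F)
DAYS=$(( ($(date -d "$TODAY" +%s) - $(date -d "$BASE_DATE" +%s)) / 86400 ))
echo "fetching updates for $DAYS days ($BASE_DATE .. $TODAY)"
```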
Check if the $JAVA_HOME variable is set:
echo $JAVA_HOME
Set the variable to the home folder of your Java installation, not to the path of the java binary, e.g.:
export JAVA_HOME="/usr"
sbt clean
sbt --java-home $JAVA_HOME stage
JAVA_OPTS="$JAVA_OPTS -XX:+ExitOnOutOfMemoryError" ./target/universal/stage/bin/lobid-organisations -Dhttp.port=7201 -no-version-check
When startup is complete (Listening for HTTP on /0.0.0.0:7201), exit with Ctrl+D, output will be logged to target/universal/stage/logs/application.log.
The web application can also be accessed via http://lobid.org/organisations.
For monitoring config on quaoar1, see /etc/monit/conf.d/play-instances.rc. Monit logs to /var/log/monit.log. Check status with sudo monit status, reload config changes with sudo monit reload, for more see man monit.
This section contains additional information about the data workflows, indexing, and querying.
The source data sets are the Sigelverzeichnis ('Sigel', format: PicaPlus-XML) and the Deutsche Bibliotheksstatistik ('DBS', format: CSV). The transformation is implemented by a pipeline with 3 logical steps:
- Preprocess Sigel data, use DBS ID as record ID; if no DBS ID is available, use ISIL; in this step, updates are downloaded for the time period from base dump creation until today
- Preprocess DBS data, use DBS ID as record ID
- Combine all data from Sigel and DBS:
- Merge Sigel and DBS entries that have identical DBS IDs
- Entries with a unique DBS ID or without a DBS ID are included as well; they are not merged with any other entry
- The entries in the resulting data set have a URI with their ISIL as ID (e.g., http://lobid.org/organisations/DE-9). If no ISIL is available, a Pseudo-ISIL is generated consisting of the string 'DBS-' and the DBS ID (e.g., http://lobid.org/organisations/DBS-GX848).
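The ID rule above can be illustrated with a toy sketch (the helper name and argument layout are invented here; the real pipeline implements this with Metafacture and Fix scripts):

```shell
# Toy sketch of the ID rule: use the ISIL when available, otherwise
# build a Pseudo-ISIL from the string 'DBS-' plus the DBS ID.
# make_id and its arguments are invented for this illustration.
make_id() {
  isil=$1
  dbs_id=$2
  if [ -n "$isil" ]; then
    echo "http://lobid.org/organisations/$isil"
  else
    echo "http://lobid.org/organisations/DBS-$dbs_id"
  fi
}

make_id "DE-9" "12345"   # prints http://lobid.org/organisations/DE-9
make_id "" "GX848"       # prints http://lobid.org/organisations/DBS-GX848
```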
Each of these steps has a corresponding Java class, Fix scripts, and output file.
Finally, the data is indexed in Elasticsearch. The ID of an organisation is represented as a URI (e.g., http://lobid.org/organisations/DE-9). However, when building up the index, the organisations are given the last bit of this URI only as Elasticsearch IDs (e.g., DE-9). Thus, Elasticsearch-internally, the organisations can be accessed via their ISIL or Pseudo-ISIL.
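The mapping from organisation URI to Elasticsearch ID is simply the last path segment; a minimal shell sketch:

```shell
# Derive the Elasticsearch document ID from an organisation URI by
# taking everything after the last slash.
uri="http://lobid.org/organisations/DE-9"
es_id=${uri##*/}   # strip the longest prefix ending in '/'
echo "$es_id"      # prints DE-9
```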
Transformation and indexing are done automatically when starting the application. However, both processes can be triggered separately, see below.
The transformation is triggered when the application starts, but it can also be triggered separately while the application is running (this only works hbz-internally).
If you run the transformation with the full data (see above), the application will download additional updates for the Sigel data.
Thus, you will have to specify one parameter in conf/application.conf: the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01).
You can run the transformation of the full data using the following command:
curl -X POST "http://localhost:9000/organisations/transform"
Indexing is triggered when the application starts but it can also be started separately when the application is running. You can use the following command to do so:
curl -X POST "http://localhost:9000/organisations/index"
Query the resulting index:
curl -XGET 'http://localhost:7211/organisations/_search?q=*'; echo
For details on the various options see the query string syntax documentation.
Get a specific record by its ID, e.g. DE-38:
curl -XGET 'http://localhost:7211/organisations/organisation/DE-38'; echo
Exclude the metadata (you can paste the resulting document into the JSON-LD Playground for conversion tests etc.):
curl -XGET 'http://localhost:7211/organisations/organisation/DE-38/_source'; echo
For details on the various options see the GET API documentation.
Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html