lobid-organisations is a web app implemented with Play to serve the JSON-LD context as application/ld+json with CORS support. This is required to use the JSON-LD from third party clients, e.g. the JSON-LD Playground. It also provides proxy routes for Elasticsearch queries via HTTP (see index page of the web app for details).
Data transformation workflows, web API and UI for the lobid-organisations data set based on Metafacture and Play Framework.
Transforms two data sets from Pica-XML and CSV to JSON-LD for Elasticsearch indexing. Two or more entries are merged if their DBS IDs (INR) are identical. The resulting data is used to build an index in Elasticsearch.
The resulting JSON is JSON-LD and therefore provides machine-readable Linked Data. The context file lists all used RDF properties and classes: http://lobid.org/organisations/context.jsonld
This repo replaces the lobid-organisations part of https://github.com/lobid/lodmill.
For information about the Lobid architecture and development process, see http://hbz.github.io/#lobid.
This section contains information about building and deploying the repo, running tests, and setting up Eclipse.
Prerequisites: Java 11, Maven 3 (verify with mvn -version), and sbt.
Create and change into a folder where you want to store the projects:
mkdir ~/git ; cd ~/git
Get lobid-organisations, set up the Play application, and run the tests:
git clone https://github.com/hbz/lobid-organisations.git
cd lobid-organisations
sbt clean
sbt test
See the .github/workflows/build.yml file for details on the CI config used by GitHub Actions.
The Elasticsearch tests are defined in test/controllers.
The Metafacture tests are defined in test/transformation/TestTransformAll.java and are based on transforming sample test files from the ISIL dump and the DBS export.
Short instructions for a clean deployment, including hbz-internal steps that won't work outside the hbz network. Detailed developer documentation can be found further below.
After the build steps above, edit conf/application.conf as required (e.g. ports to be used by the embedded Elasticsearch).
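For illustration, such settings might look like the following excerpt (the key names and values here are hypothetical; check conf/application.conf for the actual ones):

```hocon
# Hypothetical excerpt: real key names and values are in conf/application.conf
index.es.port.http = 7211   # HTTP port used by the embedded Elasticsearch
index.es.port.tcp  = 7210   # TCP transport port
```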
On startup, the web app looks for data in /tmp/lobid-organisations. If data of a minimum size is already present, it is used to build an index. The minimum size is checked to prevent building an index that contains only part of the available data, or no data at all (e.g. if something goes wrong during the transformation, the result may be an empty file). This minimum size threshold is specified in conf/application.conf.
If there is no data or not of minimum size the source data is transformed and the result is saved to /tmp/lobid-organisations. This data is used to build an Elasticsearch index.
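The startup decision described above can be sketched as follows (the file name and threshold value are assumptions for this sketch; the actual threshold is read from conf/application.conf):

```shell
# Sketch of the startup check: reuse existing data if it meets the
# minimum size, otherwise transform the sources first.
MIN_SIZE=1000000   # assumed threshold in bytes; the real value is configured

usable() {
  # true if the file exists and has at least MIN_SIZE bytes
  [ -f "$1" ] && [ "$(stat -c%s "$1")" -ge "$MIN_SIZE" ]
}

# "output.json" is a hypothetical file name for this illustration
if usable /tmp/lobid-organisations/output.json; then
  echo "reuse existing data for indexing"
else
  echo "transform source data, then index the result"
fi
```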
These steps can be triggered separately via HTTP POST while the application is up and running; see below under Data-Trigger... for how to do this.
To get the Wikidata lookup table conf/wikidataLookup.tsv:
bash getWikidataLookupTableViaSparql.sh
The dbs.csv is not open data. It is provided by "bibliotheksstatistik.de" via scp to /home/dbs/ once a week, and copied from there to app/transformation/input/.
The sigel.dat is not open data. It's provided by "Zeitschriftendatenbank.de".
To get it internally, see checkIfBaseDataShouldBeUpdated.sh.
Updates of the Sigel data can be fetched. The date of the base dump (e.g. 2013-06-01) is set in conf/application.conf; updates will be downloaded from this date until today.
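Assuming GNU date, the update range covered by such a fetch can be computed like this (the base date is the example value from above):

```shell
# Compute how many days of Sigel updates would be fetched, from the
# base dump date (example value) until today. Requires GNU date (-d).
BASE_DATE=2013-06-01
TODAY=$(date +%F)
DAYS=$(( ($(date -d "$TODAY" +%s) - $(date -d "$BASE_DATE" +%s)) / 86400 ))
echo "fetching updates for $DAYS days ($BASE_DATE .. $TODAY)"
```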
Check if the $JAVA_HOME variable is set:
echo $JAVA_HOME
Set the variable to the home folder of your Java installation, not to the path of the java binary, e.g.:
export JAVA_HOME="/usr"
sbt clean
sbt --java-home $JAVA_HOME stage
JAVA_OPTS="$JAVA_OPTS -XX:+ExitOnOutOfMemoryError" ./target/universal/stage/bin/lobid-organisations -Dhttp.port=7201 -no-version-check
When startup is complete (Listening for HTTP on /0.0.0.0:7201), exit with Ctrl+D, output will be logged to target/universal/stage/logs/application.log.
The web application can also be accessed via http://lobid.org/organisations.
For monitoring config on quaoar1, see /etc/monit/conf.d/play-instances.rc. Monit logs to /var/log/monit.log. Check status with sudo monit status, reload config changes with sudo monit reload, for more see man monit.
This section contains additional information about the data workflows, indexing, and querying.
The source data sets are the Sigelverzeichnis ('Sigel', format: PicaPlus-XML) and the Deutsche Bibliotheksstatistik ('DBS', format: CSV). The transformation is implemented by a pipeline with 3 logical steps:
- Preprocess Sigel data, use DBS ID as record ID; if no DBS ID is available, use ISIL; in this step, updates are downloaded for the time period from base dump creation until today
- Preprocess DBS data, use DBS ID as record ID
- Combine all data from Sigel and DBS:
- Merge Sigel and DBS entries that have identical DBS IDs
- Entries with a unique DBS ID or without a DBS ID are included as well; they are not merged with any other entry
- The entries in the resulting data set have a URI with their ISIL as ID (e.g., http://lobid.org/organisations/DE-9). If no ISIL is available, a Pseudo-ISIL is generated consisting of the string 'DBS-' and the DBS ID (e.g., http://lobid.org/organisations/DBS-GX848).
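The ID rule above can be illustrated with a toy sketch (the helper name and argument layout are invented here; the real pipeline implements this with Metafacture and Fix scripts):

```shell
# Toy sketch of the ID rule: use the ISIL when available, otherwise
# build a Pseudo-ISIL from the string 'DBS-' plus the DBS ID.
# make_id and its arguments are invented for this illustration.
make_id() {
  isil=$1
  dbs_id=$2
  if [ -n "$isil" ]; then
    echo "http://lobid.org/organisations/$isil"
  else
    echo "http://lobid.org/organisations/DBS-$dbs_id"
  fi
}

make_id "DE-9" "12345"   # prints http://lobid.org/organisations/DE-9
make_id "" "GX848"       # prints http://lobid.org/organisations/DBS-GX848
```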
Each of these steps has a corresponding Java class, Fix scripts, and output file.
Finally, the data is indexed in Elasticsearch. The ID of an organisation is represented as a URI (e.g., http://lobid.org/organisations/DE-9). However, when building up the index, the organisations are given the last bit of this URI only as Elasticsearch IDs (e.g., DE-9). Thus, Elasticsearch-internally, the organisations can be accessed via their ISIL or Pseudo-ISIL.
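The mapping from organisation URI to Elasticsearch ID is simply the last path segment; a minimal shell sketch:

```shell
# Derive the Elasticsearch document ID from an organisation URI by
# taking everything after the last slash.
uri="http://lobid.org/organisations/DE-9"
es_id=${uri##*/}   # strip the longest prefix ending in '/'
echo "$es_id"      # prints DE-9
```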
Transformation and indexing are done automatically when starting the application. However, both processes can be triggered separately, see below.
The transformation is triggered when the application starts, but it can also be triggered separately while the application is running (this only works hbz-internally).
If you run the transformation with the full data (see above), the application will download additional updates for the Sigel data.
Thus, you will have to specify one parameter in conf/application.conf: the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01).
You can run the transformation of the full data using the following command:
curl -X POST "http://localhost:9000/organisations/transform"
Indexing is triggered when the application starts but it can also be started separately when the application is running. You can use the following command to do so:
curl -X POST "http://localhost:9000/organisations/index"
Query the resulting index:
curl -XGET 'http://localhost:7211/organisations/_search?q=*'; echo
For details on the various options see the query string syntax documentation.
Get a specific record by its ID, e.g. DE-38:
curl -XGET 'http://localhost:7211/organisations/organisation/DE-38'; echo
Exclude the metadata (you can paste the resulting document into the JSON-LD Playground for conversion tests etc.):
curl -XGET 'http://localhost:7211/organisations/organisation/DE-38/_source'; echo
For details on the various options see the GET API documentation.
Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html