This project uses Dev Containers to provide a consistent, pre-configured development environment. Dev containers automatically handle all dependencies including Python, MongoDB, Poetry, Task, and the Rust compiler.
### Cloud Development

No local prerequisites are needed; create a codespace in GitHub Codespaces (see the steps below).

### Local Development

- Docker Desktop (or Docker Engine + Docker Compose)
- VS Code with the Dev Containers extension

1. Configure Environment Variables: Copy `.env.example` to `.env` and add your credentials:

   ```
   cp .env.example .env
   ```

   The container will initially fail without valid environment variables. Edit `.env` with your Auth0, MongoDB, Sentry, and other service credentials (see the environment variables section).

2. Open in Dev Container:
   - In VS Code: Click "Reopen in Container" when prompted, or run "Dev Containers: Reopen in Container" from the Command Palette.
   - In GitHub Codespaces: Create a new codespace from your repository.

3. Rebuild After Configuration: After adding your `.env` file, rebuild the container:
   - VS Code: Run "Dev Containers: Rebuild Container" from the Command Palette.
   - GitHub Codespaces: Restart the workspace.
The dev container automatically installs all dependencies including MongoDB 4.4, Poetry, Python packages, Task, and the Rust compiler. No manual installation required.
If not using dev containers:

- PyPy3.11 & pip
- Task
- MongoDB 4.4.4
- Rust compiler (for libcst)

```
pip install poetry
poetry install --no-root --with dev
```

An API and Application have to be set up in Auth0, and the API needs to have the scopes listed below.
The API Identifier, Application Domain (or the custom domain if one is used), and Application Signing Certificate are needed for the environment variables (see below). The whole certificate (everything in the field, or the downloaded PEM file) has to be base64 encoded before being added to the environment variable.
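For instance, the base64 value can be produced from a downloaded certificate file with a few lines of Python (a minimal sketch; the filename `auth0.pem` is a placeholder):

```python
import base64
from pathlib import Path

# Read the downloaded signing certificate and base64 encode the whole file,
# including the BEGIN/END CERTIFICATE lines, for the AUTH0_PEM variable.
pem = Path("auth0.pem").read_bytes()  # placeholder filename
print(base64.b64encode(pem).decode("ascii"))
```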
Corpus:

- write:texts
- create:texts

Fragmentarium:

- lemmatize:fragments
- transliterate:fragments
- annotate:fragments

Bibliography:

- write:bibliography

Dictionary:

- write:words
- create:proper_nouns

General:

- access:beta
- read:texts
- read:fragments
- read:bibliography
- read:words
Folio scopes have the following format: `read:<Folio name>-folios`.

Fragments have additional scopes in the following format: `read:<Fragment group>-fragments`.
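For illustration, such scopes can be derived from the folio or fragment group name (hypothetical helpers, not project code; "WGL" and "CAIC" are placeholder names):

```python
def folio_scope(folio_name: str) -> str:
    """E.g. folio_scope("WGL") == "read:WGL-folios"."""
    return f"read:{folio_name}-folios"


def fragment_scope(fragment_group: str) -> str:
    """E.g. fragment_scope("CAIC") == "read:CAIC-fragments"."""
    return f"read:{fragment_group}-fragments"
```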
The backend authorization layer accepts permissions from either claim in the access token:

- `scope` (space-separated string)
- `permissions` (array of scope strings)

Both sources are merged and unknown values are ignored.
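A minimal sketch of this merging behaviour (illustrative only; the function name and the scope subset are made up, not the backend's actual implementation):

```python
from typing import Mapping, Set

KNOWN_SCOPES = {"read:texts", "read:fragments", "write:texts"}  # illustrative subset


def merge_token_scopes(claims: Mapping) -> Set[str]:
    """Merge the scope and permissions claims, ignoring unknown values."""
    from_scope = str(claims.get("scope", "")).split()  # space-separated string
    from_permissions = claims.get("permissions") or []  # array of scope strings
    return {s for s in [*from_scope, *from_permissions] if s in KNOWN_SCOPES}


# merge_token_scopes({"scope": "read:texts unknown", "permissions": ["write:texts"]})
# == {"read:texts", "write:texts"}
```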
Rules/Actions that add custom profile fields (for example eblName) are still supported.
Rules/Actions that copy permissions into the scope claim are no longer required for backend authorization,
because the backend reads permissions directly.
The following legacy Rule examples are kept for reference.
eBL name:

```js
function (user, context, callback) {
  const namespace = 'https://ebabylon.org/';
  context.idToken[namespace + 'eblName'] = user.user_metadata.eblName;
  callback(null, user, context);
}
```

Access token scopes (legacy compatibility example):
```js
function (user, context, callback) {
  const permissions = user.permissions || [];
  const requestedScopes = context.request.body.scope || context.request.query.scope;
  context.accessToken.scope = requestedScopes
    .split(' ')
    .filter(scope => scope.indexOf(':') < 0)
    .concat(permissions)
    .join(' ');
  callback(null, user, context);
}
```

The users should have the `eblName` property in `user_metadata`. E.g.:
```json
{
  "eblName": "Surname"
}
```

An organization and project need to be set up in Sentry. The DSN under Client Keys is needed for the environment variables (see below).
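A minimal sketch of how the DSN is typically consumed via sentry-sdk (the project wires this up itself; the variable names follow the environment variables section below):

```python
import os

import sentry_sdk

# Reads the DSN from Client Keys and the environment name from the
# variables described in the environment variables section.
sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    environment=os.environ["SENTRY_ENVIRONMENT"],
)
```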
The following are needed to run the application; see the Auth0 and Sentry sections above for setup details.
```
task format # Format all files.
task format -- --check # Check file formatting.
task lint # Run linter.
task type # Run type check.
task test # Run tests.
task test -- -n auto # Run tests in parallel.
task test -- --cov=ebl --cov-report term --cov-report xml # Run tests with coverage (slow in PyPy).
task test-all # Run format, lint and type checks, and tests.
```

See the pytest-xdist documentation for more information on parallel tests. To avoid race conditions when running the tests in parallel, run `poetry run python -m ebl.tests.downloader`.

```
task cp -- commit-message # Runs black, flake8 and pyre-check, and git add, commit and push.
```

Use Black codestyle and PEP8 naming conventions. Line length is 88, and bugbear B950 is used instead of E501. PEP8 checks should be enabled in PyCharm, but E501, E203, and E231 should be disabled.
Use type hints in new code and add them to old code when making changes.
- Avoid directed package dependency cycles.
- Domain packages should depend only on other domain packages.
- Application packages should depend only on application and domain packages.
- Web, infrastructure, etc. should depend only on application and domain packages.
- All packages can depend on common modules in the top-level ebl package.
Dependencies can be analyzed with pydepgraph:

```
pydepgraph -p . -e tests -g 2 | dot -Tpng -o graph.png
```

See dictionary-parser, proper-name-importer, fragmentarium-parser, and sign-list-parser for generating the initial data. There have been changes to the database structure since the scripts were initially used, and they most likely require updates to work with the latest version of the API.
The pull-db.sh script can be used to pull a database from another MongoDB instance to your development MongoDB. It uses mongodump and mongorestore to get all data except the changelog collection and the photos and folios buckets. To make its use less tedious, the script reads defaults from the following environment variables:

```
PULL_DB_DEFAULT_SOURCE_HOST=<source MongoDB host>
PULL_DB_DEFAULT_SOURCE_USER=<source MongoDB user>
PULL_DB_DEFAULT_SOURCE_PASSWORD=<source MongoDB password>
```

The tests use pymongo_inmemory. Depending on your OS it might be necessary to configure it in order to get the correct version of MongoDB. E.g. for Ubuntu add the following environment variables:
```
PYMONGOIM__MONGO_VERSION=4.4
PYMONGOIM__OPERATING_SYSTEM=ubuntu
PYMONGOIM__OS_VERSION=20
```
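A minimal sketch of a test against pymongo_inmemory (illustrative only; the project's actual fixtures live in `ebl.tests`):

```python
from pymongo_inmemory import MongoClient  # drop-in replacement for pymongo.MongoClient


def test_insert_and_find() -> None:
    # Starts an in-memory MongoDB, using the PYMONGOIM__* configuration above.
    with MongoClient() as client:
        collection = client["ebl-test"]["fragments"]
        collection.insert_one({"_id": "Test.Fragment", "notes": ""})
        assert collection.find_one({"_id": "Test.Fragment"})["notes"] == ""
```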
Falcon-Caching middleware can be used for caching. See the documentation for more information. Configuration is read from the `CACHE_CONFIG` environment variable.

```
CACHE_CONFIG='{"CACHE_TYPE": "simple"}' poetry run waitress-serve --port=8000 --call ebl.app:get_app
```

Falcon-Caching v1.0.1 does not cache media; `text` must be used.
```python
@cache.cached(timeout=DEFAULT_TIMEOUT)
def on_get(self, req, resp):
    resp.text = ...
```
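A fuller sketch of a cached resource serializing JSON to `resp.text` (assumed names; `WordsResource`, the timeout value, and the word data are made up):

```python
import json

import falcon
from falcon_caching import Cache

cache = Cache(config={"CACHE_TYPE": "simple"})
DEFAULT_TIMEOUT = 600  # seconds; illustrative value


class WordsResource:
    @cache.cached(timeout=DEFAULT_TIMEOUT)
    def on_get(self, req, resp):
        words = [{"word": "amēlu"}]  # stand-in for a repository lookup
        resp.text = json.dumps(words)  # resp.media would bypass the cache
        resp.content_type = falcon.MEDIA_JSON
```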
The `cache_control` decorator can be used to add the Cache-Control header to responses.

```python
@cache_control(['public', 'max-age=600'])
def on_get(self, req, resp):
    ...
```

A method to control when the header is added can be passed as the second argument.
```python
@cache_control(['public', 'max-age=600'], lambda req, resp: req.auth is None)
def on_get(self, req, resp):
    ...
```

Auth0 and falcon-auth are used for authentication and authorization.
An endpoint can be protected using the `@falcon.before` decorator in three ways:

- `@falcon.before(require_scope, "your scope name here")`: Simple check whether the user is allowed to use the endpoint. Dynamic checks based on the fetched data are not possible.
- `@falcon.before(require_folio_scope)`: Dynamically checks if the user can read folios, based on the folio name from the URL.
- `@falcon.before(require_fragment_read_scope)`: Dynamically checks if the user can read individual fragments by comparing the `authorized_scopes` from the fragment with the user's scopes.
For example:

```python
import falcon
from ebl.users.web.require_scope import require_scope, require_fragment_read_scope


@falcon.before(require_fragment_read_scope)
def on_get(self, req, resp):
    ...


@falcon.before(require_scope, "write:texts")
def on_post(self, req, resp):
    ...
```

The application reads the configuration from the following environment variables:
```
AUTH0_AUDIENCE=<the Identifier from the Auth0 API Settings>
AUTH0_ISSUER=<the Domain from the Auth0 Application Settings, or the custom domain from Branding>
AUTH0_PEM=<Signing Certificate (PEM) from the Auth0 Application Advanced Settings. The whole certificate has to be base64 encoded again before adding it to the environment.>
MONGODB_URI=<MongoDB connection URI with database>
MONGODB_DB=<MongoDB database. Optional, the authentication database will be used as default.>
EBL_AI_API=<AI API URL. If you do not have access to the AI API and do not need it, use a safe dummy value.>
SENTRY_DSN=<Sentry DSN>
SENTRY_ENVIRONMENT=<development or production>
CACHE_CONFIG=<Falcon-Caching configuration. Optional, the Null backend will be used as default.>
```

Poetry does not support .env files. The environment variables need to be configured in the shell, unless run via Task. Alternatively, an external program can be used to handle the file, e.g. direnv or Set-PsEnv.
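If you prefer not to install an external tool, a few lines of Python can export the file into the current process (a sketch, not part of the project; it only handles simple KEY=VALUE lines):

```python
import os
from pathlib import Path

# Minimal .env loader: skips blank lines and comments, no quoting rules.
for line in Path(".env").read_text().splitlines():
    if line.strip() and not line.lstrip().startswith("#"):
        key, _, value = line.partition("=")
        os.environ[key.strip()] = value.strip()
```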
If using the dev container, MongoDB is already running locally. Start the application:

```
task start
# or
poetry run waitress-serve --port=8000 --call ebl.app:get_app
```

Build and run the Docker image:
```
docker build -t ebl/api .
docker run -p 8000:8000 --rm -it --env-file=FILE --name ebl-api ebl/api
```

If you need to run custom operations inside Docker you can start the shell:

```
docker run --rm -it --env-file=.env --name ebl-shell --mount type=bind,source="$(pwd)",target=/usr/src/ebl ebl/api bash
```

Build the images:
```
docker-compose build
```

Run only the API:

```
docker-compose -f ./docker-compose-api-only.yml up
```

Run the full backend including the database and admin interface:

```
docker-compose up
```

Set up the database user in ./docker-entrypoint-initdb.d/create-users.js before the database is started for the first time:
```js
db.createUser(
  {
    user: "ebl-api",
    pwd: "<password>",
    roles: [
      { role: "readWrite", db: "ebl" }
    ]
  }
)
```

In addition to the variables specified above, the following environment variables are needed:
```
MONGODB_URI=mongodb://ebl-api:<password>@mongo:27017/ebl
MONGO_INITDB_ROOT_USERNAME=<Mongo root user>
MONGO_INITDB_ROOT_PASSWORD=<Mongo root user password>
MONGOEXPRESS_LOGIN=<Mongo Express login username>
MONGOEXPRESS_PASSWORD=<Mongo Express login password>
```

Changes to the schemas or parsers can cause the data in the database to become obsolete. Below are instructions on how to migrate the Fragmentarium and Corpus to the latest state.
Improving the parser can cause existing transliterations to contain obsolete tokens or become invalid. The signs are calculated when a fragment is saved, but if the sign list is updated the fragments are not automatically updated.
The `ebl.fragmentarium.update_fragments` module can be used to recreate transliterations and signs in all fragments. A list of invalid fragments is saved to `invalid_fragments.tsv`.
The script can be run locally:

```
poetry run python -m ebl.fragmentarium.update_fragments
```

as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-updater ebl/api poetry run python -m ebl.fragmentarium.update_fragments
```

or with docker-compose:

```
docker-compose -f ./docker-compose-updater.yml up
```

The `ebl.corpus.update_texts` module can be used to save the texts with the latest schema.
A list of invalid texts is saved to `invalid_texts.tsv`. The script saves the texts as is; transliterations are not reparsed.

The script can be run locally:

```
poetry run python -m ebl.corpus.update_texts
```

or as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-corpus-updater ebl/api poetry run python -m ebl.corpus.update_texts
```

The `ebl.alignment.align_fragmentarium` module can be used to align all fragments in the Fragmentarium with the Corpus.
The script accepts the following arguments:
```
-h, --help                     show this help message and exit
-s SKIP, --skip SKIP           Number of fragments to skip.
-l LIMIT, --limit LIMIT        Number of fragments to align.
--minScore MIN_SCORE           Minimum score to show in the results.
--maxLines MAX_LINES           Maximum size of fragment to align.
-o OUTPUT, --output OUTPUT     Filename for saving the results.
-w WORKERS, --workers WORKERS  Number of parallel workers.
-t, --threads                  Use threads instead of processes for workers.
```
The script can be run locally:

```
poetry run python -m ebl.alignment.align_fragmentarium
```

or as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-corpus-updater ebl/api poetry run python -m ebl.alignment.align_fragmentarium
```

When annotation updates accumulate orphaned cropped sign images, the `ebl.fragmentarium.migrate_cropped_images` module
can be used to clean up and regenerate all cropped images from annotations. This resolves synchronization
issues where multiple images exist per annotation.
The script can be run locally:

```
poetry run python -m ebl.fragmentarium.migrate_cropped_images
```

or as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-migration ebl/api poetry run python -m ebl.fragmentarium.migrate_cropped_images
```

The general workflow for migrating to a new data model is:

- Implement the new functionality.
- Implement fallback to handle old data, if the new model is incompatible.
- Test that fragments are updated correctly in the development database.
- Deploy to production.
- Run the migration script. Do not start the script until the deployment has been successfully completed.
- Fix invalid fragments.
- Remove fallback logic.
- Deploy to production.
The ATF importer imports and converts external .atf files, encoded according to the Oracc and C-ATF standards, to the eBL-ATF standard.
- For a description of eBL-ATF see: eBL-ATF specification
- For a list of differences between the ATF flavors see: eBL ATF and other ATF flavors
To run, use:

```
poetry run python -m ebl.atf_importer.application.atf_importer [-h] -i INPUT -g GLOSSARIES_DIRECTORY -l LOGDIR [-a] [-s]
```

- `-h` shows the help message and exits the script.
- `-i INPUT`, `--input INPUT`: Path of the input directory (required).
- `-l LOGDIR`, `--logdir LOGDIR`: Path of the log files directory (required).
- `-g GLODIR`, `--glodir GLODIR`: Path to the glossaries (`.glo` files) directory (required).
- `-a AUTHOR`, `--author AUTHOR`: Name of the author of the imported fragments. If not specified, a name needs to be entered manually for every fragment (optional).

The importer always tries to import all `.atf` files from the given input (`-i`) folder. For every imported folder, a path to a folder with glossary file(s) (`.glo`) must be specified via `-g`. You can also assign an author to all imported fragments processed in one run via the `-a` option. If `-a` is omitted, the atf-importer will ask for an author for each imported fragment.
Example calls:

```
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/data/input/" -l "ebl/atf_importer/data/logs/" -g "ebl/atf_importer/data/glossary" -a "atf_importer"
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/data/input_cdli_atf/" -l "ebl/atf_importer/data/logs/" -g "ebl/atf_importer/data/glossary" -a "test"
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/data/input_c_atf/" -l "ebl/atf_importer/data/logs/" -g "ebl/atf_importer/data/glossary" -a "test"
```

If a fragment cannot be imported, check the console output for errors. Also check the specified log folder (`error_lines.txt`, `unparsable_lines_[fragment_file].txt`, `not_imported_files.txt`) and see which lines could not be parsed.
If lines are faulty, fix them manually and retry the import process. If tokens are not lemmatized correctly, check the log file `lemmatization_log.txt`.
CSL-JSON schema is based on citation-style-language/schema Copyright (c) 2007-2018 Citation Style Language and contributors. Licensed under MIT License.