This project uses Dev Containers to provide a consistent, pre-configured development environment. Dev containers automatically handle all dependencies including Python, MongoDB, Poetry, Task, and the Rust compiler.
### Cloud Development

No local prerequisites are needed; create a codespace in GitHub Codespaces (see the steps below).

### Local Development

- Docker Desktop (or Docker Engine + Docker Compose)
- VS Code with the Dev Containers extension

1. Configure Environment Variables: Copy `.env.example` to `.env` and add your credentials:

   ```
   cp .env.example .env
   ```

   The container will initially fail without valid environment variables. Edit `.env` with your Auth0, MongoDB, Sentry, and other service credentials (see the environment variables section).

2. Open in Dev Container:
   - In VS Code: Click "Reopen in Container" when prompted, or run "Dev Containers: Reopen in Container" from the Command Palette.
   - In GitHub Codespaces: Create a new codespace from your repository.

3. Rebuild After Configuration: After adding your `.env` file, rebuild the container:
   - VS Code: Run "Dev Containers: Rebuild Container" from the Command Palette.
   - GitHub Codespaces: Restart the workspace.
The dev container automatically installs all dependencies including MongoDB 4.4, Poetry, Python packages, Task, and the Rust compiler. No manual installation required.
If not using dev containers:

- PyPy3.11 & pip
- Task
- MongoDB 4.4.4
- Rust compiler (for libcst)

```
pip install poetry
poetry install --no-root --with dev
```

An API and Application have to be set up in Auth0, and the API needs to have the scopes listed below.
The API Identifier, Application Domain (or the custom domain if one is used), and Application Signing Certificate are needed for the environment variables (see below). The whole certificate (everything in the field, or the downloaded PEM file) has to be base64 encoded before being added to the environment variable.
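For instance, the base64 value can be produced from a downloaded certificate file with a few lines of Python (a minimal sketch; the filename `auth0.pem` is a placeholder):

```python
import base64
from pathlib import Path

# Read the downloaded signing certificate and base64 encode the whole file,
# including the BEGIN/END CERTIFICATE lines, for the AUTH0_PEM variable.
pem = Path("auth0.pem").read_bytes()  # placeholder filename
print(base64.b64encode(pem).decode("ascii"))
```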
Corpus:

- write:texts
- create:texts

Fragmentarium:

- lemmatize:fragments
- transliterate:fragments
- annotate:fragments

Bibliography:

- write:bibliography

Dictionary:

- write:words
- create:proper_nouns

General:

- access:beta
- read:texts
- read:fragments
- read:bibliography
- read:words
Folio scopes have the following format: `read:<Folio name>-folios`.

Fragments have additional scopes in the following format: `read:<Fragment group>-fragments`.
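For illustration, such scopes can be derived from the folio or fragment group name (hypothetical helpers, not project code; "WGL" and "CAIC" are placeholder names):

```python
def folio_scope(folio_name: str) -> str:
    """E.g. folio_scope("WGL") == "read:WGL-folios"."""
    return f"read:{folio_name}-folios"


def fragment_scope(fragment_group: str) -> str:
    """E.g. fragment_scope("CAIC") == "read:CAIC-fragments"."""
    return f"read:{fragment_group}-fragments"
```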
The backend authorization layer accepts permissions from either claim in the access token:

- `scope` (space-separated string)
- `permissions` (array of scope strings)

Both sources are merged and unknown values are ignored.
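A minimal sketch of this merging behaviour (illustrative only; the function name and the scope subset are made up, not the backend's actual implementation):

```python
from typing import Mapping, Set

KNOWN_SCOPES = {"read:texts", "read:fragments", "write:texts"}  # illustrative subset


def merge_token_scopes(claims: Mapping) -> Set[str]:
    """Merge the scope and permissions claims, ignoring unknown values."""
    from_scope = str(claims.get("scope", "")).split()  # space-separated string
    from_permissions = claims.get("permissions") or []  # array of scope strings
    return {s for s in [*from_scope, *from_permissions] if s in KNOWN_SCOPES}


# merge_token_scopes({"scope": "read:texts unknown", "permissions": ["write:texts"]})
# == {"read:texts", "write:texts"}
```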
Rules/Actions that add custom profile fields (for example eblName) are still supported.
Rules/Actions that copy permissions into the scope claim are no longer required for backend authorization,
because the backend reads permissions directly.
The following legacy Rule examples are kept for reference.
eBL name:

```js
function (user, context, callback) {
  const namespace = 'https://ebabylon.org/';
  context.idToken[namespace + 'eblName'] = user.user_metadata.eblName;
  callback(null, user, context);
}
```

Access token scopes (legacy compatibility example):
```js
function (user, context, callback) {
  const permissions = user.permissions || [];
  const requestedScopes = context.request.body.scope || context.request.query.scope;
  context.accessToken.scope = requestedScopes
    .split(' ')
    .filter(scope => scope.indexOf(':') < 0)
    .concat(permissions)
    .join(' ');
  callback(null, user, context);
}
```

The users should have the `eblName` property in `user_metadata`. E.g.:
```json
{
  "eblName": "Surname"
}
```

An organization and project need to be set up in Sentry. The DSN under Client Keys is needed for the environment variables (see below).
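A minimal sketch of how the DSN is typically consumed via sentry-sdk (the project wires this up itself; the variable names follow the environment variables section below):

```python
import os

import sentry_sdk

# Reads the DSN from Client Keys and the environment name from the
# variables described in the environment variables section.
sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    environment=os.environ["SENTRY_ENVIRONMENT"],
)
```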
The following are needed to run the application; see the Auth0 and Sentry sections above for setup details.
```
task format # Format all files.
task format -- --check # Check file formatting.
task lint # Run linter.
task type # Run type check.
task test # Run tests.
task test -- -n auto # Run tests in parallel.
task test -- --cov=ebl --cov-report term --cov-report xml # Run tests with coverage (slow in PyPy).
task test-all # Run format, lint and type checks, and tests.
```

See the pytest-xdist documentation for more information on parallel tests. To avoid race conditions when running the tests in parallel, run `poetry run python -m ebl.tests.downloader`.

```
task cp -- commit-message # Runs black, flake8 and pyre-check, and git add, commit and push.
```

Use Black codestyle and PEP8 naming conventions. Line length is 88, and bugbear B950 is used instead of E501. PEP8 checks should be enabled in PyCharm, but E501, E203, and E231 should be disabled.
Use type hints in new code and add them to old code when making changes.
- Avoid directed package dependency cycles.
- Domain packages should depend only on other domain packages.
- Application packages should depend only on application and domain packages.
- Web, infrastructure, etc. should depend only on application and domain packages.
- All packages can depend on common modules in the top-level ebl package.
Dependencies can be analyzed with pydepgraph:

```
pydepgraph -p . -e tests -g 2 | dot -Tpng -o graph.png
```

See dictionary-parser, proper-name-importer, fragmentarium-parser, and sign-list-parser for generating the initial data. There have been changes to the database structure since the scripts were initially used, and they most likely require updates to work with the latest version of the API.
The pull-db.sh script can be used to pull a database from another MongoDB instance to your development MongoDB. It uses mongodump and mongorestore to get all data except the changelog collection and the photos and folios buckets. To make its use less tedious, the script reads defaults from the following environment variables:

```
PULL_DB_DEFAULT_SOURCE_HOST=<source MongoDB host>
PULL_DB_DEFAULT_SOURCE_USER=<source MongoDB user>
PULL_DB_DEFAULT_SOURCE_PASSWORD=<source MongoDB password>
```

The tests use pymongo_inmemory. Depending on your OS it might be necessary to configure it in order to get the correct version of MongoDB. E.g. for Ubuntu add the following environment variables:
```
PYMONGOIM__MONGO_VERSION=4.4
PYMONGOIM__OPERATING_SYSTEM=ubuntu
PYMONGOIM__OS_VERSION=20
```
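A minimal sketch of a test against pymongo_inmemory (illustrative only; the project's actual fixtures live in `ebl.tests`):

```python
from pymongo_inmemory import MongoClient  # drop-in replacement for pymongo.MongoClient


def test_insert_and_find() -> None:
    # Starts an in-memory MongoDB, using the PYMONGOIM__* configuration above.
    with MongoClient() as client:
        collection = client["ebl-test"]["fragments"]
        collection.insert_one({"_id": "Test.Fragment", "notes": ""})
        assert collection.find_one({"_id": "Test.Fragment"})["notes"] == ""
```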
Falcon-Caching middleware can be used for caching. See the documentation for more information. Configuration is read from the `CACHE_CONFIG` environment variable.

```
CACHE_CONFIG='{"CACHE_TYPE": "simple"}' poetry run waitress-serve --port=8000 --call ebl.app:get_app
```

Falcon-Caching v1.0.1 does not cache media; `text` must be used.
```python
@cache.cached(timeout=DEFAULT_TIMEOUT)
def on_get(self, req, resp):
    resp.text = ...
```
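A fuller sketch of a cached resource serializing JSON to `resp.text` (assumed names; `WordsResource`, the timeout value, and the word data are made up):

```python
import json

import falcon
from falcon_caching import Cache

cache = Cache(config={"CACHE_TYPE": "simple"})
DEFAULT_TIMEOUT = 600  # seconds; illustrative value


class WordsResource:
    @cache.cached(timeout=DEFAULT_TIMEOUT)
    def on_get(self, req, resp):
        words = [{"word": "amēlu"}]  # stand-in for a repository lookup
        resp.text = json.dumps(words)  # resp.media would bypass the cache
        resp.content_type = falcon.MEDIA_JSON
```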
The `cache_control` decorator can be used to add the Cache-Control header to responses.

```python
@cache_control(['public', 'max-age=600'])
def on_get(self, req, resp):
    ...
```

A method to control when the header is added can be passed as the second argument.
```python
@cache_control(['public', 'max-age=600'], lambda req, resp: req.auth is None)
def on_get(self, req, resp):
    ...
```

Auth0 and falcon-auth are used for authentication and authorization.
An endpoint can be protected using the `@falcon.before` decorator in three ways:

- `@falcon.before(require_scope, "your scope name here")`: Simple check whether the user is allowed to use the endpoint. Dynamic checks based on the fetched data are not possible.
- `@falcon.before(require_folio_scope)`: Dynamically checks if the user can read folios, based on the folio name from the URL.
- `@falcon.before(require_fragment_read_scope)`: Dynamically checks if the user can read individual fragments by comparing the `authorized_scopes` from the fragment with the user's scopes.
For example:

```python
import falcon
from ebl.users.web.require_scope import require_scope, require_fragment_read_scope


@falcon.before(require_fragment_read_scope)
def on_get(self, req, resp):
    ...


@falcon.before(require_scope, "write:texts")
def on_post(self, req, resp):
    ...
```

The application reads the configuration from the following environment variables:
```
AUTH0_AUDIENCE=<the Identifier from the Auth0 API Settings>
AUTH0_ISSUER=<the Domain from the Auth0 Application Settings, or the custom domain from Branding>
AUTH0_PEM=<Signing Certificate (PEM) from the Auth0 Application Advanced Settings. The whole certificate has to be base64 encoded again before adding it to the environment.>
MONGODB_URI=<MongoDB connection URI with database>
MONGODB_DB=<MongoDB database. Optional, the authentication database will be used as default.>
EBL_AI_API=<AI API URL. If you do not have access to the AI API and do not need it, use a safe dummy value.>
SENTRY_DSN=<Sentry DSN>
SENTRY_ENVIRONMENT=<development or production>
CACHE_CONFIG=<Falcon-Caching configuration. Optional, the Null backend will be used as default.>
```

Poetry does not support .env files. The environment variables need to be configured in the shell, unless run via Task. Alternatively, an external program can be used to handle the file, e.g. direnv or Set-PsEnv.
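If you prefer not to install an external tool, a few lines of Python can export the file into the current process (a sketch, not part of the project; it only handles simple KEY=VALUE lines):

```python
import os
from pathlib import Path

# Minimal .env loader: skips blank lines and comments, no quoting rules.
for line in Path(".env").read_text().splitlines():
    if line.strip() and not line.lstrip().startswith("#"):
        key, _, value = line.partition("=")
        os.environ[key.strip()] = value.strip()
```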
If using the dev container, MongoDB is already running locally. Start the application:

```
task start
# or
poetry run waitress-serve --port=8000 --call ebl.app:get_app
```

Build and run the Docker image:
```
docker build -t ebl/api .
docker run -p 8000:8000 --rm -it --env-file=FILE --name ebl-api ebl/api
```

If you need to run custom operations inside Docker you can start the shell:

```
docker run --rm -it --env-file=.env --name ebl-shell --mount type=bind,source="$(pwd)",target=/usr/src/ebl ebl/api bash
```

Build the images:
```
docker-compose build
```

Run only the API:

```
docker-compose -f ./docker-compose-api-only.yml up
```

Run the full backend including the database and admin interface:

```
docker-compose up
```

Set up the database user in ./docker-entrypoint-initdb.d/create-users.js before the database is started for the first time:
```js
db.createUser(
  {
    user: "ebl-api",
    pwd: "<password>",
    roles: [
      { role: "readWrite", db: "ebl" }
    ]
  }
)
```

In addition to the variables specified above, the following environment variables are needed:
```
MONGODB_URI=mongodb://ebl-api:<password>@mongo:27017/ebl
MONGO_INITDB_ROOT_USERNAME=<Mongo root user>
MONGO_INITDB_ROOT_PASSWORD=<Mongo root user password>
MONGOEXPRESS_LOGIN=<Mongo Express login username>
MONGOEXPRESS_PASSWORD=<Mongo Express login password>
```

Changes to the schemas or parsers can cause the data in the database to become obsolete. Below are instructions on how to migrate the Fragmentarium and Corpus to the latest state.
Improving the parser can cause existing transliterations to contain obsolete tokens or become invalid. The signs are calculated when a fragment is saved, but if the sign list is updated the fragments are not automatically updated.
The `ebl.fragmentarium.update_fragments` module can be used to recreate transliterations and signs in all fragments. A list of invalid fragments is saved to `invalid_fragments.tsv`.
The script can be run locally:

```
poetry run python -m ebl.fragmentarium.update_fragments
```

as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-updater ebl/api poetry run python -m ebl.fragmentarium.update_fragments
```

or with docker-compose:

```
docker-compose -f ./docker-compose-updater.yml up
```

The `ebl.corpus.update_texts` module can be used to save the texts with the latest schema.
A list of invalid texts is saved to `invalid_texts.tsv`. The script saves the texts as is; transliterations are not reparsed.

The script can be run locally:

```
poetry run python -m ebl.corpus.update_texts
```

or as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-corpus-updater ebl/api poetry run python -m ebl.corpus.update_texts
```

The `ebl.alignment.align_fragmentarium` module can be used to align all fragments in the Fragmentarium with the Corpus.
The script accepts the following arguments:
```
-h, --help                     show this help message and exit
-s SKIP, --skip SKIP           Number of fragments to skip.
-l LIMIT, --limit LIMIT        Number of fragments to align.
--minScore MIN_SCORE           Minimum score to show in the results.
--maxLines MAX_LINES           Maximum size of fragment to align.
-o OUTPUT, --output OUTPUT     Filename for saving the results.
-w WORKERS, --workers WORKERS  Number of parallel workers.
-t, --threads                  Use threads instead of processes for workers.
```
The script can be run locally:

```
poetry run python -m ebl.alignment.align_fragmentarium
```

or as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-corpus-updater ebl/api poetry run python -m ebl.alignment.align_fragmentarium
```

When annotation updates accumulate orphaned cropped sign images, the `ebl.fragmentarium.migrate_cropped_images` module
can be used to clean up and regenerate all cropped images from annotations. This resolves synchronization
issues where multiple images exist per annotation.
The script can be run locally:

```
poetry run python -m ebl.fragmentarium.migrate_cropped_images
```

or as a standalone container:

```
docker build -t ebl/api .
docker run --rm -it --env-file=.env --name ebl-migration ebl/api poetry run python -m ebl.fragmentarium.migrate_cropped_images
```

The general workflow for migrating to a new data model is:

- Implement the new functionality.
- Implement fallback to handle old data, if the new model is incompatible.
- Test that fragments are updated correctly in the development database.
- Deploy to production.
- Run the migration script. Do not start the script until the deployment has been successfully completed.
- Fix invalid fragments.
- Remove fallback logic.
- Deploy to production.
The ATF importer imports and converts external .atf files, encoded according to the Oracc and C-ATF standards, to the eBL-ATF standard.
- For a description of eBL-ATF see: eBL-ATF specification
- For a list of differences between the ATF flavors see: eBL ATF and other ATF flavors
To run, use:

```
poetry run python -m ebl.atf_importer.application.atf_importer [-h] -i INPUT -g GLOSSARIES_DIRECTORY -l LOGDIR [-a] [-s]
```

- `-h` shows the help message and exits the script.
- `-i INPUT`, `--input INPUT`: Path of the input directory (required).
- `-l LOGDIR`, `--logdir LOGDIR`: Path of the log files directory (required).
- `-g GLODIR`, `--glodir GLODIR`: Path to the glossaries (`.glo` files) directory (required).
- `-a AUTHOR`, `--author AUTHOR`: Name of the author of the imported fragments. If not specified, a name needs to be entered manually for every fragment (optional).

The importer always tries to import all `.atf` files from the given input (`-i`) folder. For every imported folder, a path to a folder with glossary file(s) (`.glo`) must be specified via `-g`. You can also assign an author to all imported fragments processed in one run via the `-a` option. If `-a` is omitted, the atf-importer will ask for an author for each imported fragment.
Example calls:

```
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/data/input/" -l "ebl/atf_importer/data/logs/" -g "ebl/atf_importer/data/glossary" -a "atf_importer"
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/data/input_cdli_atf/" -l "ebl/atf_importer/data/logs/" -g "ebl/atf_importer/data/glossary" -a "test"
poetry run python -m ebl.atf_importer.application.atf_importer -i "ebl/atf_importer/data/input_c_atf/" -l "ebl/atf_importer/data/logs/" -g "ebl/atf_importer/data/glossary" -a "test"
```

If a fragment cannot be imported, check the console output for errors. Also check the specified log folder (`error_lines.txt`, `unparsable_lines_[fragment_file].txt`, `not_imported_files.txt`) and see which lines could not be parsed.
If lines are faulty, fix them manually and retry the import process. If tokens are not lemmatized correctly, check the log file `lemmatization_log.txt`.
CSL-JSON schema is based on citation-style-language/schema Copyright (c) 2007-2018 Citation Style Language and contributors. Licensed under MIT License.