The code is structured as follows:

- The `config` directory contains configuration files required to run the project.
- The `data` directory contains raw and processed data (not included in the repository for privacy reasons).
- The `docker` directory contains the files required to run the project in a Docker container.
- The `docs` directory contains documentation and literature.
- The `experiments` directory contains the input and output artifacts per experiment.
- The `logs` directory contains log files.
- The `models` directory contains saved models.
- The `output` directory contains output files (e.g., predictions and visualizations).
- The `scripts` directory contains various scripts.
- The `src` directory contains the source code.
- The `tests` directory contains unit tests.
Please find more information on installing, running, and testing the project in the sections below.
This project requires the following dependencies to be installed:
- Python >= 3.10
- Poetry >= 2.1.3
All dependencies are listed in the `pyproject.toml` file.
Alternatively, Docker can be used to run the project. The Dockerfile is provided in the repository. This requires Docker to be installed on the host machine.
You can install the package locally using Poetry:

```sh
poetry install --no-root
```

You can build the Docker image using the provided Dockerfile:

```sh
docker build -t mnmkit:latest -f ./docker/Dockerfile .
```

In environments where Poetry is not available and Docker is not an option, the project can be run using pip and virtualenv. The following steps can be used to set up the project:
- Create a `requirements.txt` file from the `pyproject.toml` file using Poetry:

  ```sh
  poetry export --with=dev --with=test --without-hashes --without-urls | awk '{ print $1 }' FS=';' > requirements.txt
  ```

- Create a virtual environment using `virtualenv`:

  ```sh
  virtualenv venv && source venv/bin/activate
  ```

- Install the dependencies using `pip`:

  ```sh
  pip install -r requirements.txt
  ```

- Replace all `poetry run python` commands with `python` in all subsequent commands (see the example below).
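For example, with the virtual environment activated, the entry-point command from the next section is invoked directly with `python` (shown here with the default configuration and mode):

```sh
# Run the entry point directly with the virtualenv's Python
# (the Poetry equivalent is `poetry run python src/main.py ...`)
python src/main.py --config config/config_default.yml --mode train
```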
The entry point of the project is the `main.py` file. Run the project using Poetry:

```sh
poetry run python src/main.py --config [CONFIG_PATH] --mode [MODE] [OPTIONS]
```

You can also run the project in a Docker container after building the image (see the Setup section above):

```sh
docker run --rm \
    -v ./data:/data \
    -v ./output:/output \
    -v ./logs:/logs \
    -v ./models:/models \
    -v ./config:/config \
    mnmkit:latest \
    --config [CONFIG_PATH] \
    --mode [MODE] \
    [OPTIONS]
```

The `main.py` file accepts the following required arguments:
- `--config`: Path to the configuration file. Default: `config/config_default.yml`.
- `--mode`: Mode to run the project in. Options: `train`, `inference`, `validate`, `tune`. Default: `train`.
The `main.py` file accepts the following optional arguments:

- `--checkpoint`: Path to a checkpoint to load.
- `--checkpoint-interval`: Interval at which to save checkpoints.
- `--data_dir`: Directory to save data.
- `--debug`: Debug mode flag.
- `--device`: Device to run the project on. Options: `cpu`, `cuda`.
- `--log_dir`: Directory to save logs.
- `--log_level`: Log level. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
- `--models_dir`: Directory to save models.
- `--num_workers`: Number of workers for data loading.
- `--output_dir`: Directory to save output.
- `--skip-preprocessing`: Flag to skip the preprocessing pipeline.
- `--seed`: Seed for reproducibility.
- `--verbose`: Verbosity flag.
Command line arguments override settings in the provided configuration file.
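For example, the following invocation (with illustrative values; the configuration file name is hypothetical) trains a model with a custom configuration while overriding the device and seed from that file:

```sh
# Illustrative run: the config path and option values are examples only
poetry run python src/main.py \
    --config config/my_config.yml \
    --mode train \
    --device cuda \
    --seed 42 \
    --verbose
```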
Datasets must be provided in the `data` directory.
This project supports the following data types:
- Tabular data in the following formats: `csv`, `rds`.
- Image data in the following formats: `jpg`, `jpeg`, `png`, `bmp`, `tiff`. Images should be placed in a directory with the same name as the dataset file (e.g., `dataset/<image_name>.jpg`). If a test split is provided, the test split directory must be named `test` (e.g., `dataset/test/<image_name>.jpg`). An example layout is sketched below.
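As an illustration, the expected layout for an image dataset could be created as follows (the dataset name `faces` is a hypothetical example):

```sh
# Hypothetical example: layout for an image dataset named "faces"
mkdir -p data/faces/test
# Place training/validation images at data/faces/<image_name>.jpg
# Place test-split images at data/faces/test/<image_name>.jpg
```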
The preprocessing pipeline is automatically applied to the datasets. The configuration file contains settings for the preprocessing pipeline, such as split sizes, transformations, and shuffling.
Processed datasets are saved in the `data/processed` directory.

The internal file format is HDF, but it can be changed to `csv` in the configuration file for debugging purposes (not recommended for large datasets). To view HDF files, use the hdfview tool.

To skip the preprocessing pipeline, use the `--skip-preprocessing` flag.
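For example, a training run on previously processed data could skip the preprocessing step like this:

```sh
# Reuse already-processed data instead of running the preprocessing pipeline
poetry run python src/main.py --config config/config_default.yml --mode train --skip-preprocessing
```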
The project configuration is defined in the `config` directory. The configuration file is in YAML format. The default configuration file is `config/config_default.yml`.
Important: Copy this file to create a custom configuration. The default configuration will be overwritten each time the application is run.
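For example (the custom file name is arbitrary):

```sh
# Copy the default configuration before editing, since the default file
# is overwritten on every run; "config_custom.yml" is an example name
cp config/config_default.yml config/config_custom.yml
poetry run python src/main.py --config config/config_custom.yml --mode train
```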
The configuration file contains the following sections:
- `dataset`: Configuration for the dataset.
- `model`: Configuration for the model.
- `train`: Configuration for the training process.
- `inference`: Configuration for the inference process.
- `validation`: Configuration for the validation process.
- `meta`: Meta information for the project.
- `general`: General configuration for the project.
- `system`: System configuration for the project.
The configuration is validated against the schema before a task is run. Missing values are filled with default values from the default configuration file.
The application generates different types of artifacts during the execution of the project.
- Logs: Logs are saved in the `logs` directory, unless a custom log directory is provided. Logs are saved in the `log` format.
- Models: Models are saved in the `models` directory, unless a custom models directory is provided. Models are represented by their state dictionary and are saved in the `pt` format. Checkpoints are stored in the `checkpoints` subdirectory.
- Output: Output files are saved in the `output` directory, unless a custom output directory is provided. Examples of output files are metrics, visualizations, and predictions.
- Processed Data: Processed data is saved in the `data/processed` directory, in the internal file format (`hdf` by default; see the Data section).
Output files are not tracked in the repository. Output files and folders can be safely deleted. Additionally, all input and output files are copied to a timestamped experiment folder to keep track of experiments and to enable reproducibility.
The `scripts` directory contains various scripts:

- The data scripts can be used to generate artificial data for testing purposes.
- The slurm scripts are used to run the project on the (Snellius) HPC.
- The sync script can be used to synchronize the required files to the HPC (requires an established SSH connection).
Poetry is used for dependency management; see the Poetry documentation for details.
- Add a new dependency using `poetry`:

  ```sh
  poetry add [DEPENDENCY]
  ```

- Update a dependency:

  ```sh
  poetry update [DEPENDENCY]
  ```

- Remove a dependency:

  ```sh
  poetry remove [DEPENDENCY]
  ```
If the `pyproject.toml` file is manually updated, make sure to run the following command to update the lock file and to install the new dependencies:

```sh
poetry lock && poetry install
```

Pre-commit hooks are used to ensure consistent and validated code style and quality; see the pre-commit documentation for details. The hooks are defined in the `.pre-commit-config.yaml` file.
- Install the pre-commit hooks using `poetry`:

  ```sh
  poetry run pre-commit install
  ```

- Run the pre-commit hooks manually using `poetry`:

  ```sh
  poetry run pre-commit run --all-files
  ```
Note: Code quality can also be manually checked before committing changes using the code quality tools described below in the Code Quality section.
- Create a new branch for a new feature or bug fix:

  ```sh
  git checkout -b feature/feature_name
  ```

- Commit changes to the branch:

  ```sh
  git add .
  git commit -m "Commit message"
  ```

- Push the branch to the remote repository:

  ```sh
  git push origin feature/feature_name
  ```

- Create a pull request on GitHub and assign reviewers.
- The PEP 8 standard is used as the code style guide for Python code; see the PEP 8 documentation.
- Code style is enforced using the pre-commit hooks and CI/CD pipelines. Manual checks are available using the code quality tools described below in the Code Quality section.
- Type annotations are used to enforce type checking; see the Python documentation on type hints.
Code quality is enforced using the bundled code quality tools and unit tests.
Different levels of code quality checks are available:
- Manual checks are available using the code quality tools described below in the Code Quality section and the testing tools in the Testing section.
- Pre-commit hooks are used to validate local changes before committing.
- Remote CI/CD pipelines are used to ensure code quality and to run tests. The CI/CD pipeline is set up using GitHub Actions and can be found in the `.github/workflows` directory.
- Model validation is performed using the `validate` mode in the `main.py` file. This mode can be used to validate the model using a validation dataset; an example invocation is sketched below.
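For example, a saved checkpoint could be validated as follows (the configuration and checkpoint paths are placeholders):

```sh
# Illustrative validation run; paths are placeholders, not repository defaults
poetry run python src/main.py \
    --config config/config_default.yml \
    --mode validate \
    --checkpoint models/checkpoints/checkpoint.pt
```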
The following code quality tools are bundled with the project:

- PyLint is used for static code analysis; see the PyLint documentation. Settings for PyLint can be found in the `.pylintrc` file. Run PyLint using `poetry`:

  ```sh
  poetry run pylint ./src
  ```

- Ruff can be used for additional static code analysis; see the Ruff documentation. Settings for Ruff can be found in the `.ruff` file. Run Ruff using `poetry`:

  ```sh
  poetry run ruff check ./src
  ```

- MyPy is used for static type checking; see the MyPy documentation. Run MyPy using `poetry`:

  ```sh
  poetry run mypy ./src
  ```
- isort is used for import sorting; see the isort documentation. Run isort using `poetry`:

  ```sh
  poetry run isort --diff ./src
  ```

  To apply the changes to the files, run without the `--diff` flag:

  ```sh
  poetry run isort ./src
  ```

- Black (PEP 8 compliant) is used for code formatting; see the Black documentation. Run Black using `poetry`:

  ```sh
  poetry run black --check --diff ./src
  ```

  To apply the changes to the files, run without the `--check` and `--diff` flags:

  ```sh
  poetry run black ./src
  ```

- Ruff (more aggressive, use with caution!) can be used for additional code formatting; see the Ruff documentation. Settings for Ruff can be found in the `.ruff` file. Run Ruff using `poetry`:

  ```sh
  poetry run ruff format --check --diff ./src
  ```

  To apply the changes to the files, run without the `--check` and `--diff` flags:

  ```sh
  poetry run ruff format ./src
  ```
- Autoflake can be used to remove unused imports and variables; see the Autoflake documentation. Run Autoflake using `poetry`:

  ```sh
  poetry run autoflake --remove-all-unused-imports --remove-unused-variables --in-place --check -r .
  ```

  To apply the changes to the files, run without the `--check` flag:

  ```sh
  poetry run autoflake --remove-all-unused-imports --remove-unused-variables --in-place -r .
  ```

- pydocstringformatter is used to format docstrings; see the pydocstringformatter documentation. Run pydocstringformatter using `poetry`:

  ```sh
  poetry run pydocstringformatter ./src
  ```

  To apply the changes to the files, use the `--write` flag:

  ```sh
  poetry run pydocstringformatter ./src --write
  ```
- PyUpgrade can be used to upgrade Python syntax; see the PyUpgrade documentation. Note that PyUpgrade expects the files to process as arguments. Run PyUpgrade using `poetry`:

  ```sh
  poetry run pyupgrade --py312-plus
  ```

- Bandit is used for code vulnerability and security checks; see the Bandit documentation. Settings for Bandit can be found in the `.bandit` file. Run Bandit using `poetry`:

  ```sh
  poetry run bandit -r .
  ```
- Radon is used for code metrics; see the Radon documentation. Various metrics can be calculated using Radon:

  - Cyclomatic Complexity, which measures the complexity of the code:

    ```sh
    poetry run radon cc ./src
    ```

  - Maintainability Index, which measures the maintainability of the code:

    ```sh
    poetry run radon mi ./src
    ```

  - Halstead Metrics, which measure the complexity of the code:

    ```sh
    poetry run radon hal ./src
    ```

  - Raw Metrics, such as the number of source lines, comments, and blank lines:

    ```sh
    poetry run radon raw ./src
    ```

- Xenon is used for automated code complexity checks; Xenon uses Radon under the hood. See the Xenon documentation. Run Xenon using `poetry`:

  ```sh
  poetry run xenon --max-absolute B --max-modules B --max-average A ./src
  ```

  Meaning of the flags:

  - `--max-absolute`: Maximum absolute complexity.
  - `--max-modules`: Maximum complexity per module.
  - `--max-average`: Maximum average complexity.
This code base uses different levels of testing to ensure code quality and functionality.
- Unit testing: Unit tests are used to test individual components of the code base. Unit tests are written using `pytest`; see the pytest documentation. Run the unit tests using `poetry`:

  ```sh
  poetry run pytest
  ```

- Code coverage reports: Code coverage reports are generated using `pytest-cov` (a wrapper for Coverage); see the pytest-cov documentation. Run the code coverage reports using `poetry`:

  ```sh
  poetry run pytest --cov=src --cov-report=term --cov-report=html --cov-report=xml --cov-fail-under=80
  ```

- Property-based testing: Property-based testing is used to test the code base against a wide range of scenarios. Property-based tests are written using `hypothesis`; see the Hypothesis documentation. These tests are automatically executed together with the pytest unit tests. Hypothesis test statistics can be shown using the following command:

  ```sh
  poetry run pytest --hypothesis-show-statistics
  ```

| Type of Project | Title | Author(s) | Link |
|---|---|---|---|
| MSc Thesis | VAE-based Multivariate Normative Modeling: An Investigation of Covariate Modeling Methods | Remy Duijsens | View Project |
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, please contact me at:
- Email: [email protected]
- LinkedIn: Remy Duijsens
- GitHub: remdui