The code is structured as follows:

- The `config` directory contains configuration files required to run the project.
- The `data` directory contains raw and processed data (not included in the repository for privacy reasons).
- The `docker` directory contains the files required to run the project in a Docker container.
- The `docs` directory contains documentation and literature.
- The `experiments` directory contains the input and output artifacts per experiment.
- The `logs` directory contains log files.
- The `models` directory contains saved models.
- The `output` directory contains output files (e.g., predictions and visualizations).
- The `scripts` directory contains various scripts.
- The `src` directory contains the source code.
- The `tests` directory contains unit tests.
Please find more information on installing, running, and testing the project in the sections below.
This project requires the following dependencies to be installed:
- Python >= 3.10
- Poetry >= 2.1.3
All dependencies are listed in the `pyproject.toml` file.
Alternatively, Docker can be used to run the project. The Dockerfile is provided in the repository. This requires Docker to be installed on the host machine.
You can install the package locally using Poetry:

```sh
poetry install --no-root
```

You can build the Docker image using the provided Dockerfile:

```sh
docker build -t mnmkit:latest -f ./docker/Dockerfile .
```

In environments where Poetry is not available and Docker is not an option, the project can be run using pip and virtualenv. The following steps can be used to set up the project:
- Create a `requirements.txt` file from the `pyproject.toml` file using Poetry:

  ```sh
  poetry export --with=dev --with=test --without-hashes --without-urls | awk '{ print $1 }' FS=';' > requirements.txt
  ```

- Create a virtual environment using `virtualenv`:

  ```sh
  virtualenv venv && source venv/bin/activate
  ```

- Install the dependencies using `pip`:

  ```sh
  pip install -r requirements.txt
  ```

- Replace all `poetry run python` commands with `python` in all subsequent commands (see the example below).
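For example, with the virtual environment activated, the entry-point command from the next section is invoked directly with `python` (shown here with the default configuration and mode):

```sh
# Run the entry point directly with the virtualenv's Python
# (the Poetry equivalent is `poetry run python src/main.py ...`)
python src/main.py --config config/config_default.yml --mode train
```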
The entry point of the project is the `main.py` file. Run the project using Poetry:

```sh
poetry run python src/main.py --config [CONFIG_PATH] --mode [MODE] [OPTIONS]
```

You can also run the project in a Docker container after building the image (see the Setup section above):

```sh
docker run --rm \
    -v ./data:/data \
    -v ./output:/output \
    -v ./logs:/logs \
    -v ./models:/models \
    -v ./config:/config \
    mnmkit:latest \
    --config [CONFIG_PATH] \
    --mode [MODE] \
    [OPTIONS]
```

The `main.py` file accepts the following required arguments:
- `--config`: Path to the configuration file. Default: `config/config_default.yml`.
- `--mode`: Mode to run the project in. Options: `train`, `inference`, `validate`, `tune`. Default: `train`.
The `main.py` file accepts the following optional arguments:

- `--checkpoint`: Path to a checkpoint to load.
- `--checkpoint-interval`: Interval at which to save checkpoints.
- `--data_dir`: Directory to save data.
- `--debug`: Debug mode flag.
- `--device`: Device to run the project on. Options: `cpu`, `cuda`.
- `--log_dir`: Directory to save logs.
- `--log_level`: Log level. Options: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.
- `--models_dir`: Directory to save models.
- `--num_workers`: Number of workers for data loading.
- `--output_dir`: Directory to save output.
- `--skip-preprocessing`: Flag to skip the preprocessing pipeline.
- `--seed`: Seed for reproducibility.
- `--verbose`: Verbosity flag.
Command line arguments override settings in the provided configuration file.
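For example, the following invocation (with illustrative values; the configuration file name is hypothetical) trains a model with a custom configuration while overriding the device and seed from that file:

```sh
# Illustrative run: the config path and option values are examples only
poetry run python src/main.py \
    --config config/my_config.yml \
    --mode train \
    --device cuda \
    --seed 42 \
    --verbose
```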
Datasets must be provided in the `data` directory.
This project supports the following data types:
- Tabular data in the following formats: `csv`, `rds`.
- Image data in the following formats: `jpg`, `jpeg`, `png`, `bmp`, `tiff`. Images should be placed in a directory with the same name as the dataset file (e.g., `dataset/<image_name>.jpg`). If a test split is provided, the test split directory must be named `test` (e.g., `dataset/test/<image_name>.jpg`). An example layout is sketched below.
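As an illustration, the expected layout for an image dataset could be created as follows (the dataset name `faces` is a hypothetical example):

```sh
# Hypothetical example: layout for an image dataset named "faces"
mkdir -p data/faces/test
# Place training/validation images at data/faces/<image_name>.jpg
# Place test-split images at data/faces/test/<image_name>.jpg
```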
The preprocessing pipeline is automatically applied to the datasets. The configuration file contains settings for the preprocessing pipeline, such as split sizes, transformations, and shuffling.
Processed datasets are saved in the `data/processed` directory.

The internal file format is HDF, but it can be changed to `csv` in the configuration file for debugging purposes (not recommended for large datasets). To view HDF files, use the hdfview tool.

To skip the preprocessing pipeline, use the `--skip-preprocessing` flag.
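For example, a training run on previously processed data could skip the preprocessing step like this:

```sh
# Reuse already-processed data instead of running the preprocessing pipeline
poetry run python src/main.py --config config/config_default.yml --mode train --skip-preprocessing
```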
The project configuration is defined in the `config` directory. The configuration file is in YAML format. The default configuration file is `config/config_default.yml`.
Important: Copy this file to create a custom configuration. The default configuration will be overwritten each time the application is run.
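For example (the custom file name is arbitrary):

```sh
# Copy the default configuration before editing, since the default file
# is overwritten on every run; "config_custom.yml" is an example name
cp config/config_default.yml config/config_custom.yml
poetry run python src/main.py --config config/config_custom.yml --mode train
```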
The configuration file contains the following sections:
- `dataset`: Configuration for the dataset.
- `model`: Configuration for the model.
- `train`: Configuration for the training process.
- `inference`: Configuration for the inference process.
- `validation`: Configuration for the validation process.
- `meta`: Meta information for the project.
- `general`: General configuration for the project.
- `system`: System configuration for the project.
The configuration is validated against the schema before a task is run. Missing values are filled with default values from the default configuration file.
The application generates different types of artifacts during the execution of the project.
- Logs: Logs are saved in the `logs` directory, unless a custom log directory is provided. Logs are saved in the `log` format.
- Models: Models are saved in the `models` directory, unless a custom models directory is provided. Models are represented by their state dictionary and are saved in the `pt` format. Checkpoints are stored in the `checkpoints` subdirectory.
- Output: Output files are saved in the `output` directory, unless a custom output directory is provided. Examples of output files are metrics, visualizations, and predictions.
- Processed Data: Processed data is saved in the `data/processed` directory, in the internal file format (`hdf` by default; see the Data section).
Output files are not tracked in the repository. Output files and folders can be safely deleted. Additionally, all input and output files are copied to a timestamped experiment folder to keep track of experiments and to enable reproducibility.
The `scripts` directory contains various scripts:

- The data scripts can be used to generate artificial data for testing purposes.
- The slurm scripts are used to run the project on the (Snellius) HPC.
- The sync script can be used to synchronize the required files to the HPC (requires an established SSH connection).
Poetry is used for dependency management; see the Poetry documentation for details.
- Add a new dependency using `poetry`:

  ```sh
  poetry add [DEPENDENCY]
  ```

- Update a dependency:

  ```sh
  poetry update [DEPENDENCY]
  ```

- Remove a dependency:

  ```sh
  poetry remove [DEPENDENCY]
  ```
If the `pyproject.toml` file is manually updated, make sure to run the following command to update the lock file and to install the new dependencies:

```sh
poetry lock && poetry install
```

Pre-commit hooks are used to ensure consistent and validated code style and quality; see the pre-commit documentation for details. The hooks are defined in the `.pre-commit-config.yaml` file.
- Install the pre-commit hooks using `poetry`:

  ```sh
  poetry run pre-commit install
  ```

- Run the pre-commit hooks manually using `poetry`:

  ```sh
  poetry run pre-commit run --all-files
  ```
Note: Code quality can also be manually checked before committing changes using the code quality tools described below in the Code Quality section.
- Create a new branch for a new feature or bug fix:

  ```sh
  git checkout -b feature/feature_name
  ```

- Commit changes to the branch:

  ```sh
  git add .
  git commit -m "Commit message"
  ```

- Push the branch to the remote repository:

  ```sh
  git push origin feature/feature_name
  ```

- Create a pull request on GitHub and assign reviewers.
- The PEP 8 standard is used as the code style guide for Python code; see the PEP 8 documentation.
- Code style is enforced using the pre-commit hooks and CI/CD pipelines. Manual checks are available using the code quality tools described below in the Code Quality section.
- Type annotations are used to enforce type checking; see the Python documentation on type hints.
Code quality is enforced using the bundled code quality tools and unit tests.
Different levels of code quality checks are available:
- Manual checks are available using the code quality tools described below in the Code Quality section and the testing tools in the Testing section.
- Pre-commit hooks are used to validate local changes before committing.
- Remote CI/CD pipelines are used to ensure code quality and to run tests. The CI/CD pipeline is set up using GitHub Actions and can be found in the `.github/workflows` directory.
- Model validation is performed using the `validate` mode in the `main.py` file. This mode can be used to validate the model using a validation dataset; an example invocation is sketched below.
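For example, a saved checkpoint could be validated as follows (the configuration and checkpoint paths are placeholders):

```sh
# Illustrative validation run; paths are placeholders, not repository defaults
poetry run python src/main.py \
    --config config/config_default.yml \
    --mode validate \
    --checkpoint models/checkpoints/checkpoint.pt
```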
The following code quality tools are bundled with the project:

- PyLint is used for static code analysis; see the PyLint documentation. Settings for PyLint can be found in the `.pylintrc` file. Run PyLint using `poetry`:

  ```sh
  poetry run pylint ./src
  ```

- Ruff can be used for additional static code analysis; see the Ruff documentation. Settings for Ruff can be found in the `.ruff` file. Run Ruff using `poetry`:

  ```sh
  poetry run ruff check ./src
  ```

- MyPy is used for static type checking; see the MyPy documentation. Run MyPy using `poetry`:

  ```sh
  poetry run mypy ./src
  ```
- isort is used for import sorting; see the isort documentation. Run isort using `poetry`:

  ```sh
  poetry run isort --diff ./src
  ```

  To apply the changes to the files, run without the `--diff` flag:

  ```sh
  poetry run isort ./src
  ```

- Black (PEP 8 compliant) is used for code formatting; see the Black documentation. Run Black using `poetry`:

  ```sh
  poetry run black --check --diff ./src
  ```

  To apply the changes to the files, run without the `--check` and `--diff` flags:

  ```sh
  poetry run black ./src
  ```

- Ruff (more aggressive, use with caution!) can be used for additional code formatting; see the Ruff documentation. Settings for Ruff can be found in the `.ruff` file. Run Ruff using `poetry`:

  ```sh
  poetry run ruff format --check --diff ./src
  ```

  To apply the changes to the files, run without the `--check` and `--diff` flags:

  ```sh
  poetry run ruff format ./src
  ```
- Autoflake can be used to remove unused imports and variables; see the Autoflake documentation. Run Autoflake using `poetry`:

  ```sh
  poetry run autoflake --remove-all-unused-imports --remove-unused-variables --in-place --check -r .
  ```

  To apply the changes to the files, run without the `--check` flag:

  ```sh
  poetry run autoflake --remove-all-unused-imports --remove-unused-variables --in-place -r .
  ```

- pydocstringformatter is used to format docstrings; see the pydocstringformatter documentation. Run pydocstringformatter using `poetry`:

  ```sh
  poetry run pydocstringformatter ./src
  ```

  To apply the changes to the files, use the `--write` flag:

  ```sh
  poetry run pydocstringformatter ./src --write
  ```
- PyUpgrade can be used to upgrade Python syntax; see the PyUpgrade documentation. Note that PyUpgrade expects the files to process as arguments. Run PyUpgrade using `poetry`:

  ```sh
  poetry run pyupgrade --py312-plus
  ```

- Bandit is used for code vulnerability and security checks; see the Bandit documentation. Settings for Bandit can be found in the `.bandit` file. Run Bandit using `poetry`:

  ```sh
  poetry run bandit -r .
  ```
- Radon is used for code metrics; see the Radon documentation. Various metrics can be calculated using Radon:

  - Cyclomatic Complexity, which measures the complexity of the code:

    ```sh
    poetry run radon cc ./src
    ```

  - Maintainability Index, which measures the maintainability of the code:

    ```sh
    poetry run radon mi ./src
    ```

  - Halstead Metrics, which measure the complexity of the code:

    ```sh
    poetry run radon hal ./src
    ```

  - Raw Metrics, such as the number of source lines, comments, and blank lines:

    ```sh
    poetry run radon raw ./src
    ```

- Xenon is used for automated code complexity checks; Xenon uses Radon under the hood. See the Xenon documentation. Run Xenon using `poetry`:

  ```sh
  poetry run xenon --max-absolute B --max-modules B --max-average A ./src
  ```

  Meaning of the flags:

  - `--max-absolute`: Maximum absolute complexity.
  - `--max-modules`: Maximum complexity per module.
  - `--max-average`: Maximum average complexity.
This code base uses different levels of testing to ensure code quality and functionality.
- Unit testing: Unit tests are used to test individual components of the code base. Unit tests are written using `pytest`; see the pytest documentation. Run the unit tests using `poetry`:

  ```sh
  poetry run pytest
  ```

- Code coverage reports: Code coverage reports are generated using `pytest-cov` (a wrapper for Coverage); see the pytest-cov documentation. Run the code coverage reports using `poetry`:

  ```sh
  poetry run pytest --cov=src --cov-report=term --cov-report=html --cov-report=xml --cov-fail-under=80
  ```

- Property-based testing: Property-based testing is used to test the code base against a wide range of scenarios. Property-based tests are written using `hypothesis`; see the Hypothesis documentation. These tests are automatically executed together with the pytest unit tests. Hypothesis test statistics can be shown using the following command:

  ```sh
  poetry run pytest --hypothesis-show-statistics
  ```

| Type of Project | Title | Author(s) | Link |
|---|---|---|---|
| MSc Thesis | VAE-based Multivariate Normative Modeling: An Investigation of Covariate Modeling Methods | Remy Duijsens | View Project |
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, please contact me at:
- Email: [email protected]
- LinkedIn: Remy Duijsens
- GitHub: remdui