We present MPRA-MNIST: a standardized dataset and toolkit. This resource integrates rigorously preprocessed MPRA data from seminal studies, preserving experimental fidelity while providing:
- Consistent Formats: Ready-to-use sequences, activity scores, and metadata (CSV, FASTA, PyTorch).
- Reproducible Pipelines: Transparent preprocessing code with version-controlled dependencies.
- ML Compatibility: Structured for classification/regression tasks in frameworks like scikit-learn.
By eliminating data-wrangling barriers, MPRA-MNIST enables rapid algorithm validation—shifting focus from technical debt to biological discovery.
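As a rough illustration of what "ready-to-use sequences and activity scores" can look like on the consumer side, the sketch below parses a small in-memory CSV and one-hot encodes each sequence for a regression task. It uses only the standard library and does not call the MPRA-MNIST API. The column names `sequence` and `activity` and the sample rows are assumptions made for the example, not the actual file layout.

```python
import csv
import io

# Hypothetical column names and toy rows; the real files may differ.
SAMPLE_CSV = """sequence,activity
ACGTAC,1.25
TTGACA,-0.40
"""

ALPHABET = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Any other character, such as N, becomes an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


def load_records(handle):
    """Read (encoded sequence, activity) pairs from a CSV file handle."""
    reader = csv.DictReader(handle)
    return [(one_hot(row["sequence"]), float(row["activity"])) for row in reader]


records = load_records(io.StringIO(SAMPLE_CSV))
features, targets = zip(*records)
print(len(features), "sequences; first target:", targets[0])
```

The nested lists can be converted to a NumPy array or a PyTorch tensor before being passed to a model.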
- OS: Ubuntu 20.04.6 LTS x86_64
- CUDA: 12.6
- Python: 3.12.7
- PyTorch: 2.7.1+cu126
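To compare a local setup against the versions listed above, a short script along these lines may help. It is a minimal sketch: the helper name `describe_environment` is hypothetical, and PyTorch is only queried if it is already installed.

```python
import importlib.util
import platform


def describe_environment():
    """Collect the interpreter, OS, and (if installed) PyTorch details."""
    info = {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "torch": None,
        "cuda": None,
    }
    # Only import torch when it is actually installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None for CPU-only builds
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```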
- Clone the repository:

      git clone https://github.com/autosome-imtf/MPRA-MNIST
      cd MPRA-MNIST

- Create a virtual environment:

      conda create -n mpramnist python=3.12.7
      conda activate mpramnist
      pip install torch

- Install dependencies:

      pip install --upgrade pip
      pip install -r requirements.txt

- Install the package in editable mode (for development):

      pip install setuptools wheel
      python setup.py sdist bdist_wheel
      pip install -e .
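After the editable install, one way to confirm that the package is visible to the active environment is to query its metadata. This is a sketch only: the distribution name `mpramnist` is an assumption taken from the environment name above, and the name declared in `setup.py` may differ.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "mpramnist" is an assumed distribution name; check setup.py for the real one.
name = "mpramnist"
version = installed_version(name)
print(f"{name}: {version if version else 'not installed'}")
```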