ML pipeline to classify AI-generated vs natural phage genomes using chunked sequence features:
- simple per-chunk stats (length, GC%, longest homopolymer run)
- k-mer counts (k=4) → dimensionality reduction (TruncatedSVD)
- group-aware evaluation (GroupKFold) and a parent-level holdout split
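The featurisation above can be sketched roughly as follows. This is an illustrative, self-contained version, not the repo's actual API: the function names (`chunk_stats`, `kmer_counts`) and the dict keys are placeholders.

```python
from itertools import product

import numpy as np
from sklearn.decomposition import TruncatedSVD


def chunk_stats(seq: str) -> dict:
    """Simple per-chunk statistics: length, GC%, longest homopolymer run."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    longest, run = 1, 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return {"length": len(seq), "gc_pct": 100 * gc, "max_homopolymer": longest}


KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 tetramers
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}


def kmer_counts(seq: str, k: int = 4) -> np.ndarray:
    """Count overlapping k-mers; windows with ambiguous bases (e.g. N) are skipped."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        idx = KMER_INDEX.get(seq[i : i + k])
        if idx is not None:
            counts[idx] += 1
    return counts


# Toy usage: featurise a few chunks, then reduce the 256-dim k-mer space.
chunks = ["ACGTACGTGGGCCC" * 10, "AAAATTTTCCCCGG" * 10, "ACGGCTAGCTAGGA" * 10]
X = np.vstack([kmer_counts(c) for c in chunks])
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)  # shape (3, 2)
```

In the actual pipeline these per-chunk features are computed for every chunk of every parent genome before modelling.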
.
├── data/
│   └── raw/
│       ├── phage_sequences_all.csv (AI-generated)
│       ├── phage_sequences_working.csv (AI-generated)
│       ├── sequences.fasta (natural)
│       └── sequences.csv (natural)
├── notebooks/ # Jupyter notebook (exploratory version of the pipeline)
├── scripts/ # CLI scripts for repeatable runs
├── src/phage_ai_detector/ # reusable library code
└── outputs/ # runtime artifacts
Note: the AI-generated sequences are intentionally left out of the repo.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Optional dependencies:

pip install xgboost shap

python scripts/run_grid_search.py --out outputs/chunk_grid.csv
# fast smoke test
python scripts/run_grid_search.py --max-chunks-per-parent 5 --chunk-max 800 --out outputs/chunk_grid_smoke.csv

python scripts/train_and_evaluate.py --chunk-size 2000 --model LogisticRegression --outdir outputs/run_lr
# fast smoke test
python scripts/train_and_evaluate.py --chunk-size 800 --max-chunks-per-parent 5 --outdir outputs/run_smoke

Outputs include metrics.json, test_predictions.csv, and a serialized model.joblib.
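The group-aware evaluation mentioned at the top ensures that chunks from the same parent genome never land in both train and test folds (otherwise the model could memorise parent-level signal and inflate scores). A minimal sketch with synthetic data, assuming per-parent labels and illustrative variable names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# 20 parent genomes, 10 chunks each; the label (AI vs natural) is per parent.
n_parents, chunks_per_parent = 20, 10
groups = np.repeat(np.arange(n_parents), chunks_per_parent)
y = np.repeat(rng.integers(0, 2, n_parents), chunks_per_parent)
# Synthetic features with a class-dependent shift, standing in for chunk features.
X = rng.normal(size=(n_parents * chunks_per_parent, 16)) + y[:, None]

# GroupKFold guarantees no parent's chunks appear in both train and test.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, groups=groups
)
print(scores.mean())
```

The same idea extends to the final parent-level holdout: pick a set of parent IDs, and hold out all of their chunks at once.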
The notebook under notebooks/ implements the same ML pipeline as the scripts and CLI, with identical logic, features, and models, but as a single self-contained, exploratory version (with some additional visualisation). The experimentation and validation were first done in this notebook in Google Colab, before the pipeline was automated and run locally via the scripts and CLI.