ML pipeline to classify AI-generated vs natural phage genomes using chunked sequence features:
- simple per-chunk stats (length, GC%, longest homopolymer run)
- k-mer counts (k=4) → dimensionality reduction (TruncatedSVD)
- group-aware evaluation (GroupKFold) and a parent-level holdout split
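The featurisation above can be sketched roughly as follows. This is an illustrative, self-contained version, not the repo's actual API: the function names (`chunk_stats`, `kmer_counts`) and the dict keys are placeholders.

```python
from itertools import product

import numpy as np
from sklearn.decomposition import TruncatedSVD


def chunk_stats(seq: str) -> dict:
    """Simple per-chunk statistics: length, GC%, longest homopolymer run."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    longest, run = 1, 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return {"length": len(seq), "gc_pct": 100 * gc, "max_homopolymer": longest}


KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 tetramers
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}


def kmer_counts(seq: str, k: int = 4) -> np.ndarray:
    """Count overlapping k-mers; windows with ambiguous bases (e.g. N) are skipped."""
    counts = np.zeros(len(KMERS))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        idx = KMER_INDEX.get(seq[i : i + k])
        if idx is not None:
            counts[idx] += 1
    return counts


# Toy usage: featurise a few chunks, then reduce the 256-dim k-mer space.
chunks = ["ACGTACGTGGGCCC" * 10, "AAAATTTTCCCCGG" * 10, "ACGGCTAGCTAGGA" * 10]
X = np.vstack([kmer_counts(c) for c in chunks])
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)  # shape (3, 2)
```

In the actual pipeline these per-chunk features are computed for every chunk of every parent genome before modelling.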
.
├── data/
│   └── raw/
│       ├── phage_sequences_all.csv (AI-generated)
│       ├── phage_sequences_working.csv (AI-generated)
│       ├── sequences.fasta (natural)
│       └── sequences.csv (natural)
├── notebooks/ # Jupyter notebook (exploratory version of the pipeline)
├── scripts/ # CLI scripts for repeatable runs
├── src/phage_ai_detector/ # reusable library code
└── outputs/ # runtime artifacts
Note: the AI-generated sequences are intentionally left out of the repo.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Optional dependencies:

pip install xgboost shap

python scripts/run_grid_search.py --out outputs/chunk_grid.csv
# fast smoke test
python scripts/run_grid_search.py --max-chunks-per-parent 5 --chunk-max 800 --out outputs/chunk_grid_smoke.csv

python scripts/train_and_evaluate.py --chunk-size 2000 --model LogisticRegression --outdir outputs/run_lr
# fast smoke test
python scripts/train_and_evaluate.py --chunk-size 800 --max-chunks-per-parent 5 --outdir outputs/run_smoke

Outputs include metrics.json, test_predictions.csv, and a serialized model.joblib.
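The group-aware evaluation mentioned at the top ensures that chunks from the same parent genome never land in both train and test folds (otherwise the model could memorise parent-level signal and inflate scores). A minimal sketch with synthetic data, assuming per-parent labels and illustrative variable names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# 20 parent genomes, 10 chunks each; the label (AI vs natural) is per parent.
n_parents, chunks_per_parent = 20, 10
groups = np.repeat(np.arange(n_parents), chunks_per_parent)
y = np.repeat(rng.integers(0, 2, n_parents), chunks_per_parent)
# Synthetic features with a class-dependent shift, standing in for chunk features.
X = rng.normal(size=(n_parents * chunks_per_parent, 16)) + y[:, None]

# GroupKFold guarantees no parent's chunks appear in both train and test.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, groups=groups
)
print(scores.mean())
```

The same idea extends to the final parent-level holdout: pick a set of parent IDs, and hold out all of their chunks at once.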
The notebook under notebooks/ implements the same ML pipeline as the scripts and CLI, with identical logic, features, and models, but as a single self-contained, exploratory version (with some additional visualisation). The experimentation and validation were first done in this notebook in Google Colab, before the pipeline was automated and run locally via the scripts and CLI.