abby-ql/phage-ai-detector

Phage AI Detector

ML pipeline to classify AI-generated vs. natural phage genomes using chunked sequence features:

  • simple per-chunk stats (length, GC%, longest homopolymer run)
  • k-mer counts (k=4) → dimensionality reduction (TruncatedSVD)
  • group-aware evaluation (GroupKFold) and a parent-level holdout split
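The repo's actual feature code isn't shown here; as a minimal sketch of the feature families listed above (the helper names `chunk_stats` and `kmer_counts` are hypothetical, not taken from the package):

```python
from itertools import product
import numpy as np
from sklearn.decomposition import TruncatedSVD

def chunk_stats(seq: str) -> dict:
    """Simple per-chunk stats: length, GC%, longest homopolymer run."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    longest = run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return {"length": len(seq), "gc": gc, "max_homopolymer": longest}

def kmer_counts(seq: str, k: int = 4) -> np.ndarray:
    """Count overlapping k-mers against a fixed 4**k ACGT vocabulary."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases
            counts[j] += 1
    return counts

# Toy chunks; real chunks would come from sliding windows over genomes
chunks = ["ATGCGCGTTTTACGATGCATGC" * 10, "GGGCCCATATATTTGCAGT" * 12]
X = np.vstack([kmer_counts(c) for c in chunks])
# Reduce the 256-dimensional k-mer space with TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)
```

In practice each parent genome yields many chunks, so the stats and reduced k-mer components are concatenated per chunk into the model's feature matrix.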

Project layout

.
├── data/
│   └── raw/
│       ├── phage_sequences_all.csv (AI generated)
│       ├── phage_sequences_working.csv (AI generated)
│       ├── sequences.fasta (natural)
│       └── sequences.csv (natural)
├── notebooks/                  # exploratory Jupyter notebook
├── scripts/                    # CLI scripts for repeatable runs
├── src/phage_ai_detector/      # reusable library code
└── outputs/                    # runtime artifacts

Note: the AI-generated sequences are intentionally excluded from the repo.

Quickstart

1) Install

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Optional dependencies:

pip install xgboost shap

2) Run a chunk-size grid search (GroupKFold)

python scripts/run_grid_search.py --out outputs/chunk_grid.csv

# fast smoke test
python scripts/run_grid_search.py --max-chunks-per-parent 5 --chunk-max 800 --out outputs/chunk_grid_smoke.csv
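The point of the GroupKFold evaluation is that chunks from the same parent genome never land in both train and test folds, so the model can't score well just by memorizing a parent. A small illustration (the group labels are made up):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 6 chunks drawn from 3 parent genomes (hypothetical names)
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 0])
groups = np.array(["phageA", "phageA", "phageB", "phageB", "phageC", "phageC"])

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups):
    # Every parent genome is entirely in train or entirely in test
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

A plain KFold would leak near-duplicate chunks across the split and inflate cross-validation scores.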

3) Train & evaluate a baseline model on a parent-level holdout

python scripts/train_and_evaluate.py --chunk-size 2000 --model LogisticRegression --outdir outputs/run_lr

# fast smoke test
python scripts/train_and_evaluate.py --chunk-size 800 --max-chunks-per-parent 5 --outdir outputs/run_smoke

Outputs include metrics.json, test_predictions.csv, and a serialized model.joblib.
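The exact schema of metrics.json isn't documented here; as a sketch of how a run's artifacts could be written and reloaded for later scoring (the toy model and the "accuracy" key are placeholders, not the script's actual contents):

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for an --outdir such as outputs/run_lr
outdir = tempfile.mkdtemp()

# Placeholder model: a trivial 1-feature logistic regression
model = LogisticRegression().fit(np.array([[0.0], [1.0]]), np.array([0, 1]))
joblib.dump(model, os.path.join(outdir, "model.joblib"))
with open(os.path.join(outdir, "metrics.json"), "w") as f:
    json.dump({"accuracy": 1.0}, f)  # placeholder metric

# Reload the serialized model and score a new chunk feature vector
reloaded = joblib.load(os.path.join(outdir, "model.joblib"))
pred = reloaded.predict(np.array([[0.9]]))
```

The same pattern lets you score new sequences without retraining: load model.joblib, build the chunk features, and call predict.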

Notebooks

The notebook under notebooks/ implements the same pipeline as the scripts and CLI (same logic, features, and models) as a single self-contained, exploratory version with additional visualisation. Experimentation and validation were first done in the notebook on Google Colab before the pipeline was automated to run locally via the scripts and CLI.
