This repository provides a lightweight, fully documented baseline for the CAFA 6 protein function prediction challenge. The goal is to map amino acid sequences to Gene Ontology (GO) terms. The supplied code favors clarity and reproducibility so that it can serve as a starting point for further experimentation.
- Pure Python / scikit-learn implementation with no specialized hardware requirements.
- Simple amino-acid composition features that work for sequences of arbitrary length.
- End-to-end command line tools for training and inference.
- Predicts calibrated probabilities that can be converted directly into the submission TSV format.
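The composition features mentioned above reduce each sequence to the relative frequency of the 20 standard amino acids, so every sequence maps to a fixed-length vector regardless of its length. The actual implementation lives in `features.py` and may differ in detail; a minimal sketch could look like this (`composition_features` is an illustrative name, not necessarily the one used in the code):

```python
import numpy as np

# The 20 standard amino acids; non-standard residues (X, B, Z, ...) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence: str) -> np.ndarray:
    """Return a 20-dimensional vector of amino-acid frequencies.

    Counts are normalized by the number of recognized residues, so the
    feature vector has the same scale for sequences of any length.
    """
    counts = np.zeros(len(AMINO_ACIDS), dtype=float)
    for residue in sequence.upper():
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

For a non-empty sequence the vector sums to 1, which keeps the downstream linear model insensitive to sequence length.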
```
.
├── README.md
├── requirements.txt
├── src/
│   └── protein_function/
│       ├── __init__.py
│       ├── data.py
│       ├── features.py
│       ├── model.py
│       ├── predict.py
│       └── train.py
└── tests/          # optional, add your own unit tests here
```
Create a virtual environment (recommended) and install the dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The baseline expects two kinds of input files:
- Training FASTA file – Contains amino acid sequences whose functions are known.
- Annotation CSV file – Contains two columns, `sequence_id` and `go_term`. Each row associates a sequence with a GO term. Sequences may appear multiple times, once per GO term.
At inference time you only need a FASTA file with the sequences to annotate.
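The two input formats above are simple enough to parse with the standard library. The real loaders live in `data.py` and may be implemented differently; the sketch below (with illustrative function names `read_fasta` and `read_annotations`) shows one way to read them:

```python
import csv
from collections import defaultdict

def read_fasta(path):
    """Yield (sequence_id, sequence) pairs from a FASTA file."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # The identifier is the first whitespace-separated token.
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
        if seq_id is not None:
            yield seq_id, "".join(chunks)

def read_annotations(path):
    """Map each sequence_id to the set of its GO terms."""
    terms = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            terms[row["sequence_id"]].add(row["go_term"])
    return terms
```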
```bash
python -m protein_function.train \
    --fasta data/train_sequences.fasta \
    --annotations data/train_annotations.csv \
    --output models/baseline.joblib
```

The command:
- Parses the FASTA file and annotations.
- Extracts amino-acid composition features.
- Trains a multi-label logistic regression model.
- Saves the fitted model (classifier + label encoder) into a `.joblib` bundle.
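The core of steps 3 and 4 can be sketched with standard scikit-learn components. The actual `train.py` may differ in model choice and bundle keys; `train_baseline` and the `"classifier"`/`"binarizer"` keys below are illustrative assumptions:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def train_baseline(X, go_term_lists, output_path):
    """Fit a one-vs-rest logistic regression and save it with its label encoder.

    X: feature matrix (n_sequences, n_features)
    go_term_lists: per-sequence lists of GO terms, aligned with the rows of X
    """
    # Turn variable-length GO term lists into a binary indicator matrix.
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(go_term_lists)
    # One independent logistic regression per GO term.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X, Y)
    # Bundle classifier and label encoder together for inference.
    joblib.dump({"classifier": clf, "binarizer": binarizer}, output_path)
    return clf, binarizer
```

Keeping the classifier and binarizer in one bundle means the prediction script can recover the GO term corresponding to each output column without extra files.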
```bash
python -m protein_function.predict \
    --model models/baseline.joblib \
    --fasta data/test_sequences.fasta \
    --output predictions.tsv \
    --top-k 25
```

The prediction script outputs a tab-separated file that follows the competition submission format:
```
sequence_id\tGO:0000001\t0.123
sequence_id\tGO:0000002\t0.056
...
```
You can control the number of GO terms emitted per sequence via `--top-k`, or specify a probability threshold with `--min-proba`.
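The top-k filtering behind those two options amounts to sorting each sequence's predicted probabilities and keeping the highest-scoring terms above the threshold. A minimal sketch (the function name `write_topk_predictions` is illustrative; the real logic lives in `predict.py`):

```python
import numpy as np

def write_topk_predictions(sequence_ids, probabilities, go_terms, output_path,
                           top_k=25, min_proba=0.0):
    """Write a submission-style TSV: sequence_id, GO term, probability.

    probabilities: array of shape (n_sequences, n_go_terms)
    go_terms: GO term labels aligned with the probability columns
    """
    with open(output_path, "w") as out:
        for seq_id, probs in zip(sequence_ids, probabilities):
            # Column indices of the top_k highest scores, best first.
            order = np.argsort(probs)[::-1][:top_k]
            for idx in order:
                if probs[idx] < min_proba:
                    break  # remaining scores are even lower
                out.write(f"{seq_id}\t{go_terms[idx]}\t{probs[idx]:.3f}\n")
```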
- Swap the simple feature extractor with a deep representation (e.g., ESM embeddings).
- Replace the linear classifier with gradient boosted trees or neural networks.
- Add ontology-aware post-processing, such as propagating scores to parent GO terms.
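The last idea exploits the GO consistency rule: an annotation implies all of its ancestors, so a parent term's score should be at least the maximum score of its descendants. This post-processing is not part of the baseline; a minimal sketch, assuming you have already extracted a child-to-parents mapping from the GO graph, could look like:

```python
def propagate_scores(scores, parents):
    """Propagate prediction scores up the GO hierarchy.

    scores: dict mapping GO term -> predicted probability
    parents: dict mapping GO term -> iterable of direct parent terms
    Returns a new dict in which every parent scores at least as high
    as any of its descendants.
    """
    propagated = dict(scores)
    # Iterate to a fixed point; this handles multi-level ancestry
    # without requiring a topological sort of the ontology.
    changed = True
    while changed:
        changed = False
        for term, score in list(propagated.items()):
            for parent in parents.get(term, ()):
                if propagated.get(parent, 0.0) < score:
                    propagated[parent] = score
                    changed = True
    return propagated
```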
This project is released under the MIT License. See LICENSE for details.