This repository provides a lightweight, fully documented baseline for the CAFA 6 protein function prediction challenge. The goal is to map amino acid sequences to Gene Ontology (GO) terms. The supplied code favors clarity and reproducibility so that it can serve as a starting point for further experimentation.
- Pure Python / scikit-learn implementation with no specialized hardware requirements.
- Simple amino-acid composition features that work for sequences of arbitrary length.
- End-to-end command line tools for training and inference.
- Predicts calibrated probabilities that can be converted directly into the submission TSV format.
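The composition features mentioned above reduce each sequence to the relative frequency of the 20 standard amino acids, so every sequence maps to a fixed-length vector regardless of its length. The actual implementation lives in `features.py` and may differ in detail; a minimal sketch could look like this (`composition_features` is an illustrative name, not necessarily the one used in the code):

```python
import numpy as np

# The 20 standard amino acids; non-standard residues (X, B, Z, ...) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_features(sequence: str) -> np.ndarray:
    """Return a 20-dimensional vector of amino-acid frequencies.

    Counts are normalized by the number of recognized residues, so the
    feature vector has the same scale for sequences of any length.
    """
    counts = np.zeros(len(AMINO_ACIDS), dtype=float)
    for residue in sequence.upper():
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

For a non-empty sequence the vector sums to 1, which keeps the downstream linear model insensitive to sequence length.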
```
.
├── README.md
├── requirements.txt
├── src/
│   └── protein_function/
│       ├── __init__.py
│       ├── data.py
│       ├── features.py
│       ├── model.py
│       ├── predict.py
│       └── train.py
└── tests/          # optional, add your own unit tests here
```
Create a virtual environment (recommended) and install the dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

The baseline expects two kinds of input files:
- Training FASTA file – Contains amino acid sequences whose functions are known.
- Annotation CSV file – Contains two columns, `sequence_id` and `go_term`. Each row associates a sequence with a GO term. Sequences may appear multiple times, once per GO term.
At inference time you only need a FASTA file with the sequences to annotate.
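The two input formats above are simple enough to parse with the standard library. The real loaders live in `data.py` and may be implemented differently; the sketch below (with illustrative function names `read_fasta` and `read_annotations`) shows one way to read them:

```python
import csv
from collections import defaultdict

def read_fasta(path):
    """Yield (sequence_id, sequence) pairs from a FASTA file."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                # The identifier is the first whitespace-separated token.
                seq_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
        if seq_id is not None:
            yield seq_id, "".join(chunks)

def read_annotations(path):
    """Map each sequence_id to the set of its GO terms."""
    terms = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            terms[row["sequence_id"]].add(row["go_term"])
    return terms
```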
```bash
python -m protein_function.train \
    --fasta data/train_sequences.fasta \
    --annotations data/train_annotations.csv \
    --output models/baseline.joblib
```

The command:
- Parses the FASTA file and annotations.
- Extracts amino-acid composition features.
- Trains a multi-label logistic regression model.
- Saves the fitted model (classifier + label encoder) into a `.joblib` bundle.
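The core of steps 3 and 4 can be sketched with standard scikit-learn components. The actual `train.py` may differ in model choice and bundle keys; `train_baseline` and the `"classifier"`/`"binarizer"` keys below are illustrative assumptions:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def train_baseline(X, go_term_lists, output_path):
    """Fit a one-vs-rest logistic regression and save it with its label encoder.

    X: feature matrix (n_sequences, n_features)
    go_term_lists: per-sequence lists of GO terms, aligned with the rows of X
    """
    # Turn variable-length GO term lists into a binary indicator matrix.
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(go_term_lists)
    # One independent logistic regression per GO term.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X, Y)
    # Bundle classifier and label encoder together for inference.
    joblib.dump({"classifier": clf, "binarizer": binarizer}, output_path)
    return clf, binarizer
```

Keeping the classifier and binarizer in one bundle means the prediction script can recover the GO term corresponding to each output column without extra files.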
```bash
python -m protein_function.predict \
    --model models/baseline.joblib \
    --fasta data/test_sequences.fasta \
    --output predictions.tsv \
    --top-k 25
```

The prediction script outputs a tab-separated file that follows the competition submission format:
```
sequence_id\tGO:0000001\t0.123
sequence_id\tGO:0000002\t0.056
...
```
You can control the number of GO terms emitted per sequence via `--top-k`, or specify a probability threshold with `--min-proba`.
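The top-k filtering behind those two options amounts to sorting each sequence's predicted probabilities and keeping the highest-scoring terms above the threshold. A minimal sketch (the function name `write_topk_predictions` is illustrative; the real logic lives in `predict.py`):

```python
import numpy as np

def write_topk_predictions(sequence_ids, probabilities, go_terms, output_path,
                           top_k=25, min_proba=0.0):
    """Write a submission-style TSV: sequence_id, GO term, probability.

    probabilities: array of shape (n_sequences, n_go_terms)
    go_terms: GO term labels aligned with the probability columns
    """
    with open(output_path, "w") as out:
        for seq_id, probs in zip(sequence_ids, probabilities):
            # Column indices of the top_k highest scores, best first.
            order = np.argsort(probs)[::-1][:top_k]
            for idx in order:
                if probs[idx] < min_proba:
                    break  # remaining scores are even lower
                out.write(f"{seq_id}\t{go_terms[idx]}\t{probs[idx]:.3f}\n")
```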
- Swap the simple feature extractor with a deep representation (e.g., ESM embeddings).
- Replace the linear classifier with gradient boosted trees or neural networks.
- Add ontology-aware post-processing, such as propagating scores to parent GO terms.
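The last idea exploits the GO consistency rule: an annotation implies all of its ancestors, so a parent term's score should be at least the maximum score of its descendants. This post-processing is not part of the baseline; a minimal sketch, assuming you have already extracted a child-to-parents mapping from the GO graph, could look like:

```python
def propagate_scores(scores, parents):
    """Propagate prediction scores up the GO hierarchy.

    scores: dict mapping GO term -> predicted probability
    parents: dict mapping GO term -> iterable of direct parent terms
    Returns a new dict in which every parent scores at least as high
    as any of its descendants.
    """
    propagated = dict(scores)
    # Iterate to a fixed point; this handles multi-level ancestry
    # without requiring a topological sort of the ontology.
    changed = True
    while changed:
        changed = False
        for term, score in list(propagated.items()):
            for parent in parents.get(term, ()):
                if propagated.get(parent, 0.0) < score:
                    propagated[parent] = score
                    changed = True
    return propagated
```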
This project is released under the MIT License. See LICENSE for details.