# ProtHyena

Welcome to the ProtHyena repo!

Credit: much of the code is forked and extended from HyenaDNA and Safari. A Colab notebook is provided for easy downstream inference.

## Dependencies

For this repo, let's start with the dependencies that are needed.
- clone the repo and cd into it:

```bash
git clone https://github.com/ZHymLumine/ProtHyena.git
cd ProtHyena
```

If the command fails, you may need to install Git LFS to clone the large files, or you can simply download the zip file.
- create a conda environment with Python 3.8:

```bash
conda create -n prothyena python=3.8
```

- The repo is developed with PyTorch 2.4, using CUDA 12.4:

```bash
conda install cuda -c nvidia/label/cuda-12.4.1
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
```
- install requirements:

```bash
pip install -r requirements.txt
```

- install Flash Attention; these notes will be helpful:

```bash
cd flash-attention
pip install -e . --no-build-isolation
```
## Pretraining

- to pretrain a ProtHyena model, run the following in the `ProtHyena` folder:

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/prot14m_hyena trainer.devices=1
```

Note: we have provided the pretrained checkpoint and dataset in the `checkpoint` and `data` folders in this repo for your convenience.
## Fine-tuning on provided tasks

- Download the checkpoint and put it into the `checkpoint` folder. Change the `pretrained_model_path` in the file `experiment/prot14m/{task}.yaml` to the correct path on your computer.
- Download the dataset (or use the dataset in the `data` folder). Change the `dest_path` in the file `dataset/{task}.yaml` to the correct path on your computer.
For specific tasks, run the command below:

- fluorescence

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/fluorescence trainer.devices=1
```

- stability

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/stability trainer.devices=1
```

- cleavage

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/cleavage trainer.devices=1
```

- disorder

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/disorder trainer.devices=1
```

- signal peptide

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/signalP trainer.devices=1
```

- solubility

```bash
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/solubility trainer.devices=1
```

You can change the batch size through the command line, e.g.

```bash
CUDA_VISIBLE_DEVICES=0 python -m train experiment=prot14m/stability trainer.devices=1 dataset.batch_size=128 dataset.batch_size_eval=128
```

or you can set these parameters in `configs/experiment/prot14m/{task}.yaml` for the specific task.
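For reference, the config-file equivalent of those overrides might look like the following hypothetical fragment of `configs/experiment/prot14m/{task}.yaml` (the key names simply mirror the `dataset.batch_size` and `dataset.batch_size_eval` command-line overrides above; the rest of the file is omitted):

```yaml
# Hypothetical fragment of configs/experiment/prot14m/{task}.yaml.
# Key names mirror the dataset.batch_size / dataset.batch_size_eval
# command-line overrides shown above.
dataset:
  batch_size: 128
  batch_size_eval: 128
```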
## Fine-tuning on a new task

To fine-tune on a new task, you need to create new configuration files in the `pipeline`, `experiment`, and `dataset` folders. You can follow the examples we provide in these folders.

For example, if you want to fine-tune on a task called fold_class (you can name it anything; here we use `{task_name}` as a placeholder), you need to create the following files:

- `experiment/prot14m/{task_name}.yaml`
- `pipeline/{task_name}.yaml`
- `dataset/{task_name}.yaml`
In `experiment/prot14m/{task_name}.yaml`:

- Change `/pipeline:` in the `defaults` section to `{task_name}`.
- Update `pretrained_model_path` to the correct path on your computer where the pretrained model is located.
- Optionally, update the `metrics` by checking the available ones in `src/tasks/metrics.py`, or create a new one.
In `pipeline/{task_name}.yaml`:

- Change `/dataset:` in the `defaults` section to `{task_name}`.
- If your task is at the protein sequence level (where a whole sequence gets a label), use:

```yaml
decoder:
  _name_: nd
  mode: pool
```

- If your task is at the residue level (where each amino acid has a label), use:

```yaml
decoder:
  _name_: token
```
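Putting the two changes together, a minimal hypothetical `pipeline/{task_name}.yaml` for a sequence-level task might look like this sketch (only the `defaults` entry and the `decoder` block come from the instructions above; a real pipeline file will contain additional task-specific keys):

```yaml
# Hypothetical minimal pipeline/{task_name}.yaml for a sequence-level task.
defaults:
  - /dataset: "{task_name}"  # point the pipeline at your new dataset config

decoder:
  _name_: nd   # sequence-level head
  mode: pool   # pool over residues to get one label per sequence
```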
In `dataset/{task_name}.yaml`:

- Set `_name_` and `dataset_name` to `{task_name}`.
- Set `dest_path` to the correct path where your data is stored.
- Set `train_len` to the number of training examples.
- Create `train.csv`, `valid.csv`, and `test.csv` files in the `dest_path` directory. These files should have two columns: `seq` (for the sequence) and `label` (for the label).
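As a sanity check, you can generate a toy CSV in the expected two-column format and read it back. A minimal standard-library sketch (the sequences, labels, and temporary `dest_path` are made-up examples):

```python
import csv
import os
import tempfile

# Write a toy train.csv with the expected "seq" and "label" columns
# (values are made-up examples, not real data).
dest_path = tempfile.mkdtemp()
rows = [("MKTAYIAKQR", 1), ("GAVLIPFW", 0)]

with open(os.path.join(dest_path, "train.csv"), "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["seq", "label"])
    writer.writerows(rows)

# Read it back the way a pandas-based Dataset would see it.
with open(os.path.join(dest_path, "train.csv")) as f:
    records = list(csv.DictReader(f))

print(len(records), records[0]["seq"], records[0]["label"])  # → 2 MKTAYIAKQR 1
```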
In `src/dataloaders/datasets/protein_bench_dataset.py`, create a new Dataset class, for example:

```python
import os

import pandas as pd
import torch
from torch.utils.data import Dataset


class SignalPeptideDataset(Dataset):
    def __init__(
        self,
        split,
        max_length,
        dataset_name="signalP",
        d_output=2,  # default: binary classification
        dest_path=None,
        tokenizer=None,
        tokenizer_name=None,
        use_padding=True,
        add_eos=False,
        rc_aug=False,
        return_augs=False,
        return_mask=False,
    ):
        self.split = split
        self.max_length = max_length
        self.use_padding = use_padding
        self.tokenizer_name = tokenizer_name
        self.tokenizer = tokenizer
        self.return_augs = return_augs
        self.add_eos = add_eos
        self.d_output = d_output  # needed for the decoder to grab
        self.rc_aug = rc_aug
        self.return_mask = return_mask

        csv_file = os.path.join(dest_path, f"{split}.csv")
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Columns follow the CSV format described above: "seq" and "label".
        sequence = self.data.loc[idx, "seq"]
        label = int(self.data.loc[idx, "label"])
        seq = self.tokenizer(
            sequence,
            add_special_tokens=self.add_eos,  # this is what controls adding eos
            padding="max_length" if self.use_padding else "do_not_pad",
            max_length=self.max_length,
            truncation=True,
        )
        seq_ids = torch.LongTensor(seq["input_ids"])
        target = torch.LongTensor([label])
        if self.return_mask:
            return seq_ids, target, {"mask": torch.BoolTensor(seq["attention_mask"])}
        return seq_ids, target
```
In `src/dataloaders/proteomics.py`, create a new dataloader class and import the Dataset class from `src.dataloaders.datasets.protein_bench_dataset`:

```python
from src.dataloaders.datasets.protein_bench_dataset import SignalPeptideDataset


class SignalPeptide(Prot14M):
    _name_ = "signalP"
    l_output = 0

    def __init__(self, dest_path=None, tokenizer_name=None, dataset_config_name=None, d_output=2, max_length=1024, rc_aug=False,
                 max_length_val=None, max_length_test=None, cache_dir=None, val_ratio=0.0005, val_split_seed=2357,
                 add_eos=True, detokenize=False, val_only=False, batch_size=32, batch_size_eval=None, num_workers=1,
                 shuffle=False, pin_memory=False, drop_last=False, fault_tolerant=False, ddp=False,
                 fast_forward_epochs=None, fast_forward_batches=None,
                 total_size=None, remove_tail_ends=False, cutoff_train=0.1, cutoff_test=0.2,
                 *args, **kwargs):
        self.dataset_config_name = dataset_config_name
        self.tokenizer_name = tokenizer_name
        self.rc_aug = rc_aug  # reverse-complement augmentation
        self.dest_path = dest_path
        self.d_output = d_output  # set this correctly
        ...

        # Create all splits as torch datasets
        self.dataset_train, self.dataset_val, self.dataset_test = [
            SignalPeptideDataset(split=split,
                                 max_length=max_len,
                                 dest_path=self.dest_path,
                                 d_output=self.d_output,
                                 tokenizer=self.tokenizer,  # pass the tokenizer wrapper
                                 tokenizer_name=self.tokenizer_name,
                                 add_eos=self.add_eos,
                                 rc_aug=self.rc_aug,
                                 )
            for split, max_len in zip(['train', 'valid', 'test'],
                                      [self.max_length, self.max_length_val, self.max_length_test])
        ]
```
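The list comprehension above builds all three splits in one pass. As a plain-Python illustration of the same idiom (with dicts standing in for the real Dataset objects, and made-up max-length values):

```python
# Sketch of the zip-based split construction used above, with dicts
# standing in for Dataset objects (lengths here are hypothetical).
max_length, max_length_val, max_length_test = 1024, 1024, 1024

train, val, test = [
    {"split": split, "max_length": max_len}
    for split, max_len in zip(
        ["train", "valid", "test"],          # one entry per CSV split
        [max_length, max_length_val, max_length_test],
    )
]

print(train["split"], val["split"], test["split"])  # → train valid test
```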
Make sure that `_name_` matches your specific `{task_name}`. Set `d_output` to the number of classes for multi-class datasets, and use `d_output = 1` for regression tasks.
## Inference

If you'd like to use our fine-tuned model for downstream analysis (inference), follow our Colab notebook. The notebook is fully integrated with Hugging Face and provides everything you need to:
- Load the model and fine-tuned weights.
- Run inference on new data.
- Extract embeddings from protein sequences.
This notebook serves as a self-contained environment to streamline your workflow for prediction and further analysis.
## Citation

Feel free to cite us if you find our work useful :)

```bibtex
@article{zhang2025hyena,
  title={Hyena architecture enables fast and efficient protein language modeling},
  author={Zhang, Yiming and Bian, Bian and Okumura, Manabu},
  journal={iMetaOmics},
  volume={2},
  number={1},
  pages={e45},
  year={2025},
  publisher={Wiley Online Library}
}
```