ProtHyena

Important links:

Intro

Welcome to the ProtHyena repo!

Credit: much of the code is forked and extended from HyenaDNA and Safari.

Dependencies

For this repo, let's start with the dependencies that are needed.

  • clone the repo and cd into it
git clone https://github.com/ZHymLumine/ProtHyena.git

If the command fails, you may need to install git lfs to clone the large files, or you can just download the zip file.

  • create a conda environment with Python 3.8
conda create -n mRNA-hyena python=3.8
  • the repo is developed with PyTorch 2.4 and CUDA 12.4
conda install cuda -c nvidia/label/cuda-12.4.1
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
  • install requirements:
pip install -r requirements.txt
  • install Flash Attention; these notes will be helpful.
cd ProtHyena
cd flash-attention
pip install -e . --no-build-isolation

Pretrain

  • to pretrain a ProtHyena model, run one of the following from the ProtHyena folder (the first uses the protein config, the second the mRNA config):
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/prot14m_hyena trainer.devices=1
CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=mRNA/mRNA_hyena trainer.devices=1

Fine-tuning

Note: we have provided the pretrained checkpoint and dataset in the checkpoint and data folders in this repo for your convenience.

  1. Download the checkpoint and put it into the checkpoint folder. Change pretrained_model_path in experiment/prot14m/{task}.yaml to the correct path on your computer (see the config sketch after this list).

  2. Download the dataset (or use the dataset in the data folder). Change dest_path in dataset/{task}.yaml to the correct path on your computer.

  3. For specific tasks, run the command below:

    • fluorescence
    CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/fluorescence trainer.devices=1
    
    • stability
    CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/stability trainer.devices=1
    
    • cleavage
    CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/cleavage trainer.devices=1
    
    • disorder
    CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/disorder trainer.devices=1
    
    • signal peptide
    CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/signalP trainer.devices=1
    
    • solubility
    CUDA_VISIBLE_DEVICES=0 python -m train wandb=null experiment=prot14m/solubility trainer.devices=1
    

    You can change the batch size on the command line, e.g.

    CUDA_VISIBLE_DEVICES=0 python -m train experiment=prot14m/stability trainer.devices=1 dataset.batch_size=128 dataset.batch_size_eval=128
    

    or you can set these parameters in configs/experiment/prot14m/{task}.yaml for a specific task.
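
For reference, the path edits in steps 1 and 2 boil down to pointing two keys at locations on your machine. The sketch below is illustrative only: the key names are the ones mentioned above, both paths are placeholders, and the surrounding structure should follow the existing task configs.

# in experiment/prot14m/{task}.yaml
pretrained_model_path: /path/to/ProtHyena/checkpoint/last.ckpt   # placeholder checkpoint path

# in dataset/{task}.yaml
dest_path: /path/to/ProtHyena/data/{task}/                       # placeholder data path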

Fine-tune on a new downstream task

To fine-tune on a new task, you need to create new configuration files in the pipeline, experiment, and dataset folders. You can follow the examples we provide in these folders.

For example, if you want to fine-tune a task called fold_class (you can name it anything; here we use {task_name} as a placeholder), you need to create the following files:

  • experiment/prot14m/{task_name}.yaml
  • pipeline/{task_name}.yaml
  • dataset/{task_name}.yaml

In experiment/prot14m/{task_name}.yaml:

  1. Change /pipeline: in the defaults section to {task_name}.
  2. Update pretrained_model_path to the correct path on your computer where the pretrained model is located.
  3. Optionally, update the metrics by checking the available ones in src/tasks/metrics.py, or create a new one.
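
Putting these three points together, a minimal sketch of experiment/prot14m/{task_name}.yaml might look like the following. Only the keys discussed above are shown; the remaining settings (model, trainer, and so on, and the exact place where the metrics are configured) should be copied from an existing task config such as experiment/prot14m/solubility.yaml, and the path below is a placeholder.

defaults:
  - /pipeline: fold_class   # your {task_name}

pretrained_model_path: /path/to/ProtHyena/checkpoint/last.ckpt   # placeholder path

# metrics: choose from src/tasks/metrics.py, keeping the same placement as in the existing configs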

In pipeline/{task_name}.yaml:

  1. Change /dataset: in the defaults section to {task_name}.
  2. If your task is at the protein sequence level (where a whole sequence gets a label), use:
    decoder:
      _name_: nd
      mode: pool
    
  3. If your task is at the residue level (where each amino acid has a label), use:
    decoder:
      _name_: token
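
Combined, a minimal pipeline/{task_name}.yaml might look like the sketch below (fold_class is just the example name from above); any other keys defined by the existing pipeline configs should be copied over from one of the files already in the pipeline folder.

defaults:
  - /dataset: fold_class    # your {task_name}

decoder:
  _name_: nd                # sequence-level task; use _name_: token instead for residue-level tasks
  mode: pool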
    

In dataset/{task_name}.yaml:

  1. Set _name_ and dataset_name to {task_name}.
  2. Set dest_path to the correct path where your data is stored.
  3. Set train_len to the number of training examples.
  4. Create train.csv, valid.csv, and test.csv files in the dest_path directory. These files should have two columns: seq (for the sequence) and label (for the label).
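
A minimal sketch of dataset/{task_name}.yaml following these four points (all values are placeholders; any additional keys should mirror one of the existing dataset configs):

_name_: fold_class
dataset_name: fold_class
dest_path: /path/to/ProtHyena/data/fold_class/   # placeholder path containing train.csv, valid.csv, test.csv
train_len: 10000                                 # placeholder: number of rows in train.csv

# each CSV has two columns, for example:
#   seq,label
#   MVLSPADKTNVKAAW,0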

In src/dataloaders/datasets/protein_bench_dataset.py, create a new Dataset class, for example:

import os

import pandas as pd
import torch
from torch.utils.data import Dataset


class SignalPeptideDataset(Dataset):
    def __init__(
        self,
        split,
        max_length,
        dataset_name="signalP",
        d_output=2, # default binary classification
        dest_path=None,
        tokenizer=None,
        tokenizer_name=None,
        use_padding=True,
        add_eos=False,
        rc_aug=False,
        return_augs=False,
        return_mask=False,
    ):

        self.split = split
        self.max_length = max_length
        self.use_padding = use_padding
        self.tokenizer_name = tokenizer_name
        self.tokenizer = tokenizer
        self.return_augs = return_augs
        self.add_eos = add_eos
        self.d_output = d_output  # needed for decoder to grab
        self.rc_aug = rc_aug
        self.return_mask = return_mask

        # base_path = Path(dest_path)  / split
        csv_file = os.path.join(dest_path, f"{split}.csv")
        self.data = pd.read_csv(csv_file)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # note: this indexing assumes the label is in the first CSV column and the
        # sequence in the second; adjust the indices (or index by column name, e.g.
        # self.data.iloc[idx]["seq"]) to match the layout of your CSV files
        sequence = self.data.iloc[idx, 1]
        label = int(self.data.iloc[idx, 0])

        seq = self.tokenizer(sequence,
            add_special_tokens=True if self.add_eos else False,  # this is what controls adding eos
            padding="max_length" if self.use_padding else "do_not_pad",
            max_length=self.max_length,
            truncation=True,
        )
        seq_ids = seq["input_ids"]  # get input_ids

        seq_ids = torch.LongTensor(seq_ids)

        target = torch.LongTensor([label])  # class label as a length-1 tensor

        if self.return_mask:
            return seq_ids, target, {'mask': torch.BoolTensor(seq['attention_mask'])}
        else:
            return seq_ids, target

In src/dataloaders/proteomics.py, create a new dataloader class and import the Dataset class from src.dataloaders.datasets.protein_bench_dataset:

from src.dataloaders.datasets.protein_bench_dataset import SignalPeptideDataset

class SignalPeptide(Prot14M):
    _name_ = "signalP"
    l_output = 0

    def __init__(self, dest_path=None, tokenizer_name=None, dataset_config_name=None, d_output=2, max_length=1024, rc_aug=False,
                 max_length_val=None, max_length_test=None, cache_dir=None, val_ratio=0.0005, val_split_seed=2357,
                 add_eos=True, detokenize=False, val_only=False, batch_size=32, batch_size_eval=None, num_workers=1,
                 shuffle=False, pin_memory=False, drop_last=False, fault_tolerant=False, ddp=False,
                 fast_forward_epochs=None, fast_forward_batches=None,
                total_size=None, remove_tail_ends=False, cutoff_train=0.1, cutoff_test=0.2,
                 *args, **kwargs):
        self.dataset_config_name = dataset_config_name
        self.tokenizer_name = tokenizer_name
        self.rc_aug = rc_aug  # reverse complement augmentation
        self.dest_path = dest_path
        self.d_output = d_output  # Set this correct

		...

        # Create all splits: torch datasets
        self.dataset_train, self.dataset_val, self.dataset_test = [
            SignalPeptideDataset(split=split,
                            max_length=max_len,
                            dest_path=self.dest_path,
                            d_output=self.d_output,
                            tokenizer=self.tokenizer,  # pass the tokenize wrapper
                            tokenizer_name=self.tokenizer_name,
                            add_eos=self.add_eos,
                            rc_aug=self.rc_aug,
                            )
            # note: as written this reuses the 'test' split for validation; change the second
            # entry to 'valid' if you created a separate valid.csv
            for split, max_len in zip(['train', 'test', 'test'], [self.max_length, self.max_length_val, self.max_length_test])
        ]
        return

Make sure that the _name_ matches your specific {task_name}. Set d_output to the number of classes for multi-class datasets, and use d_output = 1 for regression tasks.

Downstream Inference

If you'd like to use our fine-tuned model for downstream analysis (inference), follow our Colab notebook. The notebook is fully integrated with Hugging Face and provides everything you need to:

  • Load the model and fine-tuned weights.
  • Run inference on new data.
  • Extract embeddings from protein sequences.

This notebook serves as a self-contained environment to streamline your workflow for prediction and further analysis.

Citation

Feel free to cite us if you find our work useful :)

@article{zhang2025hyena,
  title={Hyena architecture enables fast and efficient protein language modeling},
  author={Zhang, Yiming and Bian, Bian and Okumura, Manabu},
  journal={IMetaOmics},
  volume={2},
  number={1},
  pages={e45},
  year={2025},
  publisher={Wiley Online Library}
}
