zjuKeLiu/RiFold

From Sentences to Sequences: Rethinking Languages in Biological System

Introduction

The paradigm of large language models in natural language processing (NLP) has also shown promise in modeling biological languages, including proteins, RNA, and DNA. Both the auto-regressive generation paradigm and evaluation metrics have been transferred from NLP to biological sequence modeling. However, the intrinsic structural correlations in natural and biological languages differ fundamentally. Therefore, we revisit the notion of language in biological systems to better understand how NLP successes can be effectively translated to biological domains. By treating the 3D structure of biomolecules as the semantic content of a sentence and accounting for the strong correlations between residues or bases, we highlight the importance of structural evaluation and demonstrate the applicability of the auto-regressive paradigm in biological language modeling.

TL;DR

  • Evaluation should rely on structural metrics: (1) scRMSD and scTM, and (2) structure energy.
  • Decoding (the generation mechanism) should account for the strong long-range correlations in the context.
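
As a rough illustration of the structural comparison behind a metric such as scRMSD, the sketch below computes the RMSD between two coordinate sets after optimal rigid superposition (the Kabsch algorithm). It is not the repository's calculate_rmsd_ours.py. The function name kabsch_rmsd and the choice of one representative atom per nucleotide are assumptions made for this example, and both structures are assumed to have the same number of points in the same order.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition.

    Hypothetical helper for illustration: P could hold one representative
    atom per nucleotide of the predicted fold, Q the same atoms of the
    reference structure, in matching order.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")

    # Remove translation by centering both point sets on their centroids.
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Rotate P onto Q and measure the remaining per-point deviation.
    P_rot = P_c @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Q_c) ** 2, axis=1))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    native = rng.normal(size=(20, 3))
    noisy = native + rng.normal(scale=0.5, size=native.shape)
    print(f"RMSD after superposition: {kabsch_rmsd(noisy, native):.3f}")
```

Because the superposition removes any global rotation and translation, a lower value means the two structures have a more similar shape, regardless of how each one is placed in space.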

Features

  • Evaluation metrics
    • Supports scRMSD, scTM, and structure energy as evaluation metrics.

[Figure: RiFold overview]
[Figure: Evaluation pipeline overview]

  • RiFold Model
    • Auto-regressive RNA Inverse Folding model.

Installation

To extract the eval.tar.gz file, run:

tar -xzvf eval.tar.gz

This produces the following directory layout:

Vis/
├── e2efold
├── trRosettaRNA_v1.1
└── RhoFold
    ├── infer.sh
    └── calculate_rmsd_ours.py

Here we use RhoFold as the model for the folding step; it can be replaced with either e2efold or trRosettaRNA_v1.1.

Usage

Both scripts are located in Vis/RhoFold. Run infer.sh to predict the folded structure of each RNA sequence and compute the energy of each structure. Then run calculate_rmsd_ours.py to measure the structural difference between the designed structure and the native structure.

bash infer.sh
python calculate_rmsd_ours.py

Citation

If you use RiFold or the evaluation pipeline in your research, please cite:

@misc{liu2025sentencessequencesrethinkinglanguages,
      title={From Sentences to Sequences: Rethinking Languages in Biological System}, 
      author={Ke Liu and Shuaike Shen and Hao Chen},
      year={2025},
      eprint={2507.00953},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM},
      url={https://arxiv.org/abs/2507.00953}, 
}
