Do-Novo: Observed Spectrum Driven Do-Sampling of Theoretical Fragments for De Novo Peptide Sequencing in DIA
Motivation: De novo peptide sequencing is an important prerequisite for downstream proteome analysis, enabling the discovery of biomarkers and therapeutic targets. While several deep learning models have attempted to improve architectural design to accommodate multiplexed data-independent acquisition (DIA) data, they overlook explicit modeling of intrinsic factors, such as fragment missingness and noise interference, that arise from the data generation process.
Results: We propose Do-Novo, which learns do-interventions over theoretical fragments for de novo peptide sequencing in DIA mass spectrometry by parameterizing fragment selection to complement raw spectrum decoding. Our key idea is to train a sampler that selectively identifies informative theoretical fragments directly from the observed spectrum in a two-stage training framework. Extensive experiments demonstrate that Do-Novo achieves state-of-the-art performance in de novo peptide sequencing across three DIA benchmark datasets under both feature-based and feature-free settings. Moreover, additional analyses show that Do-Novo generates biologically valid peptide sequences beyond those identifiable by database search.
This project was developed and tested using the following environment configuration
(as specified in environment.yml):
- Python: 3.11.13
- CUDA: 12.6
- torch: 2.8.0+cu126
- PyTorch Lightning: 2.5.5
- pyteomics: 4.7.5
- pyopenms: 3.3.0
- depthcharge-ms: 0.4.8
Note: These versions are not strict requirements. The project may work with other versions as well, but the above configuration is provided as a reference for reproducibility.
```bash
conda env create --file environment.yml
conda activate do_novo_env
```
Note: If you encounter issues installing CUDA or PyTorch due to version mismatches, please install PyTorch manually according to your server and CUDA setup. See the official PyTorch installation guide: https://pytorch.org/get-started/previous-versions/
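Once the environment is active, a quick sanity check that PyTorch and CUDA resolved as expected (a minimal sketch; the exact version strings may differ slightly on your machine):

```python
# Minimal environment check (illustrative; version strings may vary by build).
import torch

print(torch.__version__)          # expected: 2.8.0+cu126 (or your local build)
print(torch.version.cuda)         # expected: 12.6
print(torch.cuda.is_available())  # should be True on a GPU machine
```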
Our datasets include additional PTM annotations (e.g., pyro-Glu, C ammonia-loss) and
require a minor compatibility patch to depthcharge-ms==0.4.8 for correct ProForma parsing
and peptide detokenization.
Specifically, the following changes are required:
- Extend the PTM mapping in `depthcharge/primitives.py` to support the additional modifications.
- Fix the detokenization behavior in `depthcharge/tokenizers/peptides.py` to ensure correct sequence reconstruction.
These changes are necessary to avoid ProForma parsing errors (e.g., Missing Closing Tag)
when training or evaluating the model.
We provide a patch file that can be applied after installing the environment.
Run the following commands from the Do-Novo repository root:
```bash
# Locate the site-packages directory where `depthcharge` is installed
SITE_PACKAGES=$(python - << 'EOF'
import depthcharge, inspect, os
print(os.path.dirname(os.path.dirname(inspect.getfile(depthcharge))))
EOF
)

# Apply the patch
git -C "$SITE_PACKAGES" apply "$(pwd)/patches/depthcharge_ms.patch"
```

The patch is tested with `depthcharge-ms==0.4.8`. Other versions may require minor adjustments.
To verify that the patch was applied successfully:
python -c "import depthcharge.primitives as p; print('E-18.011' in p.MSKB_TO_UNIMOD)"If you prefer to apply the changes manually, please modify the following files in your Python environment:
-
depthcharge/primitives.py- Extend
MSKB_TO_UNIMODwith:"E-18.011": "E[pyro-Glu]""C-17.027": "C[Ammonia-loss]"
- Extend
-
depthcharge/tokenizers/peptides.py- Update
PeptideTokenizer.detokenize()to passjoin=jointosuper().detokenize()and remove the manual string join logic.
- Update
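For reference, a minimal sketch of what these manual edits amount to. Line placement and surrounding code are illustrative; only the two mapping entries and the `join=join` forwarding come from the patch description above.

```python
# --- depthcharge/primitives.py ---
# Append the two PTM entries used by our annotated datasets to the existing
# mapping (shown as a standalone dict here for illustration; in the real file
# they are added to the existing MSKB_TO_UNIMOD entries):
MSKB_TO_UNIMOD = {
    # ... existing entries ...
    "E-18.011": "E[pyro-Glu]",      # Glu -> pyro-Glu (water loss)
    "C-17.027": "C[Ammonia-loss]",  # ammonia loss on Cys
}

# --- depthcharge/tokenizers/peptides.py ---
# In PeptideTokenizer.detokenize(), drop the manual string-join logic and
# forward the `join` argument to the parent implementation, roughly:
#
#     def detokenize(self, tokens, *args, join=True, **kwargs):
#         return super().detokenize(tokens, *args, join=join, **kwargs)
```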
If you want to use the annotated datasets for training our model, download and extract them as follows:
```bash
pip install gdown
gdown https://drive.google.com/uc?id=1LElJGJ9q9y1Q_iyfvtvre4V10mGDO1Y0  # datasets (oc, uti, plasma)
tar -zxvf data.tar.gz
```
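Optionally, you can list what was extracted. This is a minimal sketch that assumes the archive unpacks into a `data/` directory under the repository root; adjust the path if your layout differs.

```python
# List the extracted dataset folders (the "data" path is an assumption).
from pathlib import Path

for entry in sorted(Path("data").iterdir()):
    print(entry)
```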
The easiest way to train a model is to specify a config file (e.g., `configs/train_oc_ump.yaml`) with data, model, and training hyperparameters:

```bash
python main.py --config-name train_oc_ump.yaml
```
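If you want to inspect or tweak the hyperparameters before launching a run, the config is plain YAML. A minimal sketch for peeking at it, assuming PyYAML is available in the environment (it typically is alongside PyTorch Lightning):

```python
# Print the top-level sections of a training config before launching a run.
import yaml

with open("configs/train_oc_ump.yaml") as f:
    cfg = yaml.safe_load(f)

print(sorted(cfg))  # e.g. the data / model / training hyperparameter groups
```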
This code includes modifications based on the Cascadia codebase. We are grateful to the authors for providing their code and models as open-source software.
| Name | Affiliation | Email |
|---|---|---|
| Seungheun Baek† | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | sheunbaek@korea.ac.kr |
| Yan Ting Chok | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | yanting1412@korea.ac.kr |
| Eunha Lee | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | eunhalee@korea.ac.kr |
| Jaewoo Kang* | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | kangj@korea.ac.kr |
- †: First Author
- *: Corresponding Author
