Do-Novo: Observed Spectrum Driven Do-Sampling of Theoretical Fragments for De Novo Peptide Sequencing in DIA
Motivation: De novo peptide sequencing is an important prerequisite for downstream proteome analysis, enabling the discovery of biomarkers and therapeutic targets. While several deep learning models have attempted to improve architectural design to accommodate multiplexed data-independent acquisition (DIA) data, they overlook explicit modeling of intrinsic factors, such as fragment missingness and noise interference, that arise from the data generation process.
Results: We propose Do-Novo, which learns do-interventions over theoretical fragments for de novo peptide sequencing in DIA mass spectrometry by parameterizing fragment selection to complement raw spectrum decoding. Our key idea is to train a sampler that selectively identifies informative theoretical fragments directly from the observed spectrum in a two-stage training framework. Extensive experiments demonstrate that Do-Novo achieves state-of-the-art performance in de novo peptide sequencing across three DIA benchmark datasets under both feature-based and feature-free settings. Moreover, additional analyses show that Do-Novo generates biologically valid peptide sequences beyond those identifiable by database search.
This project was developed and tested using the following environment configuration
(as specified in environment.yml):
- Python: 3.11.13
- CUDA: 12.6
- torch: 2.8.0+cu126
- PyTorch Lightning: 2.5.5
- pyteomics: 4.7.5
- pyopenms: 3.3.0
- depthcharge-ms: 0.4.8
Note: These versions are not strict requirements. The project may work with other versions as well, but the above configuration is provided as a reference for reproducibility.
```bash
conda env create --file environment.yml
conda activate do_novo_env
```
Note: If you encounter issues installing CUDA or PyTorch due to version mismatches, please install PyTorch manually according to your server and CUDA setup. See the official PyTorch installation guide: https://pytorch.org/get-started/previous-versions/
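Once the environment is active, a quick sanity check that PyTorch and CUDA resolved as expected (a minimal sketch; the exact version strings may differ slightly on your machine):

```python
# Minimal environment check (illustrative; version strings may vary by build).
import torch

print(torch.__version__)          # expected: 2.8.0+cu126 (or your local build)
print(torch.version.cuda)         # expected: 12.6
print(torch.cuda.is_available())  # should be True on a GPU machine
```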
Our datasets include additional PTM annotations (e.g., pyro-Glu, C ammonia-loss) and
require a minor compatibility patch to depthcharge-ms==0.4.8 for correct ProForma parsing
and peptide detokenization.
Specifically, the following changes are required:
- Extend the PTM mapping in `depthcharge/primitives.py` to support the additional modifications.
- Fix the detokenization behavior in `depthcharge/tokenizers/peptides.py` to ensure correct sequence reconstruction.
These changes are necessary to avoid ProForma parsing errors (e.g., Missing Closing Tag)
when training or evaluating the model.
We provide a patch file that can be applied after installing the environment.
Run the following commands from the Do-Novo repository root:
```bash
# Locate the site-packages directory where `depthcharge` is installed
SITE_PACKAGES=$(python - << 'EOF'
import depthcharge, inspect, os
print(os.path.dirname(os.path.dirname(inspect.getfile(depthcharge))))
EOF
)

# Apply the patch
git -C "$SITE_PACKAGES" apply "$(pwd)/patches/depthcharge_ms.patch"
```

The patch is tested with `depthcharge-ms==0.4.8`. Other versions may require minor adjustments.
To verify that the patch was applied successfully:
python -c "import depthcharge.primitives as p; print('E-18.011' in p.MSKB_TO_UNIMOD)"If you prefer to apply the changes manually, please modify the following files in your Python environment:
-
depthcharge/primitives.py- Extend
MSKB_TO_UNIMODwith:"E-18.011": "E[pyro-Glu]""C-17.027": "C[Ammonia-loss]"
- Extend
-
depthcharge/tokenizers/peptides.py- Update
PeptideTokenizer.detokenize()to passjoin=jointosuper().detokenize()and remove the manual string join logic.
- Update
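For reference, a minimal sketch of what these manual edits amount to. Line placement and surrounding code are illustrative; only the two mapping entries and the `join=join` forwarding come from the patch description above.

```python
# --- depthcharge/primitives.py ---
# Append the two PTM entries used by our annotated datasets to the existing
# mapping (shown as a standalone dict here for illustration; in the real file
# they are added to the existing MSKB_TO_UNIMOD entries):
MSKB_TO_UNIMOD = {
    # ... existing entries ...
    "E-18.011": "E[pyro-Glu]",      # Glu -> pyro-Glu (water loss)
    "C-17.027": "C[Ammonia-loss]",  # ammonia loss on Cys
}

# --- depthcharge/tokenizers/peptides.py ---
# In PeptideTokenizer.detokenize(), drop the manual string-join logic and
# forward the `join` argument to the parent implementation, roughly:
#
#     def detokenize(self, tokens, *args, join=True, **kwargs):
#         return super().detokenize(tokens, *args, join=join, **kwargs)
```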
If you want to use the annotated datasets for training our model, download and extract them as follows:
```bash
pip install gdown
gdown https://drive.google.com/uc?id=1LElJGJ9q9y1Q_iyfvtvre4V10mGDO1Y0  # datasets (oc, uti, plasma)
tar -zxvf data.tar.gz
```
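Optionally, you can list what was extracted. This is a minimal sketch that assumes the archive unpacks into a `data/` directory under the repository root; adjust the path if your layout differs.

```python
# List the extracted dataset folders (the "data" path is an assumption).
from pathlib import Path

for entry in sorted(Path("data").iterdir()):
    print(entry)
```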
The easiest way to train a model is to specify a config file (e.g., `configs/train_oc_ump.yaml`) with data, model, and training hyperparameters:

```bash
python main.py --config-name train_oc_ump.yaml
```
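If you want to inspect or tweak the hyperparameters before launching a run, the config is plain YAML. A minimal sketch for peeking at it, assuming PyYAML is available in the environment (it typically is alongside PyTorch Lightning):

```python
# Print the top-level sections of a training config before launching a run.
import yaml

with open("configs/train_oc_ump.yaml") as f:
    cfg = yaml.safe_load(f)

print(sorted(cfg))  # e.g. the data / model / training hyperparameter groups
```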
This code includes modifications based on the Cascadia codebase. We are grateful to the authors for providing their code and models as open-source software.
| Name | Affiliation | Email |
|---|---|---|
| Seungheun Baek† | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | sheunbaek@korea.ac.kr |
| Yan Ting Chok | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | yanting1412@korea.ac.kr |
| Eunha Lee | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | eunhalee@korea.ac.kr |
| Jaewoo Kang* | Data Mining and Information Systems Lab, Korea University, Seoul, South Korea | kangj@korea.ac.kr |
- †: First Author
- *: Corresponding Author
