
MS2Int Logo


MS2Int leverages internal fragment ions to advance peptide tandem mass spectrum prediction

Python PyTorch Mamba GitHub Stars License: MIT

🌐 Web Server: ms2int.com

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Troubleshooting
  5. Citation
  6. License
  7. Contact

About The Project

MS2Int is a deep learning framework that integrates internal fragment ions to enable full-spectrum MS/MS intensity prediction. Built on a bidirectional Mamba state space backbone and trained with Virtual Adversarial Training (VAT), MS2Int jointly predicts terminal b/y ions and internal m ions. Trained on ~7.4 million precursors, MS2Int improves downstream proteomics workflows including DDA rescoring, DIA library search, HLA immunopeptidomics, and phosphosite localization.

MS2Int model overview

(back to top)

Applications

  • DDA rescoring: Prediction-assisted rescoring for data-dependent acquisition workflows.
  • DIA library search: Predicted spectra for spectral library construction and DIA searching.
  • HLA immunopeptidomics: Improved identification and rescoring for immunopeptidomics datasets.
  • Phosphosite localization: Phosphorylation-site localization support with an FLR QC pipeline.

Downstream applications of MS2Int

(back to top)


Getting Started

To get MS2Int up and running locally, follow these steps.

Prerequisites

Ensure you have the following before installation:

  • Python 3.10+
  • GPU with CUDA support (recommended for training and inference acceleration)
  • Conda or Miniconda

Dependencies

  • Python 3.10+
  • PyTorch 2.7+ (tested with 2.7.0; see environment.yml or Installation)
  • mamba-ssm, mamba-ssm2, h5py, numpy, pandas, tqdm, einops

Installation

From Scratch

The following steps reproduce a locally validated setup path: install causal-conv1d to enable the Mamba2 fast path, and on Blackwell GPUs upgrade/restore Triton afterwards.

Step 1: Create conda environment

conda create -n mamba_dev python=3.10 -y
conda activate mamba_dev

Step 2: Install PyTorch 2.7 (cu128)

pip install --no-cache-dir torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128

Step 3: Install core dependencies

pip install --no-cache-dir numpy ninja packaging

Step 4: Clone and build mamba_ssm

git clone https://github.com/state-spaces/mamba.git mamba_src
cd mamba_src
git -c safe.directory="$(pwd)" fetch --tags
git -c safe.directory="$(pwd)" checkout v2.3.0

export CUDA_HOME=/usr/local/cuda-12.8
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
pip install --no-cache-dir --no-build-isolation .

Step 5: Install causal-conv1d (enable Mamba2 fast path)

cd ..
git clone https://github.com/Dao-AILab/causal-conv1d.git causal_conv1d_src
cd causal_conv1d_src
git -c safe.directory="$(pwd)" checkout v1.6.0

# Force source build to avoid ABI mismatch between prebuilt wheels and the current torch version
export CAUSAL_CONV1D_FORCE_BUILD=TRUE
pip install --no-cache-dir --no-build-isolation .

Step 6: Upgrade/restore Triton (for Blackwell support)

# Note: installing causal-conv1d may pull triton back to torch-pinned 3.3.0; Blackwell requires a newer version
pip install --no-cache-dir --upgrade --force-reinstall triton==3.6.0

Step 7: Acceptance test

import torch
from mamba_ssm import Mamba2
batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba2(d_model=16, d_state=16, d_conv=4, expand=2, headdim=16, use_mem_eff_path=False).to("cuda")
y = model(x)
assert y.shape == x.shape
print("ACCEPTANCE PASS:", y.shape)

Alternative: Using Docker

If you prefer Docker:

Coming soon.

(back to top)


Usage

1) Inference (from MaxQuant)

Generate the MS2Int inference input H5 (including the train_data dataset) from MaxQuant msms.txt and the corresponding mzML files, then run the model to write predicted intensities into the Intpredict dataset.

Step 1: Generate inference input H5 from MaxQuant (data/msms.txt and data/mzml/)

python "spectrum_processing/unmodficaiton/run.py" \
  --msms "data/msms.txt" \
  --mzml-dir "data/mzml" \
  --output "data/MS2Int_input.h5"

Step 2: Run MS2Int inference (write predicted intensities into Intpredict)

python "MS2Int/predict.py" \
  --ckpt "/mnt/public/lcy/random/B512_L4_vat/model_epoch_99_val_loss_0.1618_0129_135924.pth" \
  --input "data/MS2Int_input.h5" \
  --output "data/MS2Int_input.h5"

Notes:

  • The Fragmentation field (HCD/CID) reported in MaxQuant's msms.txt may be incorrectly extracted, so we extract Fragmentation and collision_energy directly from the mzML files instead.
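After inference, the predicted intensities can be inspected with h5py. A minimal sketch; the dataset name Intpredict comes from the steps above, while the toy shapes and file name here are assumptions for illustration:

```python
import h5py
import numpy as np

# Create a toy H5 file standing in for data/MS2Int_input.h5
# (in practice this file is produced by run.py and predict.py above).
with h5py.File("toy_MS2Int.h5", "w") as f:
    # Hypothetical layout: one row of fragment intensities per precursor.
    f.create_dataset("Intpredict", data=np.zeros((4, 174), dtype=np.float32))

# Read the predictions back, as a downstream script would.
with h5py.File("toy_MS2Int.h5", "r") as f:
    pred = f["Intpredict"][:]

print(pred.shape)  # (4, 174)
```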

2) Inference (from CSV/TSV)

Step 1: Prepare CSV/TSV file with required columns

Demo input (data/demo_input.csv):

Sequence,Length,Charge,collision_energy,Fragmentation
PEPTIDEK,8,2,30,HCD
ALLS[Phospho]LATHK,10,3,27,HCD
[Acetyl]-M[Oxidation]AGLNK,6,2,30,CID
C[Carbamidomethyl]DEFGHIK,8,2,25,HCD
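The demo table above can also be generated and sanity-checked programmatically. A sketch using only the standard library; the column names are taken from the demo input, the validation step itself is a hypothetical convenience:

```python
import csv

# Required columns per the demo input above.
REQUIRED_COLUMNS = ["Sequence", "Length", "Charge", "collision_energy", "Fragmentation"]

rows = [
    ("PEPTIDEK", 8, 2, 30, "HCD"),
    ("C[Carbamidomethyl]DEFGHIK", 8, 2, 25, "HCD"),
]

with open("demo_input.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(REQUIRED_COLUMNS)
    writer.writerows(rows)

# Sanity-check the header before handing the file to predict.py.
with open("demo_input.csv", newline="") as f:
    header = next(csv.reader(f))
assert set(REQUIRED_COLUMNS) <= set(header)
print("demo_input.csv ready")
```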

Step 2: Run MS2Int inference (CSV input auto-converted to H5)

python "MS2Int/predict.py" \
  --ckpt "/mnt/public/lcy/random/B512_L4_vat/model_epoch_99_val_loss_0.1618_0129_135924.pth" \
  --input "data/demo_input.csv" \
  --output "data/demo_output.h5"

3) Training / Fine-tuning / PTM Fine-tuning

Prepare training/fine-tuning data (extract spectra):

Generate experimental fragment intensities from MaxQuant msms.txt and mzML:

python "spectrum_processing/unmodficaiton/run.py" \
  --msms "data/msms.txt" \
  --mzml-dir "data/mzml" \
  --mode unmodified \
  --output "data/training/train.h5"

Training from scratch:

python MS2Int/main.py --train_data_path "data/training/train.h5"

Fine-tuning:

python MS2Int/fine_tune.py \
  --pth "/mnt/public/lcy/random/B512_L4_vat/model_epoch_99_val_loss_0.1618_0129_135924.pth" \
  --train_data_path "data/training/train.h5" \
  --checkpoint_path "checkpoints/" \
  --log_path "logs/train.log"

4) Rescore

bash "spectrum_processing/rescore/run_pipeline.sh" \
  "/path/to/WORKDIR" \
  "/path/to/model.pth" \
  "mamba_dev"

Data directory structure:

data/
├── txt/
│   └── msms.txt
└── mzml/
    ├── raw1.mzML
    └── raw2.mzML
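Before launching the rescore pipeline, the WORKDIR layout can be validated against the tree shown above. A sketch; the helper is hypothetical and only mirrors the documented structure:

```python
from pathlib import Path
import tempfile

def check_workdir(workdir) -> list:
    """Report problems with the expected rescore layout (hypothetical helper;
    the layout mirrors the data/ tree shown above)."""
    root = Path(workdir)
    problems = []
    if not (root / "txt" / "msms.txt").is_file():
        problems.append("missing txt/msms.txt")
    if not list((root / "mzml").glob("*.mzML")):
        problems.append("no .mzML files in mzml/")
    return problems

# Build a toy WORKDIR and verify the checker accepts it.
tmp = Path(tempfile.mkdtemp())
(tmp / "txt").mkdir()
(tmp / "mzml").mkdir()
(tmp / "txt" / "msms.txt").write_text("")
(tmp / "mzml" / "raw1.mzML").write_text("")
print(check_workdir(tmp))  # []
```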

Output is written to data/rescore/; final mokapot results are in data/rescore/mokapot/.

5) MS2Int_flr

PTM site localization quality control pipeline based on target-decoy spectral similarity and False Localization Rate (FLR) estimation.

Data directory structure (data/MS2Int_flr/):

data/MS2Int_flr/
├── txt/
│   ├── msms.txt                    # MaxQuant search results
│   └── Phospho (STY)Sites.txt      # MaxQuant phosphosite table
└── mzml/
    └── raw1.mzML                   # Raw spectral data

Run the full pipeline:

bash MS2Int_FLR/run_pipeline.sh data/MS2Int_flr

Output files (in data/MS2Int_flr/output/):

output/
├── unique_psm.csv       # Unique PSMs
└── phosphosites.csv     # Final phosphosites at FLR cutoff

Model Weights & Data

Pre-trained model weights are available for download:

| Model | Description | Download |
| --- | --- | --- |
| MS2Int (Unmodified) | Trained on unmodified peptides (HCD/CID) | Google Drive |
| MS2Int (Phosphorylation) | Fine-tuned for phosphopeptides | Google Drive |

(back to top)


Troubleshooting

Common Issues

Environment / Dependencies

  • mamba-ssm / causal-conv1d build issues: If compilation fails with SSLEOFError or network timeouts, rerun the same pip install command; pip will retry the download. If you need causal-conv1d (to enable the Mamba2 fast path), compile and install it from source as in Step 5 of Installation (CAUSAL_CONV1D_FORCE_BUILD=TRUE plus --no-build-isolation) to avoid an ABI mismatch between prebuilt wheels and the current torch version. Note that installing it may downgrade triton back to the torch-pinned 3.3.0; Blackwell GPUs require reinstalling triton==3.6.0. You can also skip causal-conv1d entirely, since MS2Int is compatible with use_mem_eff_path=False.

  • Git safe.directory error: If you see fatal: detected dubious ownership in repository, use git -c safe.directory="$(pwd)" prefix for fetch/checkout commands, or add the directory to git safe directories.

  • Mamba2 headdim assertion error: If you get AssertionError: assert self.d_ssm % self.headdim == 0, ensure headdim divides d_inner evenly. For d_model=16, expand=2, use headdim=16 (since d_inner = 32).
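The divisibility constraint can be checked before constructing the model. A pure-arithmetic sketch; the variable names mirror the Mamba2 arguments used in the acceptance test:

```python
def valid_headdim(d_model: int, expand: int, headdim: int) -> bool:
    """Mamba2 requires headdim to divide d_inner = d_model * expand."""
    d_inner = d_model * expand
    return d_inner % headdim == 0

# The acceptance-test configuration from Step 7: d_model=16, expand=2 -> d_inner=32.
print(valid_headdim(16, 2, 16))  # True  (32 % 16 == 0)
print(valid_headdim(16, 2, 24))  # False (32 % 24 != 0)
```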

  • Triton not supported on Blackwell: If you see an error like 'computeCapability not supported' failed with target=cuda:120, upgrade Triton: pip install --upgrade --force-reinstall triton==3.6.0. This conflicts with PyTorch 2.7's pinned triton==3.3.0 dependency but is necessary on Blackwell GPUs.

  • CUDA version mismatch: When both CUDA 12.8 and 13.0 exist, explicitly set CUDA_HOME=/usr/local/cuda-12.8 (PyTorch 2.7 wheels use CUDA 12.8) before building mamba_ssm.

  • FLR pipeline Step3: Requires pyopenms, install separately via conda install -c conda-forge pyopenms.

Inference / Training

  • All-zero intensities: If a sample's predicted intensities are all zero during inference, the peptide most likely exceeds the maximum supported length (30 amino acids).

  • Out of Memory (OOM): Reduce batch size.
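Inputs can be pre-filtered against the 30-residue limit before inference. A sketch that strips bracketed modification tags before counting residues; the tag syntax follows the demo CSV above, while the helper itself is hypothetical:

```python
import re

MAX_LEN = 30  # maximum supported peptide length

def residue_count(sequence: str) -> int:
    """Count amino acids, ignoring bracketed PTM tags like [Phospho] and the
    N-terminal [Acetyl]- prefix used in the demo CSV."""
    bare = re.sub(r"\[[^\]]*\]-?", "", sequence)
    return len(bare)

peptides = ["PEPTIDEK", "[Acetyl]-M[Oxidation]AGLNK", "A" * 35]
kept = [p for p in peptides if residue_count(p) <= MAX_LEN]
print(kept)  # drops the 35-residue peptide
```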

(back to top)


Citation

Coming soon. The manuscript is currently under review. Citation information will be updated upon publication.

(back to top)


License

Distributed under the MIT License. See LICENSE for more information.

(back to top)


Contact

Project Link: https://github.com/cliangX/MS2Int

(back to top)