MS2Int leverages internal fragment ions to advance peptide tandem mass spectrum prediction
MS2Int is a deep learning framework that integrates internal fragment ions to enable full-spectrum MS/MS intensity prediction. Built on a bidirectional Mamba state space backbone and trained with Virtual Adversarial Training (VAT), MS2Int jointly predicts terminal b/y ions and internal m ions. Trained on ~7.4 million precursors, MS2Int improves downstream proteomics workflows including DDA rescoring, DIA library search, HLA immunopeptidomics, and phosphosite localization.
- DDA rescoring: Prediction-assisted rescoring for data-dependent acquisition workflows.
- DIA library search: Predicted spectra for spectral library construction and DIA searching.
- HLA immunopeptidomics: Improved identification and rescoring for immunopeptidomics datasets.
- Phosphosite localization: Phosphorylation-site localization support with an FLR QC pipeline.
To get MS2Int up and running locally, follow these steps.
Ensure you have the following before installation:
- Python 3.10+
- GPU with CUDA support (recommended for training and inference acceleration)
- Conda or Miniconda
- PyTorch 2.7+ (tested with 2.7.0; see environment.yml or Installation)
- mamba-ssm, mamba-ssm2, h5py, numpy, pandas, tqdm, einops
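As a quick sanity check of the requirements above, the following minimal sketch reports the Python version and any packages that cannot be found. The mapping from pip names to import names is an assumption (for example, mamba-ssm is assumed to import as mamba_ssm), and the helper names are illustrative only, not part of MS2Int.

```python
import sys
from importlib.util import find_spec

# Assumed mapping: pip package name -> importable module name.
REQUIRED_MODULES = {
    "torch": "torch",
    "mamba-ssm": "mamba_ssm",
    "h5py": "h5py",
    "numpy": "numpy",
    "pandas": "pandas",
    "tqdm": "tqdm",
    "einops": "einops",
}


def python_ok(minimum=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def missing_modules(modules=REQUIRED_MODULES):
    """Return the pip names whose import module cannot be located."""
    return sorted(pip for pip, mod in modules.items() if find_spec(mod) is None)


print("Python >= 3.10:", python_ok())
print("Missing packages:", missing_modules() or "none")
```

The check only confirms that the modules can be located; it does not verify versions or that the CUDA extensions were built correctly, which is what the acceptance test in Installation covers.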
The steps below follow a locally validated setup path: install causal-conv1d to enable the Mamba2 fast path, and on Blackwell GPUs upgrade or restore Triton afterwards.
Step 1: Create conda environment
conda create -n mamba_dev python=3.10 -y
conda activate mamba_dev
Step 2: Install PyTorch 2.7 (cu128)
pip install --no-cache-dir torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
Step 3: Install core dependencies
pip install --no-cache-dir numpy ninja packaging
Step 4: Clone and build mamba_ssm
git clone https://github.com/state-spaces/mamba.git mamba_src
cd mamba_src
git -c safe.directory="$(pwd)" fetch --tags
git -c safe.directory="$(pwd)" checkout v2.3.0
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
pip install --no-cache-dir --no-build-isolation .
Step 5: Install causal-conv1d (enable Mamba2 fast path)
cd ..
git clone https://github.com/Dao-AILab/causal-conv1d.git causal_conv1d_src
cd causal_conv1d_src
git -c safe.directory="$(pwd)" checkout v1.6.0
# Force source build to avoid ABI mismatch between prebuilt wheels and the current torch version
export CAUSAL_CONV1D_FORCE_BUILD=TRUE
pip install --no-cache-dir --no-build-isolation .
Step 6: Upgrade/restore Triton (for Blackwell support)
# Note: installing causal-conv1d may pull triton back to torch-pinned 3.3.0; Blackwell requires a newer version
pip install --no-cache-dir --upgrade --force-reinstall triton==3.6.0
Step 7: Acceptance test
import torch
from mamba_ssm import Mamba2
batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba2(d_model=16, d_state=16, d_conv=4, expand=2, headdim=16, use_mem_eff_path=False).to("cuda")
y = model(x)
assert y.shape == x.shape
print("ACCEPTANCE PASS:", y.shape)
If you prefer using Docker:
Coming soon.
Generate the MS2Int inference input H5 (including train_data) from MaxQuant msms.txt and the corresponding mzML, then use the model to write Intpredict.
Step 1: Generate inference input H5 from MaxQuant (data/msms.txt and data/mzml/)
python "spectrum_processing/unmodficaiton/run.py" \
--msms "data/msms.txt" \
--mzml-dir "data/mzml" \
--output "data/MS2Int_input.h5"
Step 2: Run MS2Int inference (write predicted intensities into Intpredict)
python "MS2Int/predict.py" \
--ckpt "/mnt/public/lcy/random/B512_L4_vat/model_epoch_99_val_loss_0.1618_0129_135924.pth" \
--input "data/MS2Int_input.h5" \
--output "data/MS2Int_input.h5"
Notes:
- The Fragmentation field (HCD/CID) reported in MaxQuant's msms.txt may be incorrectly extracted, so we extract Fragmentation and collision_energy directly from the mzML files instead.
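When inference is scripted over several files, the predict.py call from Step 2 can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of MS2Int: it uses only the --ckpt, --input and --output flags shown above, and it assumes that reusing the input path as the output (as Step 2 does) writes predictions back into the same H5 file.

```python
import shlex


def build_predict_command(ckpt, input_path, output_path=None,
                          script="MS2Int/predict.py"):
    """Build the argv list for predict.py from the flags shown in Step 2.

    If output_path is omitted, the input path is reused, mirroring Step 2
    where --input and --output point at the same H5 file.
    """
    output_path = output_path or input_path
    return [
        "python", script,
        "--ckpt", str(ckpt),
        "--input", str(input_path),
        "--output", str(output_path),
    ]


# Placeholder checkpoint path; substitute your downloaded weights.
cmd = build_predict_command("/path/to/model.pth", "data/MS2Int_input.h5")
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting problems with paths that contain spaces.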
Step 1: Prepare CSV/TSV file with required columns
Demo input (data/demo_input.csv):
Sequence,Length,Charge,collision_energy,Fragmentation
PEPTIDEK,8,2,30,HCD
ALLS[Phospho]LATHK,9,3,27,HCD
[Acetyl]-M[Oxidation]AGLNK,6,2,30,CID
C[Carbamidomethyl]DEFGHIK,8,2,25,HCD
Step 2: Run MS2Int inference (CSV input auto-converted to H5)
python "MS2Int/predict.py" \
--ckpt "/mnt/public/lcy/random/B512_L4_vat/model_epoch_99_val_loss_0.1618_0129_135924.pth" \
--input "data/demo_input.csv" \
--output "data/demo_output.h5"
Prepare training/fine-tuning data (extract spectra):
Generate experimental fragment intensities from MaxQuant msms.txt and mzML:
python "spectrum_processing/unmodficaiton/run.py" \
--msms "data/msms.txt" \
--mzml-dir "data/mzml" \
--mode unmodified \
--output "data/training/train.h5"
Training from scratch:
python MS2Int/main.py --train_data_path "data/training/train.h5"
Fine-tuning:
python MS2Int/fine_tune.py \
--pth "/mnt/public/lcy/random/B512_L4_vat/model_epoch_99_val_loss_0.1618_0129_135924.pth" \
--train_data_path "data/training/train.h5" \
--checkpoint_path "checkpoints/" \
--log_path "logs/train.log"
Run the rescoring pipeline:
bash "spectrum_processing/rescore/run_pipeline.sh" \
"/path/to/WORKDIR" \
"/path/to/model.pth" \
"mamba_dev"
Data directory structure:
data/
├── txt/
│   └── msms.txt
└── mzml/
    ├── raw1.mzML
    └── raw2.mzML
Output is written to data/rescore/; final mokapot results are in data/rescore/mokapot/.
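Before launching the pipeline, it can help to confirm that the working directory matches the layout above. This is a minimal sketch with a hypothetical helper name; it assumes only what the tree shows, namely txt/msms.txt plus at least one .mzML file under mzml/.

```python
from pathlib import Path


def check_rescore_layout(workdir):
    """Return a list of problems found in the expected txt/ + mzml/ layout."""
    root = Path(workdir)
    problems = []
    if not (root / "txt" / "msms.txt").is_file():
        problems.append("missing txt/msms.txt")
    mzml_dir = root / "mzml"
    # Compare the extension case-insensitively (.mzML, .mzml).
    mzml_files = (
        [p for p in mzml_dir.iterdir() if p.suffix.lower() == ".mzml"]
        if mzml_dir.is_dir()
        else []
    )
    if not mzml_files:
        problems.append("no .mzML files in mzml/")
    return problems


print(check_rescore_layout("data") or "layout OK")
```

An empty list means the layout looks as expected; the check does not inspect file contents.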
PTM site localization quality control pipeline based on target-decoy spectral similarity and False Localization Rate (FLR) estimation.
Data directory structure (data/MS2Int_flr/):
data/MS2Int_flr/
├── txt/
│   ├── msms.txt                 # MaxQuant search results
│   └── Phospho (STY)Sites.txt   # MaxQuant phosphosite table
└── mzml/
    └── raw1.mzML                # Raw spectral data
Run the full pipeline:
bash MS2Int_FLR/run_pipeline.sh data/MS2Int_flr
Output files (in data/MS2Int_flr/output/):
output/
├── unique_psm.csv # Unique PSMs
└── phosphosites.csv # Final phosphosites at FLR cutoff
Pre-trained model weights are available for download:
| Model | Description | Download |
|---|---|---|
| MS2Int (Unmodified) | Trained on unmodified peptides (HCD/CID) | Google Drive |
| MS2Int (Phosphorylation) | Fine-tuned for phosphopeptides | Google Drive |
Environment / Dependencies
- mamba-ssm / causal-conv1d build issues: If compilation fails with SSLEOFError or network timeouts, rerun the pip install command; pip will auto-retry. If you need causal-conv1d (to enable the Mamba2 fast path), compile and install it from source following Step 5 in Installation (CAUSAL_CONV1D_FORCE_BUILD=TRUE plus --no-build-isolation) to avoid an ABI mismatch between prebuilt wheels and the current torch version. Note that this may pull triton back to 3.3.0; Blackwell requires reinstalling triton==3.6.0. You can also skip causal-conv1d entirely (MS2Int is compatible with use_mem_eff_path=False).
- Git safe.directory error: If you see fatal: detected dubious ownership in repository, prefix the fetch/checkout commands with git -c safe.directory="$(pwd)", or add the directory to git's safe directories.
- Mamba2 headdim assertion error: If you get AssertionError: assert self.d_ssm % self.headdim == 0, ensure headdim divides d_inner evenly. For d_model=16, expand=2, use headdim=16 (since d_inner = 32).
- Triton not supported on Blackwell: If you see computeCapability not supported' failed with target=cuda:120, upgrade Triton: pip install --upgrade --force-reinstall triton==3.6.0. This conflicts with PyTorch's triton==3.3.0 dependency but is necessary for Blackwell GPUs.
- CUDA version mismatch: When both CUDA 12.8 and 13.0 are installed, explicitly set CUDA_HOME=/usr/local/cuda-12.8 (PyTorch 2.7 wheels use CUDA 12.8) before building mamba_ssm.
- FLR pipeline Step 3: Requires pyopenms; install it separately via conda install -c conda-forge pyopenms.
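The headdim rule above comes down to a divisibility check. The sketch below lists the values that satisfy it, assuming d_ssm equals d_inner = expand * d_model, as the d_model=16, expand=2 example implies. The function name is illustrative, and passing the assertion does not guarantee that the kernels support a given headdim.

```python
def valid_headdims(d_model, expand=2):
    """List every headdim that divides d_inner = expand * d_model."""
    d_inner = expand * d_model
    return [h for h in range(1, d_inner + 1) if d_inner % h == 0]


# d_model=16, expand=2 gives d_inner = 32
print(valid_headdims(16, 2))  # [1, 2, 4, 8, 16, 32]
```

For the acceptance test configuration, headdim=16 appears in this list, which is why the assertion passes there.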
Inference / Training
- All-zero intensities: During inference, samples whose predicted intensities are all zero likely exceed the maximum supported peptide length (≤30 amino acids).
- Out of Memory (OOM): Reduce batch size.
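The length limit noted above can be checked before inference so that over-long peptides are caught early. This is a minimal sketch, assuming modifications are written in square brackets as in the demo CSV (with an optional dash after an N-terminal modification); the helper names are hypothetical and not part of MS2Int.

```python
import re

MAX_LENGTH = 30  # maximum supported peptide length stated above

# A bracketed modification, optionally followed by the N-terminal dash.
MOD_PATTERN = re.compile(r"\[[^\]]*\]-?")


def residue_count(sequence):
    """Count amino acids after removing bracketed modifications."""
    return len(MOD_PATTERN.sub("", sequence))


def too_long(sequences, max_length=MAX_LENGTH):
    """Return the sequences whose residue count exceeds max_length."""
    return [s for s in sequences if residue_count(s) > max_length]


print(residue_count("PEPTIDEK"))                    # 8
print(residue_count("[Acetyl]-M[Oxidation]AGLNK"))  # 6
print(too_long(["PEPTIDEK", "A" * 31]))
```

Filtering such sequences out of the input avoids rows of all-zero predictions in the output.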
Coming soon. The manuscript is currently under review. Citation information will be updated upon publication.
Distributed under the MIT License. See LICENSE for more information.
Project Link: https://github.com/cliangX/MS2Int


