AMP-DFM: Discrete Flow Matching for Multi-Property Antimicrobial Peptides

This repository contains the core code from my MSc thesis, which developed AMP-DFM, a generative model for designing antimicrobial peptides that balance multiple clinical properties.

The model uses discrete flow matching to generate realistic and diverse peptides, guided by classifiers for antimicrobial activity, haemolysis, and cytotoxicity. Generation is steered towards Pareto-optimal trade-offs to produce candidates more likely to succeed in clinical settings.

This repository contains only the main functions and results. Other parts of the analysis, such as peptide structure prediction, comparison with other generative models and classifiers, and data collation, are omitted.

Setup

Clone the repository, then create and activate the conda environment from the provided YAML file:

git clone https://github.com/kjayres/AMP-DFM
cd AMP-DFM
conda env create -f documentation/amp-dfm.yaml
conda activate amp-dfm

Analytical Pipeline

The model was developed through the following steps:

Data Preprocessing: Sequences are clustered using MMseqs2 at 80% identity to prevent data leakage. ESM-2 (650M) embeddings are used for classifier training. Sequences are tokenised for the generative models.

# Cluster sequences and assign train/val/test splits
python scripts/data_preprocessing/mmseqs_cluster.py
python scripts/data_preprocessing/assign_cluster_split.py

# Generate ESM-2 embeddings for classifier training
python scripts/data_preprocessing/generate_embeddings.py

# Prepare tokenised datasets for generative models
python scripts/data_preprocessing/prepare_ampdfm_uncond_dataset.py
python scripts/data_preprocessing/prepare_ampdfm_cond_dataset.py
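
Splitting by cluster rather than by individual sequence is what prevents leakage: near-identical peptides always end up in the same split. The following is a minimal sketch of such a cluster-level split. The function name, the 80/10/10 ratios, and the input format (a mapping from sequence ID to cluster ID) are assumptions made for illustration and are not taken from assign_cluster_split.py.

```python
import random
from collections import defaultdict

def assign_cluster_split(seq_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two splits.

    seq_to_cluster: dict mapping sequence ID -> cluster ID (assumed input format).
    fractions: share of clusters (not sequences) for train, val, and test.
    """
    clusters = defaultdict(list)
    for seq_id, cluster_id in seq_to_cluster.items():
        clusters[cluster_id].append(seq_id)

    # Shuffle cluster IDs reproducibly, then cut the shuffled list into three parts.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_train = int(fractions[0] * len(cluster_ids))
    n_val = int(fractions[1] * len(cluster_ids))

    split_of_cluster = {}
    for i, cluster_id in enumerate(cluster_ids):
        if i < n_train:
            split_of_cluster[cluster_id] = "train"
        elif i < n_train + n_val:
            split_of_cluster[cluster_id] = "val"
        else:
            split_of_cluster[cluster_id] = "test"

    # Every sequence inherits the split of its cluster.
    return {seq_id: split_of_cluster[cid] for seq_id, cid in seq_to_cluster.items()}
```

Because the cut is made over clusters, the resulting sequence-level proportions only approximate the requested fractions when cluster sizes vary.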

Classifier Training: XGBoost classifiers are trained on ESM-2 embeddings to predict antimicrobial activity (generic and organism-specific), haemolysis, and cytotoxicity.

# Main classifiers
python scripts/classifiers/train_classifiers.py \
    --config configs/classifiers/antimicrobial_activity_generic_xgboost.yaml
python scripts/classifiers/train_classifiers.py \
    --config configs/classifiers/haemolysis_xgboost.yaml
python scripts/classifiers/train_classifiers.py \
    --config configs/classifiers/cytotoxicity_xgboost.yaml

For organism-specific antimicrobial activity classifiers, we subset the activity dataset to assays tested against the organism of interest. Training proceeds in the same way.

Model Training: A time-conditioned CNN is trained to estimate transition probabilities along a mixture path that evolves sequences from a uniform distribution towards the data distribution through single-position edits. Training minimises the generalised KL divergence between the teacher posterior (the conditional distribution over single-position edits given the current sequence and time) and the model's predicted edit distribution. Once trained, the model generates novel peptides by starting from uniform noise and applying its predicted edits step by step.

# Unconditional training
python scripts/dfm/ampdfm_unconditional.py \
    --config configs/flow_matching/ampdfm_unconditional.yaml

# Conditional fine-tuning (optional)
python scripts/dfm/ampdfm_conditional_finetune.py \
    --config configs/flow_matching/ampdfm_conditional_finetune.yaml

# Unguided sampling
python scripts/dfm/ampdfm_uncond_sample.py \
    --config configs/flow_matching/ampdfm_uncond_sample.yaml
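
To make the mixture path concrete: at time t, each position of a training sequence is independently taken from the data with probability kappa(t) and otherwise from a uniform-noise sequence. The sketch below assumes a linear scheduler kappa(t) = t and a 20-letter amino-acid vocabulary. Both are illustrative assumptions; the scheduler and vocabulary actually used are defined in the repository's configs and source.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter vocabulary

def sample_mixture_path(x1, t, rng):
    """Sample a noised sequence x_t on the mixture path between noise and data.

    x1: data sequence (string of residues).
    t: time in [0, 1]; t = 0 gives pure uniform noise, t = 1 returns the data.
    rng: random.Random instance, passed in for reproducibility.
    """
    # x0 is a uniform-noise sequence of the same length as the data sequence.
    x0 = [rng.choice(AMINO_ACIDS) for _ in x1]
    # With the assumed scheduler kappa(t) = t, keep the data token with probability t.
    return "".join(a1 if rng.random() < t else a0 for a0, a1 in zip(x0, x1))
```

During training, pairs of (x_t, t) produced this way would be fed to the network, which is asked to predict the clean token at each position.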

Multi-Objective Guidance: During sampling, classifiers score single-position edit candidates across the three objectives. Proposals are reweighted using importance weights, penalised for homopolymer formation, and sampled via Euler jumps weighted by the guided transition rates.

python scripts/mog/ampdfm_mog.py --config configs/mog/ampdfm_mog_generic.yaml

For organism-specific variants (ecoli, paeruginosa, saureus), we simply call the organism-specific activity classifier for scoring instead.
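
The guidance step described above can be sketched at a single sequence position as follows. This is a simplified illustration, not the code in ampdfm_mog.py: the exponential form of the reweighting, the `max_run` threshold, and all function names are assumptions, and the classifier-derived `improvements` are taken as given inputs.

```python
import math
import random

def longest_run(seq):
    """Length of the longest stretch of identical consecutive residues."""
    best = run = 1 if seq else 0
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def guided_rates(seq, pos, base_rates, improvements, beta=2.0, gamma=1.0, max_run=2):
    """Reweight the model's transition rates at one position.

    base_rates: dict token -> rate predicted by the flow model.
    improvements: dict token -> weighted objective gain (assumed to come from the classifiers).
    gamma: homopolymer penalty strength applied per residue of run length beyond max_run.
    """
    rates = {}
    for token, rate in base_rates.items():
        candidate = seq[:pos] + token + seq[pos + 1:]
        excess = max(0, longest_run(candidate) - max_run)
        rates[token] = rate * math.exp(beta * improvements[token] - gamma * excess)
    return rates

def euler_jump(seq, pos, rates, h, rng):
    """One Euler step of size h: jump with probability 1 - exp(-h * total rate)."""
    total = sum(rates.values())
    if total <= 0 or rng.random() >= 1.0 - math.exp(-h * total):
        return seq  # no jump at this step
    tokens = list(rates)
    token = rng.choices(tokens, weights=[rates[t] for t in tokens])[0]
    return seq[:pos] + token + seq[pos + 1:]
```

In this sketch a larger beta sharpens the preference for edits the classifiers favour, while a larger gamma suppresses edits that would extend a run of identical residues.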

Generation Parameters

The generation process can be customised through adjustment of the following parameters:

Config file parameters:

amp_variant: Target organism (generic, ecoli, paeruginosa, saureus)
importance: Weighting for each objective [antimicrobial, haemolysis, cytotoxicity]
homopolymer_gamma: Penalty strength for homopolymer sequences to avoid repetitive patterns
n_samples: Total number of peptides to generate
n_batches: Number of batches to split generation into
len_min and len_max: Peptide length range
seq_length (optional): Fixes sequence length (overrides len_min/len_max when set)

Command-line options:

--T: Number of sampling steps
--beta: Guidance reweighting scale
--lambda_: Trade‑off for directional score vs average rank
--Phi_init, --Phi_min, --Phi_max: Hypercone angle (radians)
--tau, --alpha_r, --eta: Adaptation controls for the hypercone angle (EMA target and update rate)
--num_div: Simplex discretisation for importance weight vectors
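
As an illustration of what a simplex discretisation looks like for three objectives, the sketch below enumerates every weight vector whose entries are multiples of 1/num_div and sum to 1. How ampdfm_mog.py actually builds and uses this grid is not documented here, so the function name and the construction are assumptions.

```python
def simplex_grid(num_div, n_objectives=3):
    """All weight vectors with entries in {0, 1/num_div, ..., 1} that sum to 1."""
    def compositions(total, parts):
        # Enumerate all ways to write `total` as an ordered sum of `parts` non-negative integers.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    return [tuple(k / num_div for k in comp)
            for comp in compositions(num_div, n_objectives)]
```

For three objectives the grid has (num_div + 1)(num_div + 2) / 2 points, so num_div = 64 gives 2145 candidate weight vectors.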

Example Usage:

python scripts/mog/ampdfm_mog.py \
  --config configs/mog/ampdfm_mog_generic.yaml \
  --T 150 \
  --beta 2.0 \
  --lambda_ 1.0 \
  --Phi_init 0.785 \
  --Phi_min 0.262 \
  --Phi_max 1.309 \
  --tau 0.3 \
  --alpha_r 0.5 \
  --eta 1.0 \
  --num_div 64
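
Since the hypercone angles are given in radians, it can be easier to think in degrees and convert. The example values above correspond to roughly 45, 15, and 75 degrees. The helper below is a hypothetical convenience for producing flag values and is not part of the repository.

```python
import math

def angle_flag(degrees):
    """Format an angle given in degrees as a radian value with three decimals."""
    return f"{math.radians(degrees):.3f}"

for name, degrees in [("Phi_init", 45), ("Phi_min", 15), ("Phi_max", 75)]:
    print(f"--{name} {angle_flag(degrees)}")
```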

Outputs

Checkpoints

DFM model checkpoints are saved under checkpoints/dfm/:
- Unconditional model: checkpoints/dfm/ampdfm_unconditional_epoch200.ckpt
- Conditional fine-tuned model: checkpoints/dfm/ampdfm_conditional_finetuned.ckpt
Classifier checkpoints are saved under checkpoints/classifiers/:
- Antimicrobial activity (organism-specific): checkpoints/classifiers/antimicrobial_activity/<variant>/model.json with metadata.pkl
- Haemolysis and cytotoxicity: checkpoints/classifiers/<task>/model.json with metadata.pkl

Peptides

Generated peptides are saved as FASTA and CSV files; the CSV files contain the scores/probabilities provided by the classifiers:

Guided (MOG):
- FASTA: outputs/peptides/<variant>/<run_name>.fa
- CSV: outputs/peptides/<variant>/<run_name>_scores.csv
Unguided:
- FASTA: outputs/peptides/unguided/unconditional_samples.fa
- CSV: outputs/peptides/unguided/unconditional_samples_scores.csv
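
For downstream analysis, the FASTA output can be loaded with a few lines of standard-library Python. This is a generic reader sketch, not code from the repository, and it assumes conventional FASTA formatting (a ">" header line followed by one or more sequence lines).

```python
def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Flush the final record.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The column names of the accompanying CSV are not listed in this README, so it is safest to inspect its header (for example with csv.DictReader) before relying on specific fields.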

Citation

The code for this repo and the generative model is largely based on the work of Chen et al. and Lipman et al. The design of the antimicrobial activity classifiers is based on the work of Soares et al. and Szymczak et al. The design of the haemolysis classifier is based on the work of Capecchi et al.

If this code is of any use, you may be interested in the relevant papers:

@misc{chen2025multiobjectiveguideddiscreteflowmatching,
      title={Multi-Objective-Guided Discrete Flow Matching for Controllable Biological Sequence Design}, 
      author={Tong Chen and Yinuo Zhang and Sophia Tang and Pranam Chatterjee},
      year={2025},
      eprint={2505.07086},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.07086}
}

@misc{lipman2024flowmatchingguidecode,
      title={Flow Matching Guide and Code}, 
      author={Yaron Lipman and Marton Havasi and Peter Holderrieth and Neta Shaul and Matt Le and Brian Karrer and Ricky T. Q. Chen and David Lopez-Paz and Heli Ben-Hamu and Itai Gat},
      year={2024},
      eprint={2412.06264},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.06264}
}

@misc{soares2025targetedampgenerationcontrolled,
      title={Targeted AMP generation through controlled diffusion with efficient embeddings}, 
      author={Diogo Soares and Leon Hetzel and Paulina Szymczak and Fabian Theis and Stephan Günnemann and Ewa Szczurek},
      year={2025},
      eprint={2504.17247},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.17247}
}

@article{szymczak2023discovering,
      title={Discovering highly potent antimicrobial peptides with deep generative model HydrAMP},
      author={Szymczak, Paulina and Możejko, Marcin and Grzegorzek, Tomasz and Bauer, Radosław and Neubauer, Damian and Michalski, Mateusz and Sroka, Jacek and Setny, Piotr and Kamysz, Wojciech and Szczurek, Ewa},
      journal={Nature Communications},
      volume={14},
      number={1},
      pages={1453},
      year={2023},
      publisher={Nature Publishing Group},
      doi={10.1038/s41467-023-36994-z},
      url={https://doi.org/10.1038/s41467-023-36994-z}
}

@article{capecchi2021machine,
      title={Machine learning designs non-hemolytic antimicrobial peptides},
      author={Capecchi, Alice and Cai, Xingguang and Personne, Hippolyte and Köhler, Thilo and van Delden, Christian and Reymond, Jean-Louis},
      journal={Chemical Science},
      year={2021},
      volume={12},
      number={26},
      pages={9221--9232},
      publisher={The Royal Society of Chemistry},
      doi={10.1039/D1SC01713F},
      url={http://dx.doi.org/10.1039/D1SC01713F}
}
