This repository contains the code and data for designing peptide sequences that bind to microplastics using deep learning and optimization techniques. The research focuses on two main tasks: (1) designing promiscuous peptides that bind to all plastic types, and (2) designing selective peptides that bind preferentially to specific plastics (e.g., PP over PET).
Microplastics pose significant environmental and health challenges. This project addresses this issue by computationally designing peptide sequences that can bind to different types of microplastics, including:
- PP (Polypropylene)
- PET (Polyethylene terephthalate)
- PE (Polyethylene)
- PVC (Polyvinyl chloride)
- Nylon
The binding affinity is quantified using PepBD scores, where lower values indicate stronger binding.
LSTM-SA/
├── Data/ # Raw peptide sequence data
│ ├── PP.csv, PET.csv, PE.csv, etc. # CSV files with sequences and PepBD scores
├── onehot_train_val_test/ # Preprocessed one-hot encoded data
├── ScoreModel_Results_onehot/ # Trained model results and checkpoints
├── simulated_annealing/ # Simulated annealing optimization results
├── All_sequences_analysis/ # Compiled scoring results
├── PepBD_surrogate.py # Main model training and evaluation code
├── simulated_annealing.py # Simulated annealing optimization algorithm
├── run_simulated_annealing.py # Script to run optimization experiments
├── run_parallel_sa.py # Parallel execution of multiple SA runs
├── utils.py # Utility functions for data processing
├── Calc_Peptide_Properties.py # Physicochemical property calculations
├── all_sequences_analysis.py # Sequence scoring and analysis
├── test_models.py # Model testing and validation
└── plot_results.ipynb # Results visualization and analysis
- Raw CSV files containing peptide sequences (12-mers) and their PepBD binding scores
- Each plastic type has its own dataset with varying sample sizes (114K-572K sequences)
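As a minimal sketch of working with one of these files, the snippet below reads sequence/score pairs and ranks them so that the strongest binders (lowest PepBD scores) come first. The column names `sequence` and `score` are assumptions, and the inline rows use invented values purely for illustration; check the header of the real CSV before adapting it.

```python
import csv
import io

# Toy stand-in for a file such as Data/PP.csv. The column names
# "sequence" and "score" are assumptions, and the values are invented.
TOY_CSV = """sequence,score
MRHHRIWTAWMW,-41.2
AAAAAAAAAAAA,-12.5
WWKKRRHHFFYY,-38.9
"""


def load_scores(handle):
    """Read (sequence, score) pairs, keeping only 12-mers."""
    rows = []
    for row in csv.DictReader(handle):
        seq = row["sequence"].strip().upper()
        if len(seq) == 12:
            rows.append((seq, float(row["score"])))
    return rows


def strongest_binders(rows, k=2):
    """Lower PepBD score means stronger binding, so sort ascending."""
    return sorted(rows, key=lambda pair: pair[1])[:k]


rows = load_scores(io.StringIO(TOY_CSV))
top = strongest_binders(rows)
print(top)
```

To use a real file, replace `io.StringIO(TOY_CSV)` with `open("Data/PP.csv", newline="")` and adjust the column names as needed.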
The core module implements multiple neural network architectures:
- LSTM: Long Short-Term Memory networks
- BiLSTM: Bidirectional LSTM
- GRU: Gated Recurrent Unit
- RNN: Vanilla Recurrent Neural Network
- Transformer: Attention-based architecture
Implements simulated annealing for peptide sequence optimization:
- Move operators: Amino acid substitution and position swapping
- Objective functions:
  - Promiscuous binding (average score across all plastics)
  - Selective binding (PP over PET: PP_score - PET_score)
  - Single plastic optimization
- Cooling schedules: Exponential, linear, logarithmic
- Parallel execution: Multiple optimization runs
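The components listed above can be sketched in a few dozen lines of plain Python. This is an illustrative sketch, not the code in `simulated_annealing.py`: `toy_score` is a hypothetical stand-in for the trained surrogate models, and the function names, decay rate, step count, and 50/50 move split are all assumptions. Because lower scores mean stronger binding, every objective is minimized.

```python
import math
import random

ALPHABET = "ADEFGHIKLMNQRSTVWY"  # 18 residues used by the encoding
PLASTICS = ("PP", "PET", "PE", "PVC", "Nylon")


def toy_score(seq, plastic):
    """Hypothetical stand-in for a surrogate model; NOT a real PepBD score."""
    offset = PLASTICS.index(plastic) + 1
    return -sum((ALPHABET.index(aa) * offset) % 7 for aa in seq) / len(seq)


def promiscuous(seq):
    """Average score across all plastics."""
    return sum(toy_score(seq, p) for p in PLASTICS) / len(PLASTICS)


def selective_pp_over_pet(seq):
    """PP_score - PET_score: negative when PP binding is stronger."""
    return toy_score(seq, "PP") - toy_score(seq, "PET")


def substitute(seq, rng):
    """Replace one residue with a different amino acid."""
    i = rng.randrange(len(seq))
    new_aa = rng.choice([aa for aa in ALPHABET if aa != seq[i]])
    return seq[:i] + new_aa + seq[i + 1:]


def swap(seq, rng):
    """Exchange the residues at two distinct positions."""
    i, j = rng.sample(range(len(seq)), 2)
    chars = list(seq)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def temperature(step, n_steps, t0, schedule):
    """Exponential, linear, or logarithmic cooling (constants are assumed)."""
    if schedule == "exponential":
        return t0 * 0.995 ** step
    if schedule == "linear":
        return max(t0 * (1 - step / n_steps), 1e-9)
    if schedule == "logarithmic":
        return t0 / math.log(step + math.e)
    raise ValueError(f"unknown schedule: {schedule}")


def anneal(start, objective, n_steps=2000, t0=1.0,
           schedule="exponential", seed=0):
    """Minimize `objective` with Metropolis acceptance."""
    rng = random.Random(seed)
    current, current_val = start, objective(start)
    best, best_val = current, current_val
    for step in range(n_steps):
        move = substitute if rng.random() < 0.5 else swap
        candidate = move(current, rng)
        cand_val = objective(candidate)
        delta = cand_val - current_val
        t = temperature(step, n_steps, t0, schedule)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = current, current_val
    return best, best_val


start = "MRHHRIWTAWMW"
for name, objective in (("promiscuous", promiscuous),
                        ("PP over PET", selective_pp_over_pet)):
    best, value = anneal(start, objective)
    print(name, best, round(value, 3))
```

Running several seeds and keeping the best result of each mirrors the idea of multiple independent optimization runs; single-plastic optimization corresponds to passing `lambda s: toy_score(s, "PP")` as the objective.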
- all_sequences_analysis.py: Compiles binding scores across all plastics
- Calc_Peptide_Properties.py: Calculates physicochemical properties (charge, mass, solubility, patchiness)
- plot_results.ipynb: Comprehensive visualization and analysis
pip install torch numpy pandas scikit-learn matplotlib seaborn tqdm
- Ensure CSV files are in the Data/ directory
- Run data preprocessing to generate one-hot encodings
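For orientation, the one-hot encoding step can be sketched as follows. This is a minimal illustration of the idea, assuming the 18-letter alphabet and 12-residue length this README specifies; the function name `one_hot` is hypothetical and the repository's own preprocessing may differ in detail.

```python
ALPHABET = "ADEFGHIKLMNQRSTVWY"  # 18 residues listed in this README
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
SEQ_LEN = 12


def one_hot(sequence):
    """Encode a 12-mer as a 12 x 18 matrix of 0/1 values."""
    if len(sequence) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} residues, got {len(sequence)}")
    matrix = []
    for aa in sequence:
        if aa not in INDEX:
            raise ValueError(f"residue {aa!r} is not in the alphabet")
        row = [0] * len(ALPHABET)
        row[INDEX[aa]] = 1  # exactly one position is set per residue
        matrix.append(row)
    return matrix


encoded = one_hot("MRHHRIWTAWMW")
print(len(encoded), len(encoded[0]))
```

Stacking such matrices for a batch of peptides gives an array of shape (batch_size, 12, 18).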
# Train LSTM model for PP
python PepBD_surrogate.py --plastic_type PP --model_type lstm --epochs 500
# Train multiple models in parallel
parallel -j 12 < jobs.txt # 12 jobs
# Promiscuous peptide design (bind to all plastics)
python run_simulated_annealing.py \
--csv-file Data/PP.csv \
--plastic-type All \
--n-samples 5 \
--session-name promiscuous_run
# Selective peptide design (PP over PET)
python run_simulated_annealing.py \
--csv-file Data/PP.csv \
--plastic-type PP-PET \
--n-samples 5 \
--session-name selective_run
# Parallel execution
python run_parallel_sa.py
# Score PepBD sequences across all plastics
python all_sequences_analysis.py \
--task_type all_plastics \
--sequence_type pepbd
# Score generated sequences
python all_sequences_analysis.py \
--task_type PP_PET \
--sequence_type generated \
--path simulated_annealing/Results/...
from Calc_Peptide_Properties import calc_properties
# Calculate properties for a peptide sequence
properties = calc_properties("MRHHRIWTAWMW")
print(f"Charge: {properties['charge']}")
print(f"Mass: {properties['mass']}")
print(f"CamSol Score: {properties['camsol']}")
- Sequence length: 12 amino acids
- Encoding: One-hot encoding (18 amino acids: ADEFGHIKLMNQRSTVWY)
- Input shape: (batch_size, 12, 18)
- LSTM/BiLSTM/GRU: Hidden dimensions 128-512, 1-3 layers
- Transformer: 4-8 attention heads, 2-6 layers
- Output: Single regression value (PepBD score)
- Optimizer: Adam (learning rate: 0.001)
- Loss: Mean Squared Error
- Batch size: 32
- Early stopping: Patience of 10 epochs
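The early-stopping rule and the MSE loss can be illustrated without any deep learning framework. The sketch below assumes that "patience of 10" means stopping after 10 consecutive epochs without a new best validation loss; the `EarlyStopping` class and the simulated loss values are hypothetical and are not taken from `PepBD_surrogate.py`.

```python
class EarlyStopping:
    """Stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


# Simulated validation losses: five improving epochs, then a plateau.
losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 20
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In a real training loop, the model checkpoint would typically be saved whenever `step` records a new best loss, so the best weights can be restored after stopping.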
The optimization process generates:
- Optimized sequences with improved binding scores
- Trajectory plots showing optimization progress
- Sequence analysis including uniqueness and diversity metrics
- Cross-plastic scoring for promiscuous and selective peptides
Results are saved in simulated_annealing/Results/ with detailed logging and visualization.
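This README does not define the uniqueness and diversity metrics, so the following is only one plausible reading: uniqueness as the fraction of distinct sequences, and diversity as the mean pairwise Hamming distance among the distinct sequences. Both definitions and all function names are assumptions for illustration.

```python
from itertools import combinations


def uniqueness(seqs):
    """Fraction of sequences that are distinct."""
    return len(set(seqs)) / len(seqs)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(seqs):
    """Average Hamming distance over all pairs of distinct sequences."""
    unique = sorted(set(seqs))
    pairs = list(combinations(unique, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical optimizer output containing one duplicate.
designs = ["MRHHRIWTAWMW", "MRHHRIWTAWMW", "MRHHRIWTAWMF", "WRHHRIWTAWMM"]
print(uniqueness(designs))
print(mean_pairwise_hamming(designs))
```

A low mean distance suggests that independent runs converged on near-identical sequences, which may be worth checking before selecting candidates for follow-up.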
If you use this code in your research, please cite our work:
@article{peptide_design_2025,
title={AI-Driven Rational Design of Promiscuous and Selective Plastic-Binding Peptides},
author={Vinamr Jain and Michael Bergman and Carol Hall and Fengqi You},
journal={Chemical Science},
year={2025}
}
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions or issues, please open an issue on GitHub or contact vj89@cornell.edu.