Peptide Design for Microplastic Binding: LSTM-Guided Simulated Annealing

This repository contains the code and data for designing peptide sequences that bind to microplastics using deep learning and optimization techniques. The research focuses on two main tasks: (1) designing promiscuous peptides that bind to all plastic types, and (2) designing selective peptides that bind preferentially to specific plastics (e.g., PP over PET).

Research Overview

Microplastics pose significant environmental and health challenges. This project addresses this issue by computationally designing peptide sequences that can bind to different types of microplastics, including:

PP (Polypropylene)
PET (Polyethylene terephthalate)
PE (Polyethylene)
PVC (Polyvinyl chloride)
Nylon

The binding affinity is quantified using PepBD scores, where lower values indicate stronger binding.

Project Structure

LSTM-SA/
├── Data/                           # Raw peptide sequence data
│   └── PP.csv, PET.csv, PE.csv, etc.  # CSV files with sequences and PepBD scores
├── onehot_train_val_test/          # Preprocessed one-hot encoded data
├── ScoreModel_Results_onehot/      # Trained model results and checkpoints
├── simulated_annealing/            # Simulated annealing optimization results
├── All_sequences_analysis/         # Compiled scoring results
├── PepBD_surrogate.py             # Main model training and evaluation code
├── simulated_annealing.py         # Simulated annealing optimization algorithm
├── run_simulated_annealing.py     # Script to run optimization experiments
├── run_parallel_sa.py             # Parallel execution of multiple SA runs
├── utils.py                       # Utility functions for data processing
├── Calc_Peptide_Properties.py     # Physicochemical property calculations
├── all_sequences_analysis.py      # Sequence scoring and analysis
├── test_models.py                 # Model testing and validation
└── plot_results.ipynb             # Results visualization and analysis

Key Components

1. Data Processing (`Data/`)

Raw CSV files containing peptide sequences (12-mers) and their PepBD binding scores
Each plastic type has its own dataset with varying sample sizes (114K-572K sequences)
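
As a rough illustration of working with these files, the sketch below reads sequence/score pairs and keeps only valid 12-mers. The column names `sequence` and `score` and the sample values are assumptions made for this example, not the repository's actual schema or data; the 18-letter alphabet is the one listed under Input Processing.

```python
import csv
import io

ALPHABET = set("ADEFGHIKLMNQRSTVWY")  # 18 residues used for the one-hot encoding
PEPTIDE_LENGTH = 12


def load_peptides(handle, seq_col="sequence", score_col="score"):
    """Read (sequence, score) pairs, skipping rows that are not valid 12-mers.

    The column names are hypothetical; adjust them to the real CSV headers.
    """
    records = []
    for row in csv.DictReader(handle):
        seq = row[seq_col].strip().upper()
        if len(seq) != PEPTIDE_LENGTH or not set(seq) <= ALPHABET:
            continue
        records.append((seq, float(row[score_col])))
    return records


# Made-up rows standing in for a file such as Data/PP.csv.
sample = io.StringIO(
    "sequence,score\n"
    "MRHHRIWTAWMW,-41.5\n"
    "AAAAAAAAAAAA,-12.0\n"
    "SHORT,-99.0\n"
)
records = load_peptides(sample)

# Lower PepBD scores indicate stronger binding, so the best binder is the minimum.
best_seq, best_score = min(records, key=lambda r: r[1])
print(best_seq, best_score)
```

To read a real file, pass `open("Data/PP.csv", newline="")` instead of the in-memory sample, after checking the header names.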

2. Model Training (`PepBD_surrogate.py`)

The core module implements multiple neural network architectures:

LSTM: Long Short-Term Memory networks
BiLSTM: Bidirectional LSTM
GRU: Gated Recurrent Unit
RNN: Vanilla Recurrent Neural Network
Transformer: Attention-based architecture

3. Optimization (`simulated_annealing.py`)

Implements simulated annealing for peptide sequence optimization:

Move operators: Amino acid substitution and position swapping
Objective functions:
- Promiscuous binding (average score across all plastics)
- Selective binding (PP over PET: minimize PP_score - PET_score, since lower scores mean stronger binding)
- Single plastic optimization
Cooling schedules: Exponential, linear, logarithmic
Parallel execution: Multiple optimization runs
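
The components listed above can be sketched in a few dozen lines. This is a minimal illustration, not the code in `simulated_annealing.py`: `toy_scores` is a hypothetical stand-in for the trained surrogate models, and the function names, default temperatures, 50/50 move mix, and Metropolis acceptance rule are assumptions chosen for the example.

```python
import math
import random

ALPHABET = "ADEFGHIKLMNQRSTVWY"
PLASTICS = ("PP", "PET", "PE", "PVC", "Nylon")


def toy_scores(seq):
    """Hypothetical surrogate: a fake score per plastic (lower = stronger binding)."""
    # Illustration only: pretend each plastic favors one residue type.
    liked = dict(zip(PLASTICS, "WFLYK"))
    return {plastic: -float(seq.count(aa)) for plastic, aa in liked.items()}


def promiscuous(scores):
    """Average score across all plastics."""
    return sum(scores.values()) / len(scores)


def selective_pp_over_pet(scores):
    """PP_score - PET_score; minimizing it favors PP binding over PET."""
    return scores["PP"] - scores["PET"]


def propose(seq, rng):
    """Apply one move: amino acid substitution or position swap."""
    residues = list(seq)
    if rng.random() < 0.5:
        i = rng.randrange(len(residues))
        residues[i] = rng.choice([a for a in ALPHABET if a != residues[i]])
    else:
        i, j = rng.sample(range(len(residues)), 2)
        residues[i], residues[j] = residues[j], residues[i]
    return "".join(residues)


def temperature(step, n_steps, t0=1.0, t_end=0.01, schedule="exponential"):
    """Temperature at a given step for three common cooling schedules."""
    frac = step / max(n_steps - 1, 1)
    if schedule == "exponential":
        return t0 * (t_end / t0) ** frac
    if schedule == "linear":
        return t0 + (t_end - t0) * frac
    if schedule == "logarithmic":
        return t0 / (1.0 + math.log(1.0 + step))
    raise ValueError(f"unknown schedule: {schedule}")


def anneal(start, objective, n_steps=2000, seed=0, schedule="exponential"):
    """Minimize objective(toy_scores(seq)) with Metropolis acceptance."""
    rng = random.Random(seed)
    current, current_e = start, objective(toy_scores(start))
    best, best_e = current, current_e
    for step in range(n_steps):
        t = temperature(step, n_steps, schedule=schedule)
        cand = propose(current, rng)
        cand_e = objective(toy_scores(cand))
        delta = cand_e - current_e
        # Always accept improvements; accept worse moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_e = cand, cand_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


best, energy = anneal("MRHHRIWTAWMW", selective_pp_over_pet)
print(best, energy)
```

Swapping `selective_pp_over_pet` for `promiscuous` changes the design task without touching the search loop, which is the main reason to keep the objective separate from the move and cooling logic.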

4. Analysis Tools

all_sequences_analysis.py: Compiles binding scores across all plastics
Calc_Peptide_Properties.py: Calculates physicochemical properties (charge, mass, solubility, patchiness)
plot_results.ipynb: Comprehensive visualization and analysis

Installation and Setup

Prerequisites

pip install torch numpy pandas scikit-learn matplotlib seaborn tqdm

Data Preparation

Ensure CSV files are in the Data/ directory
Run data preprocessing to generate the one-hot encodings stored in onehot_train_val_test/

Usage

1. Model Training

# Train LSTM model for PP
python PepBD_surrogate.py --plastic_type PP --model_type lstm --epochs 500

# Train multiple models in parallel
parallel -j 12 < jobs.txt  # GNU parallel: run the commands in jobs.txt, up to 12 at a time

2. Simulated Annealing Optimization

# Promiscuous peptide design (bind to all plastics)
python run_simulated_annealing.py \
    --csv-file Data/PP.csv \
    --plastic-type All \
    --n-samples 5 \
    --session-name promiscuous_run

# Selective peptide design (PP over PET)
python run_simulated_annealing.py \
    --csv-file Data/PP.csv \
    --plastic-type PP-PET \
    --n-samples 5 \
    --session-name selective_run

# Parallel execution
python run_parallel_sa.py

3. Sequence Analysis

# Score PepBD sequences across all plastics
python all_sequences_analysis.py \
    --task_type all_plastics \
    --sequence_type pepbd

# Score generated sequences
python all_sequences_analysis.py \
    --task_type PP_PET \
    --sequence_type generated \
    --path simulated_annealing/Results/...

4. Property Calculation

from Calc_Peptide_Properties import calc_properties

# Calculate properties for a peptide sequence
properties = calc_properties("MRHHRIWTAWMW")
print(f"Charge: {properties['charge']}")
print(f"Mass: {properties['mass']}")
print(f"CamSol Score: {properties['camsol']}")

Model Architecture Details

Input Processing

Sequence length: 12 amino acids
Encoding: One-hot encoding (18 amino acids: ADEFGHIKLMNQRSTVWY)
Input shape: (batch_size, 12, 18)
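
The encoding described above can be sketched in plain Python as follows. The column order (the alphabet string as written) and the function name `one_hot` are assumptions for this example; the repository's preprocessing may order the residues differently.

```python
ALPHABET = "ADEFGHIKLMNQRSTVWY"  # 18 residues, in the order given above
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seq, length=12):
    """Encode a peptide as a length x 18 matrix with a single 1.0 per row."""
    if len(seq) != length:
        raise ValueError(f"expected a {length}-mer, got {len(seq)} residues")
    matrix = [[0.0] * len(ALPHABET) for _ in range(length)]
    for pos, aa in enumerate(seq):
        # Raises KeyError for residues outside the 18-letter alphabet.
        matrix[pos][AA_INDEX[aa]] = 1.0
    return matrix


encoded = one_hot("MRHHRIWTAWMW")
batch = [encoded]  # nested lists with shape (batch_size=1, 12, 18)
print(len(batch), len(batch[0]), len(batch[0][0]))
```

Passing `batch` to `torch.tensor(batch)` would give a float tensor of shape `(1, 12, 18)`, matching the input shape listed above.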

Network Architectures

LSTM/BiLSTM/GRU: Hidden dimensions 128-512, 1-3 layers
Transformer: 4-8 attention heads, 2-6 layers
Output: Single regression value (PepBD score)

Training Parameters

Optimizer: Adam (learning rate: 0.001)
Loss: Mean Squared Error
Batch size: 32
Early stopping: Patience of 10 epochs
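
Early stopping with a patience of 10 epochs can be sketched independently of the training framework. The class name, the `min_delta` parameter, and the validation losses below are assumptions for illustration; the actual logic in `PepBD_surrogate.py` may differ.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation MSE values: three improving epochs, then a plateau.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 20

stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In a real training loop the model checkpoint would typically be saved whenever `stopper.best` improves, so the best weights can be restored after stopping.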

Results and Analysis

The optimization process generates:

Optimized sequences with improved binding scores
Trajectory plots showing optimization progress
Sequence analysis including uniqueness and diversity metrics
Cross-plastic scoring for promiscuous and selective peptides

Results are saved in simulated_annealing/Results/ with detailed logging and visualization.
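
The uniqueness and diversity metrics are not defined in this README. One plausible reading, shown here as an assumption, is the fraction of distinct sequences and the mean pairwise Hamming distance between distinct designs. The function names and example sequences are made up for illustration.

```python
from itertools import combinations


def uniqueness(seqs):
    """Fraction of sequences that are distinct."""
    return len(set(seqs)) / len(seqs)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(seqs):
    """Average Hamming distance over all pairs of distinct sequences."""
    pairs = list(combinations(sorted(set(seqs)), 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical optimizer output containing one duplicate.
designs = ["MRHHRIWTAWMW", "MRHHRIWTAWMW", "MRHHRIWTAWMF", "WRHHRIWTAWMM"]

unique_fraction = uniqueness(designs)
diversity = mean_pairwise_hamming(designs)
print(unique_fraction, diversity)
```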

Citation

If you use this code in your research, please cite our work:

@article{peptide_design_2025,
  title={AI-Driven Rational Design of Promiscuous and Selective Plastic-Binding Peptides},
  author={Vinamr Jain and Michael Bergman and Carol Hall and Fengqi You},
  journal={Chemical Science},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Contact

For questions or issues, please open an issue on GitHub or contact vj89@cornell.edu.
