PEESEgroup/LSTM-SA

Peptide Design for Microplastic Binding: LSTM-Guided Simulated Annealing

This repository contains the code and data for designing peptide sequences that bind to microplastics, using deep learning surrogate models of the PepBD score combined with simulated annealing. The work addresses two design tasks: (1) promiscuous peptides that bind to all plastic types, and (2) selective peptides that bind preferentially to one plastic over another (e.g., PP over PET).

Research Overview

Microplastics pose significant environmental and health challenges. This project addresses this issue by computationally designing peptide sequences that can bind to different types of microplastics, including:

  • PP (Polypropylene)
  • PET (Polyethylene terephthalate)
  • PE (Polyethylene)
  • PVC (Polyvinyl chloride)
  • Nylon

The binding affinity is quantified using PepBD scores, where lower values indicate stronger binding.

Project Structure

LSTM-SA/
├── Data/                           # Raw peptide sequence data
│   ├── PP.csv, PET.csv, PE.csv, etc.  # CSV files with sequences and PepBD scores
├── onehot_train_val_test/          # Preprocessed one-hot encoded data
├── ScoreModel_Results_onehot/      # Trained model results and checkpoints
├── simulated_annealing/            # Simulated annealing optimization results
├── All_sequences_analysis/         # Compiled scoring results
├── PepBD_surrogate.py             # Main model training and evaluation code
├── simulated_annealing.py         # Simulated annealing optimization algorithm
├── run_simulated_annealing.py     # Script to run optimization experiments
├── run_parallel_sa.py             # Parallel execution of multiple SA runs
├── utils.py                       # Utility functions for data processing
├── Calc_Peptide_Properties.py     # Physicochemical property calculations
├── all_sequences_analysis.py      # Sequence scoring and analysis
├── test_models.py                 # Model testing and validation
└── plot_results.ipynb             # Results visualization and analysis

Key Components

1. Data Processing (Data/)

  • Raw CSV files containing peptide sequences (12-mers) and their PepBD binding scores
  • Each plastic type has its own dataset with varying sample sizes (114K-572K sequences)
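
As a rough illustration of working with these files, the sketch below reads a CSV with Python's standard csv module and keeps only 12-mers built from the 18-letter alphabet listed under Input Processing. The column names `sequence` and `score`, the helper `load_peptides`, and the demo values are assumptions made for this example; the actual headers in Data/*.csv may differ.

```python
import csv
import io

ALPHABET = set("ADEFGHIKLMNQRSTVWY")  # 18-letter alphabet from Input Processing

def load_peptides(handle, seq_col="sequence", score_col="score"):
    """Return (sequence, score) pairs for valid 12-mers.

    Column names are hypothetical; adjust them to the real CSV headers.
    """
    rows = []
    for record in csv.DictReader(handle):
        seq = record[seq_col].strip().upper()
        if len(seq) == 12 and set(seq) <= ALPHABET:
            rows.append((seq, float(record[score_col])))
    return rows

# In-memory stand-in for a file such as Data/PP.csv (scores are made up).
demo = io.StringIO(
    "sequence,score\n"
    "MRHHRIWTAWMW,-40.5\n"   # valid 12-mer
    "ACDEFGHIKLMN,-10.0\n"   # contains C, which is outside the alphabet
    "SHORT,-1.0\n"           # wrong length
)
peptides = load_peptides(demo)
```

With a real file, the same function could be called as `load_peptides(open("Data/PP.csv", newline=""))` once the column names are confirmed.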

2. Model Training (PepBD_surrogate.py)

The core module trains surrogate models that predict the PepBD score from a peptide sequence. It implements several neural network architectures:

  • LSTM: Long Short-Term Memory networks
  • BiLSTM: Bidirectional LSTM
  • GRU: Gated Recurrent Unit
  • RNN: Vanilla Recurrent Neural Network
  • Transformer: Attention-based architecture

3. Optimization (simulated_annealing.py)

Implements simulated annealing for peptide sequence optimization:

  • Move operators: Amino acid substitution and position swapping
  • Objective functions:
    • Promiscuous binding (average score across all plastics)
    • Selective binding (PP over PET: minimize PP_score - PET_score, since lower scores indicate stronger binding)
    • Single plastic optimization
  • Cooling schedules: Exponential, linear, logarithmic
  • Parallel execution: Multiple optimization runs
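
The components listed above can be sketched in a few dozen lines of standard-library Python. This is a minimal illustration, not the code in simulated_annealing.py: the function names, the 50/50 mix of the two move operators, the exact forms of the cooling schedules, and the toy scoring function (which stands in for a trained surrogate) are all assumptions.

```python
import math
import random

ALPHABET = "ADEFGHIKLMNQRSTVWY"

def substitute(seq, rng):
    """Move operator 1: replace one residue with a different amino acid."""
    i = rng.randrange(len(seq))
    new = rng.choice([a for a in ALPHABET if a != seq[i]])
    return seq[:i] + new + seq[i + 1:]

def swap(seq, rng):
    """Move operator 2: exchange the residues at two distinct positions."""
    i, j = rng.sample(range(len(seq)), 2)
    s = list(seq)
    s[i], s[j] = s[j], s[i]
    return "".join(s)

def temperature(step, n_steps, t0, t_end, schedule):
    """Cooling schedules; these particular formulas are assumptions."""
    frac = step / max(n_steps - 1, 1)
    if schedule == "exponential":
        return t0 * (t_end / t0) ** frac
    if schedule == "linear":
        return t0 + (t_end - t0) * frac
    if schedule == "logarithmic":
        return t0 / (1.0 + math.log(1.0 + step))
    raise ValueError(f"unknown schedule: {schedule}")

def anneal(start, score_fn, n_steps=2000, t0=1.0, t_end=0.01,
           schedule="exponential", seed=0):
    """Minimize score_fn (lower = stronger binding) with Metropolis acceptance."""
    rng = random.Random(seed)
    current, current_score = start, score_fn(start)
    best, best_score = current, current_score
    for step in range(n_steps):
        t = temperature(step, n_steps, t0, t_end, schedule)
        move = substitute if rng.random() < 0.5 else swap
        candidate = move(current, rng)
        cand_score = score_fn(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with prob exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, cand_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score

def toy_score(seq):
    """Stand-in objective that rewards tryptophan; NOT a PepBD score."""
    return -float(seq.count("W"))

best_seq, best_val = anneal("MRHHRIWTAWMW", toy_score)
```

In practice `toy_score` would be replaced by a call to the trained surrogate, and the objective would be one of the promiscuous, selective, or single-plastic variants listed above.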

4. Analysis Tools

  • all_sequences_analysis.py: Compiles binding scores across all plastics
  • Calc_Peptide_Properties.py: Calculates physicochemical properties (charge, mass, solubility, patchiness)
  • plot_results.ipynb: Comprehensive visualization and analysis
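
To give a feel for the kind of property involved, here is a deliberately crude net-charge estimate. It is an assumption-laden sketch, not the method used in Calc_Peptide_Properties.py: it counts K and R as +1 and D and E as -1, and ignores histidine, the termini, and pH effects. The name `approx_net_charge` is hypothetical.

```python
POSITIVE = set("KR")  # treated as +1 each (simplification)
NEGATIVE = set("DE")  # treated as -1 each (simplification)

def approx_net_charge(seq):
    """Integer net charge from side-chain counts only (hypothetical helper)."""
    pos = sum(aa in POSITIVE for aa in seq)
    neg = sum(aa in NEGATIVE for aa in seq)
    return pos - neg

charge = approx_net_charge("MRHHRIWTAWMW")  # two R, no K/D/E -> 2
```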

Installation and Setup

Prerequisites

pip install torch numpy pandas scikit-learn matplotlib seaborn tqdm

Data Preparation

  1. Ensure CSV files are in the Data/ directory
  2. Run data preprocessing to generate one-hot encodings

Usage

1. Model Training

# Train LSTM model for PP
python PepBD_surrogate.py --plastic_type PP --model_type lstm --epochs 500

# Train multiple models in parallel
parallel -j 12 < jobs.txt  # 12 jobs

2. Simulated Annealing Optimization

# Promiscuous peptide design (bind to all plastics)
python run_simulated_annealing.py \
    --csv-file Data/PP.csv \
    --plastic-type All \
    --n-samples 5 \
    --session-name promiscuous_run

# Selective peptide design (PP over PET)
python run_simulated_annealing.py \
    --csv-file Data/PP.csv \
    --plastic-type PP-PET \
    --n-samples 5 \
    --session-name selective_run

# Parallel execution
python run_parallel_sa.py
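
The two design objectives selected above with `--plastic-type All` and `--plastic-type PP-PET` can be sketched as plain functions over per-plastic predicted scores. The function names and the example scores are made up for illustration; how the repository combines the surrogate outputs internally is not shown here.

```python
PLASTICS = ["PP", "PET", "PE", "PVC", "Nylon"]

def promiscuous_objective(scores):
    """Average predicted score across all plastics (to be minimized)."""
    return sum(scores[p] for p in PLASTICS) / len(PLASTICS)

def selective_objective(scores, target="PP", off_target="PET"):
    """Target score minus off-target score (to be minimized).

    Lower scores mean stronger binding, so a more negative difference
    means the peptide favors the target plastic more strongly.
    """
    return scores[target] - scores[off_target]

# Made-up surrogate predictions for a single peptide.
example = {"PP": -45.0, "PET": -30.0, "PE": -40.0, "PVC": -35.0, "Nylon": -25.0}

promiscuous = promiscuous_objective(example)  # -35.0
selective = selective_objective(example)      # -15.0
```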

3. Sequence Analysis

# Score PepBD sequences across all plastics
python all_sequences_analysis.py \
    --task_type all_plastics \
    --sequence_type pepbd

# Score generated sequences
python all_sequences_analysis.py \
    --task_type PP_PET \
    --sequence_type generated \
    --path simulated_annealing/Results/...

4. Property Calculation

from Calc_Peptide_Properties import calc_properties

# Calculate properties for a peptide sequence
properties = calc_properties("MRHHRIWTAWMW")
print(f"Charge: {properties['charge']}")
print(f"Mass: {properties['mass']}")
print(f"CamSol Score: {properties['camsol']}")

Model Architecture Details

Input Processing

  • Sequence length: 12 amino acids
  • Encoding: One-hot encoding (18 amino acids: ADEFGHIKLMNQRSTVWY)
  • Input shape: (batch_size, 12, 18)
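
A minimal sketch of this encoding for a single peptide, using nested Python lists so it runs without dependencies. The column order (the alphabet string as written above) and the helper name `one_hot` are assumptions; the repository's preprocessing may order the columns differently.

```python
ALPHABET = "ADEFGHIKLMNQRSTVWY"          # 18 amino acids
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}
SEQ_LEN = 12

def one_hot(seq):
    """Encode a 12-mer as a 12 x 18 matrix of 0/1 values.

    Raises ValueError for a wrong length and KeyError for residues
    outside the 18-letter alphabet.
    """
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} residues, got {len(seq)}")
    return [[1 if INDEX[aa] == j else 0 for j in range(len(ALPHABET))]
            for aa in seq]

encoded = one_hot("MRHHRIWTAWMW")
shape = (len(encoded), len(encoded[0]))   # (12, 18)
```

Stacking such matrices for a batch of peptides gives the (batch_size, 12, 18) input described above.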

Network Architectures

  • LSTM/BiLSTM/GRU: Hidden dimensions 128-512, 1-3 layers
  • Transformer: 4-8 attention heads, 2-6 layers
  • Output: Single regression value (PepBD score)

Training Parameters

  • Optimizer: Adam (learning rate: 0.001)
  • Loss: Mean Squared Error
  • Batch size: 32
  • Early stopping: Patience of 10 epochs
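
The early-stopping rule can be sketched independently of any deep learning framework. This is an illustrative helper, not the code in PepBD_surrogate.py: the class name, the `min_delta` parameter, and the validation losses in the demo are assumptions.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement for three epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In a training loop, `step` would be called once per epoch with the validation MSE, and the checkpoint with the best loss would typically be the one kept.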

Results and Analysis

The optimization process generates:

  1. Optimized sequences with improved binding scores
  2. Trajectory plots showing optimization progress
  3. Sequence analysis including uniqueness and diversity metrics
  4. Cross-plastic scoring for promiscuous and selective peptides

Results are saved in simulated_annealing/Results/ with detailed logging and visualization.

Citation

If you use this code in your research, please cite our work:

@article{peptide_design_2025,
  title={AI-Driven Rational Design of Promiscuous and Selective Plastic-Binding Peptides},
  author={Vinamr Jain and Michael Bergman and Carol Hall and Fengqi You},
  journal={Chemical Science},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Contact

For questions or issues, please open an issue on GitHub or contact vj89@cornell.edu.

About

AI guided rational design of promiscuous and selective binding peptides
